diegodlh / zotero-cita

Cita: a Wikidata addon for Zotero with citations metadata support
GNU General Public License v3.0
222 stars 10 forks source link

Remove base type `work` from untyped reconciliation query #150

Closed Dominic-DallOsto closed 2 years ago

Dominic-DallOsto commented 2 years ago

Fixes #149

More of a hack than a proper fix but at least it gets things working again. I'm not sure creative_work covers all the types we care about?

As a further change - could we send the typed query first, then only send the untyped query if the typed one doesn't return anything?

diegodlh commented 2 years ago

I'm not sure creative_work covers all the types we care about?

For the typed query, wikidata.js's reconcile function uses a Zotero item type → QID map. This map is derived from Zotero's QuickStatements translator. Maybe we could run a SPARQL query in Wikidata to identify the most specific common ancestor (i.e., parent class) for all QIDs in the map. If it is creative_work or something more specific, I think we should be OK with using creative_work. What do you think?

could we send the typed query first, then only send the untyped query if the typed one doesn't return anything?

Probably this was the original idea, as described here. I must have included both in the same request to minimize the number of requests. It should be OK to split them and only make a second untyped request for any query which returned zero exact matches for the typed request.

Dominic-DallOsto commented 2 years ago

I think I correctly wrote a SPARQL query here which takes all the types from the translator and tries to find their common superclasses (the commented out types don't work because they're redirects I think). Unfortunately the only one is entity, which is even less restrictive than work. I think the issue is that things like video recording aren't subclasses of work.

Alternatively, can we run the untyped query with no type in it at all? Then there shouldn't be a problem with openrefine and once #52 is resolved it should be easy for users to disambiguate anyway.

Dominic-DallOsto commented 2 years ago

I just tried the second option and it works nicely.

I benchmarked it, getting QIDs for all 43 items in #141. It took 9.3 s to find exact matches for 29 of these items. So very roughly it doesn't seem not specifying a type is making the reconciliation query super slow.

diegodlh commented 2 years ago

Great query! In this query, derived from yours, I changed the redirects for the items they redirect to. Only three classes (commented out) do not have work as superclass: legal hearing (Q545861), instant message (Q30070565), video recording (Q30070675).

A work is decribed as a "physical or virtual object made by humans". I wonder whether instant message and video recording have not been made subclasses of work because they may be made by non-humans as well. But given that email (Q30070439) and audio recording (Q3302947) do have work as superclass, we may argue in Wikidata that instant message and video recording should too. I wonder whether we could make a similar point between legal hearing (Q545861) and interview (Q178651) or presentation (Q604733).

Anyway, that wouldn't take us any lower in the hierarchy than work.

Alternatively, can we run the untyped query with no type in it at all? Then there shouldn't be a problem with openrefine and once #52 is resolved it should be easy for users to disambiguate anyway.

I think that every time we deal with the reconcile function we should consider the pros and cons of being too strict or too flexible. As pointed out in #101:

there seem to be too opposing tensions in play here: being too stringent about item matching, which may result in users creating duplicates in Wikidata, and being too flexible about it, which may result in Cita assigning wrong QIDs to items (what happened in your bug report). Being strict for exact matches, and leaving room for flexible matching as partial matches (i.e., suggestions) may be a good middle point.

I think your suggestion may be a good one. Probably it never occurred to me that running an unrestricted query would take longer than running a restricted one! Let's go that way, but make sure we continue treating untyped matches as partial matches.

Dominic-DallOsto commented 2 years ago

Just did a proper test with the same setup, running each twice:

So it seems the measurements are a bit noisy, but it's not making much of a difference. From looking at openrefine it seems the type filtering happens after they search for the title, so not having a type shouldn't slow down the query...

diegodlh commented 2 years ago

I benchmarked it, getting QIDs for all 43 items in #141. It took 9.3 s to find exact matches for 29 of these items. So very roughly it doesn't seem not specifying a type is making the reconciliation query super slow.

Now that you mention this, given that:

  1. a batch reconciliation will use reconcile with options.partial = false (see #90),
  2. untyped-query results will be forced to non-exact matches, and
  3. non-exact matches will be ignored if options.partial === false.

What do you think about not doing the untyped queries at all if options.partial === false?

Dominic-DallOsto commented 2 years ago

I think your suggestion may be a good one. Probably it never occurred to me that running an unrestricted query would take longer than running a restricted one! Let's go that way, but make sure we continue treating untyped matches as partial matches.

Yeah, if untyped matches are always only partial matches that a user has to check, I think it should be OK to be slightly less restrictive.

Dominic-DallOsto commented 2 years ago

What do you think about not doing the untyped queries at all if options.partial === false?

Ahh yes, that makes sense!

diegodlh commented 2 years ago

Cool! If you could add that, it'd be great.

Dominic-DallOsto commented 2 years ago

Yep, will do! I had a quick look but it's not trivial because the code at the moment assumes we always have the untyped query results so I'd have to restructure things a little bit. This might take me a couple of days, so I don't know if you'd prefer to merge this to restore QID fetching functionality, then make a new issue to implement this change? Whatever you think.

diegodlh commented 2 years ago

Sure, sounds reasonable. I'll create a separate issue now. Thank you!!