Closed Dominic-DallOsto closed 2 years ago
I'm not sure creative_work covers all the types we care about?
For the typed query, wikidata.js
's reconcile
function uses a Zotero item type → QID map. This map is derived from Zotero's QuickStatements translator. Maybe we could run a SPARQL query in Wikidata to identify the most specific common ancestor (i.e., parent class) for all QIDs in the map. If it is creative_work
or something more specific, I think we should be OK with using creative_work
. What do you think?
could we send the typed query first, then only send the untyped query if the typed one doesn't return anything?
Probably this was the original idea, as described here. I must have included both in the same request to minimize the number of requests. It should be OK to split them and only make a second untyped request for any query which returned zero exact matches for the typed request.
I think I correctly wrote a SPARQL query here which takes all the types from the translator and tries to find their common superclasses (the commented out types don't work because they're redirects I think). Unfortunately the only one is entity, which is even less restrictive than work. I think the issue is that things like video recording aren't subclasses of work.
Alternatively, can we run the untyped query with no type in it at all? Then there shouldn't be a problem with openrefine and once #52 is resolved it should be easy for users to disambiguate anyway.
I just tried the second option and it works nicely.
I benchmarked it, getting QIDs for all 43 items in #141. It took 9.3 s to find exact matches for 29 of these items. So very roughly it doesn't seem not specifying a type is making the reconciliation query super slow.
Great query! In this query, derived from yours, I changed the redirects for the items they redirect to. Only three classes (commented out) do not have work
as superclass: legal hearing (Q545861), instant message (Q30070565), video recording (Q30070675).
A work
is decribed as a "physical or virtual object made by humans". I wonder whether instant message and video recording have not been made subclasses of work
because they may be made by non-humans as well. But given that email (Q30070439) and audio recording (Q3302947) do have work
as superclass, we may argue in Wikidata that instant message and video recording should too. I wonder whether we could make a similar point between legal hearing (Q545861) and interview (Q178651) or presentation (Q604733).
Anyway, that wouldn't take us any lower in the hierarchy than work
.
Alternatively, can we run the untyped query with no type in it at all? Then there shouldn't be a problem with openrefine and once #52 is resolved it should be easy for users to disambiguate anyway.
I think that every time we deal with the reconcile
function we should consider the pros and cons of being too strict or too flexible. As pointed out in #101:
there seem to be too opposing tensions in play here: being too stringent about item matching, which may result in users creating duplicates in Wikidata, and being too flexible about it, which may result in Cita assigning wrong QIDs to items (what happened in your bug report). Being strict for exact matches, and leaving room for flexible matching as partial matches (i.e., suggestions) may be a good middle point.
I think your suggestion may be a good one. Probably it never occurred to me that running an unrestricted query would take longer than running a restricted one! Let's go that way, but make sure we continue treating untyped matches as partial matches.
Just did a proper test with the same setup, running each twice:
type: creative_work
- 6.3 s and 6.3 sSo it seems the measurements are a bit noisy, but it's not making much of a difference. From looking at openrefine it seems the type filtering happens after they search for the title, so not having a type shouldn't slow down the query...
I benchmarked it, getting QIDs for all 43 items in #141. It took 9.3 s to find exact matches for 29 of these items. So very roughly it doesn't seem not specifying a type is making the reconciliation query super slow.
Now that you mention this, given that:
reconcile
with options.partial = false
(see #90), options.partial === false
.What do you think about not doing the untyped queries at all if options.partial === false
?
I think your suggestion may be a good one. Probably it never occurred to me that running an unrestricted query would take longer than running a restricted one! Let's go that way, but make sure we continue treating untyped matches as partial matches.
Yeah, if untyped matches are always only partial matches that a user has to check, I think it should be OK to be slightly less restrictive.
What do you think about not doing the untyped queries at all if options.partial === false?
Ahh yes, that makes sense!
Cool! If you could add that, it'd be great.
Yep, will do! I had a quick look but it's not trivial because the code at the moment assumes we always have the untyped query results so I'd have to restructure things a little bit. This might take me a couple of days, so I don't know if you'd prefer to merge this to restore QID fetching functionality, then make a new issue to implement this change? Whatever you think.
Sure, sounds reasonable. I'll create a separate issue now. Thank you!!
Fixes #149
More of a hack than a proper fix but at least it gets things working again. I'm not sure
creative_work
covers all the types we care about?As a further change - could we send the typed query first, then only send the untyped query if the typed one doesn't return anything?