dpriskorn opened this issue 9 months ago
Thanks for keeping an eye on this, @dpriskorn ! The link in your post does not work for me, so here is another link to what is probably the same message.
We are looking into the matter and do not have good answers to your questions yet, but here are some guesstimates:
Queries will need to be rewritten so that they (a) run against the main or scholarly endpoints and (b) give the same results as the full endpoint. This could probably be largely automated in a matter of hours by someone who understands the matter.

I have tried one here: https://synia.toolforge.org/#author/Q18618629, just changing the endpoint. There were issues with the "Recent publications from experimental scholarly endpoint" table.
@fnielsen, I think the split will mean federated SPARQL queries over the two servers. Did you try that already? Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table? I could not spot the link to the matching query service. Does it not have a QS for the individual endpoints yet? That would make development a lot more difficult.
Can all of them be rewritten without adverse effects like timeout?
@dpriskorn, no, I don't think so. This initial split suffers from the problem we highlighted in a telcon last year: queries break and cannot easily be fixed within SPARQL. The key problem is that statements (like P2860) have their subject and object split over the two QS-s... this requires figuring out which statements have content in both (multiple) QS-s, then doing a fusion of that data, before moving on to the next statement.
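To make that fusion concrete, here is a minimal sketch (a hypothetical illustration, not one of Scholia's queries, assuming the standard WDQS prefixes and the two experimental endpoints) of what it means for just one such statement:
# Minimal sketch of the fusion for a single statement: the same P2860/P3712
# pattern is evaluated locally and via SERVICE on the other endpoint, and the
# two result sets are merged with UNION before any further pattern is joined.
# To be run at https://query-scholarly-experimental.wikidata.org/
select distinct ?work ?intention where {
  {
    # statement stored in the scholarly graph
    ?work p:P2860 / pq:P3712 ?intention .
  }
  UNION
  {
    # statement stored in the main graph
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?work p:P2860 / pq:P3712 ?intention .
    }
  }
  # only after this fusion can the next statement be handled,
  # e.g. ?intention wdt:P31 wd:Q96471816 via the main endpoint
}
LIMIT 100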
Example query that returns empty is this one: https://w.wiki/98JL
I just tried rewriting it, but it's nasty because essential info is split over the two resources (to be run at https://query-scholarly-experimental.wikidata.org/):
select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?role where {
# get the intention types from the "main" WDQS
SERVICE <https://query-main-experimental.wikidata.org/sparql> {
?intention wdt:P31 wd:Q96471816 .
}
# get the citing works from the "main" WDQS
{
SERVICE <https://query-main-experimental.wikidata.org/sparql> {
select distinct ?work (min(?years) as ?year) ?type_ where {
?work wdt:P577 ?dates ;
p:P2860 / pq:P3712 ?intention .
bind(str(year(?dates)) as ?years) .
OPTIONAL {
?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
}
} group by ?work ?type_
}
}
UNION
# get the citing works from the "scholarly" WDQS
{
select distinct ?work (min(?years) as ?year) ?type_ where {
?work wdt:P577 ?dates ;
p:P2860 / pq:P3712 ?intention .
bind(str(year(?dates)) as ?years) .
OPTIONAL {
?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
}
}
group by ?work ?type_
}
hint:Prior hint:runFirst true .
# now look up some additional info (only available from the "main" WDQS)
SERVICE <https://query-main-experimental.wikidata.org/sparql> {
?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
MINUS { ?venue_ wdt:P31 wd:Q1143604 }
}
bind(
coalesce(
if(bound(?type_), ?venue,
'other source')
) as ?role
)
}
group by ?year ?type_ ?role
order by ?year
It times out.
When I run the query from the main endpoint, I get closer, and it runs in reasonable time:
select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?venue_ ?role where {
# get the intention types from the "main" WDQS
?intention wdt:P31 wd:Q96471816 .
# get the articles from the "scholarly" WDQS
{
SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
select distinct ?work (min(?years) as ?year) ?type_ where {
?work wdt:P577 ?dates ;
p:P2860 / pq:P3712 ?intention .
bind(str(year(?dates)) as ?years) .
OPTIONAL {
?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
}
} group by ?work ?type_
}
}
UNION
# get the articles from the "main" WDQS
{
select distinct ?work (min(?years) as ?year) ?type_ where {
?work wdt:P577 ?dates ;
p:P2860 / pq:P3712 ?intention .
bind(str(year(?dates)) as ?years) .
OPTIONAL {
?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
}
}
group by ?work ?type_
}
hint:Prior hint:runFirst true .
# now look up some additional info: venue
# get the venue info from the "scholarly" WDQS
OPTIONAL {
?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
MINUS { ?venue_ wdt:P31 wd:Q1143604 }
}
bind(
coalesce(
if(bound(?type_), ?venue,
'other source')
) as ?role
)
}
group by ?year ?type_ ?venue_ ?role
order by ?year
But you can see from the results that the venue information is split over the two QS-s (the above query missed venue info). As soon as I try looking up the venue info from both instances, it times out again.
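To make "looking up the venue info from both instances" concrete, here is a minimal sketch of that fused venue lookup (a hypothetical illustration, not the exact query I ran; the works-of-interest pattern is simplified). Combined with the rest of the query above, this is the part that times out:
# Minimal sketch of the fused venue lookup; in the real query the set of works
# is itself fused from both endpoints, which makes things heavier still.
# To be run at https://query-main-experimental.wikidata.org/
select ?work ?venue where {
  ?work p:P2860 / pq:P3712 [] .                 # some set of works of interest
  {
    ?work wdt:P1433 ?venue_ .                   # venue statement in the main graph
  }
  UNION
  {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      ?work wdt:P1433 ?venue_ .                 # venue statement in the scholarly graph
    }
  }
  ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
  MINUS { ?venue_ wdt:P31 wd:Q1143604 }
}
LIMIT 100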
(crossposted from the Telegram Wikicite channel)
I just finished a query that shows how content is scattered over the two splits: https://w.wiki/98km One of the powers of SPARQL is being able to search the linking (the "web"), unlike, for example, label searching. But if we search for a link (a Statement, in Wikidata terms), this becomes hard when those links are split too: you effectively have to search in both QSs. This is what I tried yesterday with https://github.com/WDscholia/scholia/issues/2423#issuecomment-1936978903 (above), but since SPARQL queries commonly include patterns of two or more links, this is not trivial at all. Indeed, I ran into timeouts. I do not think this is special to Scholia; it applies to any tool that uses SPARQL where Statements are split over the two instances. Of course, this query just looks at one direct claim, and the GitHub issue above shows the "two or more" case, where qualifiers come in as well.
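For illustration, here is a minimal sketch (a hypothetical query, not the w.wiki one above) of what "search in both QSs" means for a single link:
# Minimal sketch: list P2860 links whose object has no data in the local graph
# and can therefore only be followed via the other endpoint.
# To be run at https://query-scholarly-experimental.wikidata.org/
select ?work ?cited where {
  ?work wdt:P2860 ?cited .
  FILTER NOT EXISTS { ?cited ?p ?o }            # nothing about the cited item here...
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?cited wdt:P31 ?class .                     # ...but it is described in the main graph
  }
}
LIMIT 20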
Basically, splitting works if the content can be split. But the power of Wikidata is the complexity of human language combined with machine readability, and qualifiers are all over the place. So, when I say that "I feel that Wikidata has failed", more accurately I should say that "the query service has failed", and I think the QS is an essential part of the ecosystem (also for Wikibase, for that matter). This is just my opinion. Let me stress: the problems are real and we need a real solution, and that real solution is hard. This split is not the first solution that has been sought. The Scholia project has been actively looking into alternatives, including a dedicated WDQS, a QS with a lag (but see the notes about loading times being days rather than hours), and the subsetting work (see https://content.iospress.com/articles/semantic-web/sw233491). It is complicated, and five years ago I was naive and optimistic that computer science would develop a scalable triple store with a SPARQL endpoint that meets Wikidata's needs. Sadly, the CS field did not live up to my hopes. So, my tears (":(") are real, and the scalability problems that Wikidata is seeing are important, to me very serious, and nothing to joke about.
Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table?
https://synia.toolforge.org/#author/Q18618629 - third table
Yes, got that :) But unlike the other tables, this one does not have a link to the matching query service. I wanted to see the SPARQL itself, not the results.
I think I should be able to find it in the wiki itself, but I wrote the Synia setup too long ago to easily find it now.
Thanks. Now that I have seen the query, I think that one runs into exactly the problem I experienced and tried to describe.
Instead of rewriting or bothering about the split, I suggest we focus on running QLever ourselves and improving it to do what we want, no matter the growth of Wikidata. See the discussion I started: https://github.com/WDscholia/scholia/discussions/2425
I discussed that briefly with @Daniel-Mietchen today. To me it seems that the one-time split does not conceptually solve any scaling issues and should not be done in the way it is currently planned. If it is done, it should be done transparently to the user, i.e. the query might be executed on different back-ends, but it should not be required to change the query.
i.e. the query might be executed on different back-ends, but it should not be required to change the query.
What I found is that this is not trivial at all: you cannot simply run a query on both endpoints and then merge the results.
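A minimal sketch of why, using a citation count as a hypothetical example: running the same aggregate on each endpoint and adding the two numbers double-counts every work that has P2860 statements in both graphs, so the merge has to happen before the aggregation, inside one federated query:
# Minimal sketch: the merge has to happen inside one query (UNION before the
# count), not by adding up per-endpoint results.
# To be run at https://query-scholarly-experimental.wikidata.org/
# (and even this federated form may well time out)
select (count(distinct ?work) as ?citing_works) where {
  {
    ?work wdt:P2860 [] .                        # citing works in the scholarly graph
  }
  UNION
  {
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?work wdt:P2860 [] .                      # citing works in the main graph
    }
  }
}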
See #2412 for a mitigation path.
Context
See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/IIA5LVHBYK45FSMLPIVZI6WXA5QSRPF4/
Question
How many queries need to be rewritten? Can all of them be rewritten without adverse effects like timeout? How much effort is it to rewrite? Can the rewriting be automated somehow?