WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

How does the Wikidata graph split affect Scholia? #2423

Open · dpriskorn opened this issue 9 months ago

dpriskorn commented 9 months ago

Context

See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/IIA5LVHBYK45FSMLPIVZI6WXA5QSRPF4/

Question

  1. How many queries need to be rewritten?
  2. Can all of them be rewritten without adverse effects like timeouts?
  3. How much effort is it to rewrite them?
  4. Can the rewriting be automated somehow?

Daniel-Mietchen commented 9 months ago

Thanks for keeping an eye on this, @dpriskorn! The link in your post does not work for me, so here is another link to what is probably the same message.

We are looking into the matter and do not have good answers to your questions yet, but here are some guesstimates:

  1. How many queries need to be rewritten?
  2. Can all of them be rewritten without adverse effects like timeouts?
    • Not if the timeout settings remain the same, since federation adds complexity. Working with a static dataset might have some performance benefits, though.
  3. How much effort is it to rewrite them?
    • We need to review all queries to determine whether they are affected, i.e. whether they (a) still run on either of the new main or scholarly endpoints and (b) give the same results as the full endpoint. This could probably be largely automated in a matter of hours by someone who understands the matter; see the sketch after this list.
    • For any queries that fail to run, or whose results differ in substance, we would need to rewrite them. Assuming an average of 5-10 minutes per query across Scholia's several hundred queries, that means something on the order of a person-week of work time. I suspect that some queries might not work usefully at all, so we would need to change their functionality.
    • Perhaps we need a dedicated hackathon just for such adaptations of Scholia queries.
  4. Can the rewriting be automated somehow?
    • To some extent, yes; see also the discussion in #2412.
    • On the way, we could consider interactions with efforts to document SPARQL queries (e.g. as discussed here) or to modularize them (examples).
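
As a concrete starting point for the automated review mentioned in item 3: a cheap probe could be run against the full endpoint and against each experimental split, comparing the returned numbers before diving into the full queries. A minimal sketch, assuming the experimental endpoint URLs used elsewhere in this thread and using Q13442814 (scholarly article) purely as an illustration:

# Hypothetical probe: run on the full endpoint and on each split, then
# compare the counts; a mismatch flags the class of items as affected.
SELECT (COUNT(?work) AS ?n) WHERE {
  ?work wdt:P31 wd:Q13442814 .   # instance of: scholarly article
}

Classes and properties whose counts agree across endpoints are probably safe; queries touching the ones that disagree would go into the manual review pile.
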
fnielsen commented 9 months ago

I have tried one here: https://synia.toolforge.org/#author/Q18618629, just changing the endpoint. There were issues (see the "Recent publications from experimental scholarly endpoint" table).

egonw commented 9 months ago

just changing the endpoint

@fnielsen, I think the split will mean federated SPARQL queries over the two servers. Did you try that already? And where can I find the SPARQL behind "Recent publications from experimental scholarly endpoint"? I could not spot the link to the matching query service. Is there no QS for the individual endpoints yet? That would make development a lot more difficult.

egonw commented 9 months ago

Can all of them be rewritten without adverse effects like timeouts?

@dpriskorn, no, I don't think so. This initial split suffers from the problem we highlighted in a telecon last year: queries break and cannot easily be fixed with SPARQL. The key problem is that statements (like P2860) have their subject and object split over the two QSs... this will require figuring out which statements have content in both (or multiple) QSs, then fusing that data, before moving on to the next statement.

An example query that returns empty results: https://w.wiki/98JL
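
For illustration, a minimal sketch of the workaround this implies, to be run on the scholarly endpoint (assuming statements are stored with their subject item; wd:Q21090025 is a placeholder QID standing in for some target paper, not taken from the linked query):

# Sketch: works citing one target paper now have to be collected from both
# splits, because the citing items are divided between them.
SELECT DISTINCT ?citing WHERE {
  { ?citing wdt:P2860 wd:Q21090025 . }   # citation statements on this split
  UNION
  { SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?citing wdt:P2860 wd:Q21090025 .   # citation statements on the main split
    } }
}

And this is the easy case: as soon as the pattern grows beyond one link, the number of split combinations grows with it.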

egonw commented 9 months ago

I just tried rewriting it, but it's nasty because essential info is split over the two resources (to be run at https://query-scholarly-experimental.wikidata.org/):

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?role where {
  # get the intention types from the "main" WDQS
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?intention wdt:P31 wd:Q96471816 .
  }

  # get the citing works from the "main" WDQS
  {
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the citing works from the "scholarly" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info (only available from the "main" WDQS)
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?role
order by ?year

It times out.

egonw commented 9 months ago

When I run the query from the main endpoint I get closer, and it runs in reasonable time:

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?venue_ ?role where {
  # get the intention types from the "main" WDQS
  ?intention wdt:P31 wd:Q96471816 .

  # get the articles from the "scholarly" WDQS
  {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the articles from the "main" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info: venue
  # get the venue info from the local ("main") WDQS
  OPTIONAL {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?venue_ ?role
order by ?year

But you can see from the results that the venue information is split over the two QSs (the above query misses some venue info). As soon as I try looking up the venue info from both instances, it times out again.
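
For reference, a sketch of the shape of that double lookup (not my exact attempt), run from the main endpoint:

# Sketch: fetch P1433 (published in) from both splits; layering this on top
# of the citation subqueries above is what pushes the query over the timeout.
SELECT ?work ?venue WHERE {
  { ?work wdt:P1433 ?venue_ . }          # venue statements on this split
  UNION
  { SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      ?work wdt:P1433 ?venue_ .          # venue statements on the other split
    } }
  ?venue_ rdfs:label ?venue .
  FILTER (LANG(?venue) = "en")
}
LIMIT 100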

egonw commented 9 months ago

(crossposted from the Telegram Wikicite channel)

I just finished a query that shows how content is scattered over the two splits: https://w.wiki/98km

One of the powers of SPARQL is the ability to search the links (the "web" of the graph), unlike, for example, label searching. But searching for a link (a Statement, in Wikidata terms) becomes hard when those links are split too: you effectively have to search in both QSs. This is what I tried yesterday with https://github.com/WDscholia/scholia/issues/2423#issuecomment-1936978903 (above), but since SPARQL commonly involves a pattern of two or more links, this is not trivial at all. Indeed, I ran into timeouts. I do not think this is special to Scholia; it applies to any tool that uses SPARQL where Statements are split over the two instances. Of course, the linked query just looks at one direct claim, and the GitHub issue shows that the "two or more links" case already arises with qualifiers.
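
To make the combinatorics concrete, a hedged sketch (illustrative properties, run on the scholarly endpoint) of what a complete rewrite of a two-link pattern looks like when each triple may sit on either split:

# Sketch: a two-hop pattern needs 2^2 = 4 UNION branches to be complete when
# each triple can live on either split; a k-hop pattern needs 2^k branches.
SELECT ?a ?c WHERE {
  { ?a wdt:P2860 ?b . ?b wdt:P1433 ?c . }   # both triples local
  UNION
  { ?a wdt:P2860 ?b .                       # first local, second remote
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?b wdt:P1433 ?c . } }
  UNION
  { SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?a wdt:P2860 ?b . }                   # first remote, second local
    ?b wdt:P1433 ?c . }
  UNION
  { SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      ?a wdt:P2860 ?b . ?b wdt:P1433 ?c . } }   # both remote
}
LIMIT 10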

Basically, splitting works if the content can be split. But the power of Wikidata is that it captures the complexity of human language with machine readability, and qualifiers are all over the place. So when I say "I feel that Wikidata has failed", more accurately I should say "the query service has failed", and I think the QS is an essential part of the ecosystem (also for Wikibase, for that matter). This is just opinion. Let me stress: the problems are real and we need a real solution, and this real solution is hard. This split is not the first solution being sought. The Scholia project has been actively looking into alternatives, including a dedicated WDQS, a QS with a lag (but see the notes about loading times being days rather than hours), and the subsetting work (see https://content.iospress.com/articles/semantic-web/sw233491). It is complicated, and five years ago I was naive and optimistic that computer science would develop a scalable triple store with a SPARQL endpoint that meets Wikidata's needs. Sadly, the CS field did not live up to my hopes. So my tears (":(") are real, and the scalability problems Wikidata is seeing are important and, to me, very serious and nothing to joke about.

fnielsen commented 9 months ago

Where can I find the SPARQL of the Recent publications from experimental scholarly endpoint?

https://synia.toolforge.org/#author/Q18618629 - third table

egonw commented 9 months ago

https://synia.toolforge.org/#author/Q18618629 - third table

Yes, got that :) But unlike the other tables, this one does not have a link to the matching query service. I wanted to see the SPARQL itself, not the results.

I think I should be able to find it in the wiki itself, but I wrote the Synia setup long enough ago that I cannot easily find it.

fnielsen commented 9 months ago

I wanted to see the SPARQL itself, not the results.

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

egonw commented 9 months ago

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

Thanks. Now that I have seen the query, I think that one runs into exactly the problem I experienced and tried to describe.

dpriskorn commented 9 months ago

Instead of rewriting queries or worrying about the split, I suggest we focus on running QLever ourselves and improving it to do what we want, regardless of Wikidata's growth. See the discussion I started: https://github.com/WDscholia/scholia/discussions/2425

physikerwelt commented 9 months ago

I discussed this briefly with @Daniel-Mietchen today. To me it seems that the one-time split does not conceptually solve any scaling issues and should not be done in the way currently planned. If done, it should be transparent to the user, i.e. the query might be executed on different back-ends, but users should not be required to change their queries.

egonw commented 9 months ago

i.e. the query might be executed on different back-ends, but it should not be required to change the query.

What I found is that this is not trivial at all: you cannot simply run a query on both endpoints and then merge the results.
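
A hedged example of why, with illustrative properties: take an aggregate like the one below. Run separately on each endpoint with the two counts added up, it misses pairs whose two triples sit on different splits (they match on neither endpoint alone), and it would double-count anything duplicated across the splits.

# Sketch: per-endpoint counts of this query cannot simply be summed.
SELECT (COUNT(*) AS ?citations) WHERE {
  ?citing wdt:P2860 ?cited .   # citation statement, stored with the citing item
  ?cited  wdt:P577  ?date .    # publication date, stored with the cited item
}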

WolfgangFahl commented 8 months ago

See #2412 for a mitigation path.