WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

WDQS scaling issues meetings in February #1806

Closed dpriskorn closed 2 years ago

dpriskorn commented 2 years ago

Context

https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/KPA3CTQG2HCJO55EFZVNINGVFQJAHT4W/

Question

Is it a good idea to participate and deliver our perspective?

Daniel-Mietchen commented 2 years ago

@dpriskorn Probably yes.

For the record, this relates to:

  1. WDQS scaling community meeting 1/2: SPARQL query features - Thursday, February 17 · 18:00 UTC
  2. WDQS scaling community meeting 2/2: RDF store backend needs - Monday, February 21 · 18:00 UTC

fnielsen commented 2 years ago
Daniel-Mietchen commented 2 years ago

See also https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Feb_2022_scaling_community_meetings

dpriskorn commented 2 years ago

I created the request for a bot flag for So9qBot earlier and met resistance, presumably out of fear of breaking or overloading Blazegraph and thus rendering WDQS unusable for everyone.

Since I helped raise the issue in the Wikidata Telegram channel and with the product manager, a lot has happened.

WMF started analyzing the issue in depth and now we have a disaster playbook. 🥳

WMF also very recently began evaluating alternatives to Blazegraph, and we now have a rough timeline of 2-3 years until the problem is completely solved, assuming a competent team of engineers is dedicated to the task and funded appropriately.

I intend to finish implementing the original idea behind asseeibot soon, so that it can add a new item for each DOI found in Wikipedia that is currently missing from Wikidata.

This bot, if approved by the community, will increase the number of scientific items by 10-15% at a pace the community can control. If all direct references to the DOIs found are also imported, the number of items will probably double over time, to around 80M.
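For reference, the core lookup such a bot needs is a check of whether a DOI already has an item. Below is a minimal sketch against the public WDQS endpoint, assuming the DOI property P356 and Wikidata's upper-cased DOI strings; the function name and User-Agent are illustrative, not asseeibot's actual code:

```python
import requests

# Public Wikidata Query Service endpoint; P356 is the DOI property.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def doi_has_item(doi: str) -> bool:
    """Return True if some Wikidata item already carries this DOI (P356)."""
    # DOIs are stored upper-cased in Wikidata, so normalise before comparing.
    escaped = doi.upper().replace("\\", "\\\\").replace('"', '\\"')
    query = 'ASK { ?item wdt:P356 "%s" . }' % escaped
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "doi-gap-check-sketch/0.1 (example)"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["boolean"]

if __name__ == "__main__":
    # Placeholder DOI; a real run would feed in DOIs extracted from Wikipedia.
    print(doi_has_item("10.1234/EXAMPLE.DOI"))
```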

I also plan to create an importer for Refcat from the Internet Archive (IA):

This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts [...]

This will bring our collection of papers in Wikidata up to around 120M in total. 🎉
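To make the shape of such an import concrete, here is a rough sketch of turning Refcat-style DOI-to-DOI citation pairs into "cites work" (P2860) statements expressed as QuickStatements rows. The function names are hypothetical, the lookups are unbatched, and a real importer would of course run through a proper bot framework and the bot-approval process:

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def qid_for_doi(doi: str) -> str | None:
    """Return the QID of the item holding this DOI (P356), or None."""
    escaped = doi.upper().replace("\\", "\\\\").replace('"', '\\"')
    query = 'SELECT ?item WHERE { ?item wdt:P356 "%s" . } LIMIT 1' % escaped
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "refcat-import-sketch/0.1 (example)"},
        timeout=30,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    if not bindings:
        return None
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]

def citation_rows(doi_pairs):
    """Yield tab-separated QuickStatements rows adding P2860 (cites work)
    for pairs where both the citing and the cited paper already have items."""
    for citing_doi, cited_doi in doi_pairs:
        citing, cited = qid_for_doi(citing_doi), qid_for_doi(cited_doi)
        if citing and cited:
            yield f"{citing}\tP2860\t{cited}"
```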

Nobody knows whether these two imports will push Blazegraph past the breaking point, but IMO that is not a big issue, since only 1% of all queries in WDQS would be affected, and since the data simply existing, curated and editable by any scientist in the world, will probably be an epic game changer. I predict that a majority of scientists will want to be well represented in our graph within 2-5 years.

If WDQS breaks, the playbook is enacted and we can continue working, but without SPARQL support from WMF during an interim period. This would of course affect Scholia a lot; we would have to set up our own SPARQL endpoint with the data somewhere. My hope is that, with the help of IA and others in the Scholia and WikiCite communities, we would succeed in getting a working endpoint up in less than 2 weeks.
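One way to keep Scholia-style code portable between WDQS and a self-hosted fallback is to make the endpoint configurable. A minimal sketch; the environment variable name is illustrative and not an existing Scholia setting:

```python
import os
import requests

# Defaults to WDQS, but a self-hosted endpoint (e.g. a Blazegraph, QLever or
# Virtuoso instance loaded from the Wikidata dumps) can be swapped in via the
# environment without code changes.
SPARQL_ENDPOINT = os.environ.get(
    "SCHOLIA_SPARQL_ENDPOINT", "https://query.wikidata.org/sparql"
)

def run_query(query: str) -> dict:
    """POST a SPARQL query to whichever endpoint is configured."""
    response = requests.post(
        SPARQL_ENDPOINT,
        data={"query": query},
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "scholia-fallback-sketch/0.1 (example)",
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```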

The upside of this for Scholia is that all the new papers and citations would take us to a level of completeness on par with the closed commercial databases currently used by scientists, databases which completely lack both the openness and empowerment of Wikidata and the graph-powered features for efficiently finding author networks and new papers.

Since it takes roughly 3 months on average, according to a study I read, for a scientific article to be included in Wikipedia, my bot alone would keep us only about 3 months behind the bleeding edge of science publications.

We would have to find another approach to get closer to real-time import of scientific papers as they are published.
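As one candidate approach (not a committed plan), we could poll Crossref's REST API for recently indexed works and feed the missing DOIs into the same item-creation pipeline. A rough single-page sketch, using only documented Crossref parameters:

```python
import requests

CROSSREF_API = "https://api.crossref.org/works"

def recently_indexed_dois(since: str, rows: int = 100):
    """Yield DOIs of works Crossref has indexed since `since` (YYYY-MM-DD).

    A single-page sketch; a production feed would use cursor-based paging
    and identify itself properly for Crossref's polite pool."""
    params = {
        "filter": f"from-index-date:{since}",
        "select": "DOI",
        "rows": rows,
    }
    response = requests.get(
        CROSSREF_API,
        params=params,
        headers={"User-Agent": "doi-feed-sketch/0.1 (mailto:user@example.org)"},
        timeout=60,
    )
    response.raise_for_status()
    for item in response.json()["message"]["items"]:
        yield item["DOI"]

if __name__ == "__main__":
    for doi in recently_indexed_dois("2022-02-21", rows=10):
        print(doi)
```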

Daniel-Mietchen commented 2 years ago

The notes from the first meeting on Thursday Feb 17 about SPARQL query features sit at https://etherpad.wikimedia.org/p/R5n382Ld0Vvykc7Ak3iH .

I could not attend but @fnielsen and @dpriskorn did, and I went through the notes later on, particularly adding examples of Scholia queries that time out.

The next meeting on RDF store backend needs is today at 18:00 UTC, and I'll try to be there.

Daniel-Mietchen commented 2 years ago

The WDQS scaling call just ended, and I found it useful. Notes in https://etherpad.wikimedia.org/p/yPUhyhbmXglC_Magx0Go .

fnielsen commented 2 years ago

Blogpost: "What SPARQL keywords do we use in Scholia?"
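For anyone who wants to reproduce a count like the one in the blogpost, a small script along these lines should do. The glob pattern is a guess, so point it at wherever the Scholia SPARQL templates actually live:

```python
import glob
import re
from collections import Counter

# Keywords to tally; extend as needed.
KEYWORDS = [
    "SELECT", "ASK", "CONSTRUCT", "OPTIONAL", "FILTER", "UNION",
    "GROUP BY", "ORDER BY", "GROUP_CONCAT", "SAMPLE", "COUNT",
    "MINUS", "VALUES", "SERVICE", "BIND", "DISTINCT",
]

def keyword_counts(pattern: str = "scholia/app/templates/*.sparql") -> Counter:
    """Count how many query files use each SPARQL keyword at least once."""
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as fh:
            text = fh.read().upper()
        for keyword in KEYWORDS:
            if re.search(r"\b" + keyword.replace(" ", r"\s+") + r"\b", text):
                counts[keyword] += 1
    return counts

if __name__ == "__main__":
    for keyword, n in keyword_counts().most_common():
        print(f"{n:4d}  {keyword}")
```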

Daniel-Mietchen commented 2 years ago

Official summary: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS-scaling-update-feb-2022