WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

Scholia is only as good as the data on studies etc in Wikidata which is miserable – bulk-imports from datasets are needed #2436

Open prototyperspective opened 6 months ago

prototyperspective commented 6 months ago

What is the issue? Most studies (and books) are not in Scholia since they're not in Wikidata.

Why is this a problem? The platform's value is determined by the quality and extent of the data in Wikidata. However, most books and papers are not yet imported into its structured format. Scholia could start to become truly useful, with AI-assigned "main subject" data (see e.g. #1896 #1733 #1730 for topic-related use cases) and statistical charts, if maybe 40% of all studies, or 60% of cited/notable ones, were included. Currently it seems not even 5% of all studies have been integrated (for example, not even most of the studies in the uppermost altmetrics percentiles whose images I've uploaded here).

How could this be addressed? It could be solved by bulk-importing (and updating/refining) data from some database using a script. Please see my post about this here, which links to several such datasets that are either potential sources or readily available.
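To make this more concrete, here is a rough sketch (not a finished importer) of how records from one openly licensed source could be turned into QuickStatements batch commands. The OpenAlex endpoint and field names are written from memory, and the property choices are assumptions that would need checking against existing WikiCite modelling:

```python
import requests

# Hypothetical sketch: fetch a page of works from the OpenAlex API and emit
# QuickStatements V1 commands for new items.
# Assumed properties: P31 = instance of (Q13442814 = scholarly article),
# P356 = DOI, P1476 = title, P577 = publication date.
OPENALEX = "https://api.openalex.org/works"

def works_to_quickstatements(filter_expr: str, per_page: int = 25) -> str:
    resp = requests.get(OPENALEX, params={"filter": filter_expr, "per-page": per_page})
    resp.raise_for_status()
    lines = []
    for work in resp.json().get("results", []):
        title = (work.get("display_name") or "").replace('"', "'")
        doi = (work.get("doi") or "").replace("https://doi.org/", "")
        date = work.get("publication_date")  # e.g. "2024-01-01"
        if not (title and doi and date):
            continue  # skip incomplete records rather than create stub items
        lines += [
            "CREATE",
            f'LAST\tLen\t"{title}"',
            "LAST\tP31\tQ13442814",
            f'LAST\tP356\t"{doi}"',
            f'LAST\tP1476\ten:"{title}"',
            f"LAST\tP577\t+{date}T00:00:00Z/11",  # day precision
        ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Example filter expression; a real import would deduplicate against existing DOIs first.
    print(works_to_quickstatements("from_publication_date:2024-01-01"))
```

A real bulk import would of course need deduplication, licensing checks and community approval; this only illustrates that the mechanical part is simple.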

What are good places to discuss this? Here and at the linked page, as well as maybe some other Wikidata venue that is less focused on books and more on scientific papers.

egonw commented 6 months ago

Thanks for the cross-link! Often changes on the Wikidata side need corresponding updates here. The book example with versions and editions is an important one.

Scholia indeed just visualizes what is in Wikidata, and the aim here is to make Scholia as useful as possible (without resulting in timed-out queries). Scholia should not, imho, be a platform to discuss what Wikidata can or cannot handle. That is, mass-importing data is to be discussed on Wikidata (as it is in this case). But the simple fact is that the current Wikidata platform is not as scalable as everyone would love it to be.

fnielsen commented 6 months ago

I am under the impression that the bots that imported and annotated WikiCite data have been switched off due to fears of the Wikidata Query Service running into trouble.

andrawaag commented 6 months ago

I disagree. Bulk importing is not a solution. I have switched off some if not all of my bulk-importing bots, not out of fear of the WDQS getting into trouble, but because bulk importing will make Wikidata less useful. We don't know the exact number of all books and papers, but even the most conservative estimates give a number that is way bigger than the number of current Wikidata items. So trying to achieve complete recall is basically impossible.

IMO we should try to build AI-ready corpora not by increasing the coverage in Wikidata, but by creating independent RDF graphs on books and papers using the Wikidata namespace. Building RDF graphs of the size of Wikidata (or bigger) is relatively easy if they are constructed directly as RDF graphs (i.e. without having to rely on the limitations of the Wikibase API). This approach still requires Wikidata bots, because main-subject items will still need to be minted.
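A minimal sketch of that idea, assuming rdflib and using placeholder identifiers: an independent graph can reuse the Wikidata entity/property namespaces so it stays queryable in the same vocabulary, while the paper nodes themselves live in a project-local namespace.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS, XSD

# Wikidata namespaces, reused so the graph is interoperable with WDQS-style queries.
WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")
# Papers that never get a Wikidata item live in a separate, project-local namespace.
EX = Namespace("https://example.org/paper/")

g = Graph()
g.bind("wd", WD)
g.bind("wdt", WDT)

paper = EX["W0000000001"]                        # placeholder external identifier
g.add((paper, WDT.P31, WD.Q13442814))            # instance of: scholarly article
g.add((paper, RDFS.label, Literal("An example paper title", lang="en")))
g.add((paper, WDT.P356, Literal("10.1234/EXAMPLE.DOI")))             # DOI
g.add((paper, WDT.P577, Literal("2024-01-01", datatype=XSD.date)))   # publication date
# Main-subject items (P921) would still be minted on Wikidata itself,
# so topic nodes point into the real wd: namespace (placeholder QID here).
g.add((paper, WDT.P921, WD["Q1"]))

print(g.serialize(format="turtle"))
```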

prototyperspective commented 2 months ago

bulk importing will make Wikidata less useful

The opposite is the case. Currently it's not really useful in the real world, at least when it comes to studies, but that would change once the data is more complete.

We don't know the exact number of all books and papers, but even the most conservative estimates give a number that is way bigger than the number of current Wikidata items.

Not true. I have to admit that the number of Wikidata items is lower than I thought and the share of studies larger. However, per this page the count of items is somewhere around 111 million at this point. ScienceOpen contains most studies and currently has 95 million items. So if preprints aren't added (note many of them have already been imported) and some of the most notable items not in ScienceOpen (maybe OpenAlex has them) were included, one could estimate the total at somewhere around 150 million items. That's indeed larger than the current number of WD items, but not by much; it would roughly double its size, and this wouldn't include books or food products, which would be good to import as well. I think Scholia would start to become useful, recommendable, used and not misleading once it reaches roughly parity with ScienceOpen, and that would be below the current number of items.

So trying to achieve complete recall is basically impossible.

It's not impossible, and I see no reason why it would be; rather, there are many demonstrations that bulk imports work quite well and could be scaled up. I don't know whether the imports are done from a local server or written remotely via an API, or what the current ways to improve performance would be (such as caching).

If that were the case, then why even spend time developing Scholia? If no more bulk importing is done, then it is kind of a waste of time and not useful. For example, charts of a person's or a topic's number of studies per year are otherwise misleading rather than useful to human users and AI. They give a wrong picture of whatever is being looked at.

Sorry if this sounds a bit hurtful, but otherwise I don't think there is much potential for Scholia except as a UI for Wikidata users to more easily spot issues, similar to how people use WikiFlix to improve film-related data (instead of using it to watch films or anything else), and even there the usefulness may be absent or minimal.

because bulk importing will make Wikidata less useful.

There is no reason why that would be the case; why would it make Wikidata less useful? For example, in the search results one could filter away scholarly articles so they don't show up (potentially even with the click of a button), as in the sketch below.
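A sketch of such a filter against the regular MediaWiki search API, assuming the CirrusSearch haswbstatement keyword behaves the way I remember:

```python
import requests

# Hypothetical sketch: search Wikidata but exclude scholarly articles (Q13442814)
# from the results using the statement-based search keyword.
API = "https://www.wikidata.org/w/api.php"

def search_without_papers(term: str, limit: int = 10):
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"{term} -haswbstatement:P31=Q13442814",
        "srlimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params)
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]

print(search_without_papers("insulin"))
```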

As for issues with the data, the scripts should be well written and well tested/investigated, and when things go wrong this can be fixed with scripts. There are further ways to mitigate issues, such as locking bot-imported articles so they can only be edited by bots or after an unlock request, or something like that.
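As one concrete safeguard, a dry-run duplicate check against the query service could look roughly like this, using the DOI (P356) as the matching key; the details are only illustrative:

```python
import requests

# Sketch: before creating an item, check whether a DOI is already present in Wikidata,
# so a bulk import adds statements to the existing item instead of creating a duplicate.
WDQS = "https://query.wikidata.org/sparql"

def existing_item_for_doi(doi: str):
    # DOIs are conventionally stored uppercase in P356.
    query = f'''
    SELECT ?item WHERE {{
      ?item wdt:P356 "{doi.upper()}" .
    }} LIMIT 1
    '''
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers={"User-Agent": "bulk-import-dry-run/0.1"})
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["item"]["value"] if bindings else None

print(existing_item_for_doi("10.1000/example-doi"))  # placeholder DOI
```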

AI-ready corpora

That is not what one would think of as Scholia's purpose; I never thought of Scholia only as a tool for training AI. For example, why is it sometimes linked from Wikipedia articles if it's only intended for AI? But even then, incomplete data would mistrain AIs so they produce flawed results, among similar issues.

but by creating independent RDF graphs on books and papers using the Wikidata namespace

Well, maybe I misunderstood you above; I'll leave the above unedited nevertheless... I see no reason why Wikidata items couldn't be made as performant as RDF graphs, if those are indeed more performant. Wikidata items could be converted to RDF graph nodes, essentially cached and used as such, and updated whenever a Wikidata item is edited; the full item would only need to be retrieved when a human opens it in the Wikidata interface. This is just a broad outline. For example, the API doesn't have to be used; it's the default, but in theory one could write directly to where the data is stored in batch data upgrades.
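A very rough sketch of that outline, assuming the Special:EntityData RDF export and the standard revisions API (the cache layout itself is made up for illustration):

```python
import pathlib
import requests

# Sketch: keep a local Turtle cache of an item's RDF and refresh it only when the
# item's latest revision timestamp changes, i.e. when someone actually edited it.
API = "https://www.wikidata.org/w/api.php"
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"
CACHE = pathlib.Path("rdf-cache")
CACHE.mkdir(exist_ok=True)

def latest_revision_timestamp(qid: str) -> str:
    params = {"action": "query", "prop": "revisions", "titles": qid,
              "rvprop": "timestamp", "rvlimit": 1, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values()))["revisions"][0]["timestamp"]

def cached_turtle(qid: str) -> str:
    stamp = latest_revision_timestamp(qid)
    path = CACHE / f"{qid}@{stamp.replace(':', '-')}.ttl"
    if not path.exists():
        ttl = requests.get(ENTITY_DATA.format(qid=qid)).text
        path.write_text(ttl, encoding="utf-8")
    return path.read_text(encoding="utf-8")

print(cached_turtle("Q42")[:200])  # Q42 as a small, well-known example item
```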

I also commented in the WikiProject Books thread as well as on the talk pages of two editors whose bots imported lots of studies; more input would be appreciated.
I think it's critical that data for a type/field like 'academic articles' becomes fairly complete, so that actual use cases/applications can be built, and that sustainable, efficient automated bulk imports are designed for that.