diegodlh / zotero-cita

Cita: a Wikidata addon for Zotero with citations metadata support
GNU General Public License v3.0

Fetch an open access PDF via P953 #235

Open Futur3r opened 1 year ago

Futur3r commented 1 year ago

Wikidata can function as a hub to automatically find an open access PDF version for a Zotero item if it has a QID.

Describe the solution you'd like

If the Zotero PDF scraper doesn't find an open access PDF of an article on a webpage, Cita could fetch an open access URL for this article via the property P953 of the Wikidata item, if available, and pass it back to the Zotero PDF scraper for an automatic second try.

For this task, the Hub could be used, for example by building a URL of this form: https://hub.toolforge.org/[QID]?property=P953. The Hub would return the value of P953, for example for the item Q114149071 -> test. Note: if the property doesn't exist, the Hub returns the URL of the item on wikidata.org (test), so a simple if statement would be needed to check whether the item has a P953 statement.
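The check described above could look roughly like this. It is only a sketch: the function names are hypothetical, and it assumes the fallback can be recognised by the resolved URL being on the wikidata.org domain. Only the URL pattern itself comes from the proposal.

```typescript
// Hypothetical helpers; only the Hub URL pattern comes from the proposal above.
const HUB_BASE = "https://hub.toolforge.org";
const FULL_TEXT_PROPERTY = "P953"; // "full work available at URL"

/** Builds the Hub URL that resolves to the P953 value of a Wikidata item. */
function buildHubUrl(qid: string): string {
  if (!/^Q[1-9]\d*$/.test(qid)) {
    throw new Error(`Not a valid QID: ${qid}`);
  }
  return `${HUB_BASE}/${qid}?property=${FULL_TEXT_PROPERTY}`;
}

/**
 * Returns true when the URL the Hub resolved to is on wikidata.org,
 * which is assumed here to mean that the item has no P953 statement.
 */
function isWikidataFallback(resolvedUrl: string): boolean {
  const host = new URL(resolvedUrl).hostname;
  return host === "wikidata.org" || host.endsWith(".wikidata.org");
}

/** Returns the open access URL, or undefined if the Hub fell back to Wikidata. */
function openAccessUrlFromHub(resolvedUrl: string): string | undefined {
  return isWikidataFallback(resolvedUrl) ? undefined : resolvedUrl;
}
```

The network request is left out on purpose: the caller would follow the Hub redirect with whatever HTTP helper Cita already uses, then pass the final URL to `openAccessUrlFromHub`.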

The automatic way:

  1. the user uses the Zotero browser add-on
  2. as the Zotero item is created, Cita fetches its QID if available
  3. if Zotero fails to download the PDF, Cita fetches the P953 URL and passes it to the Zotero PDF scraper

Note: to avoid over-complicating things for the user, if Cita doesn't find the QID on the first try, it should not show any error, maybe just log a debug() message.
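As a rough sketch of the automatic flow above: every name below is hypothetical, and the callbacks stand in for Zotero and Cita internals that are not specified in this issue. It only illustrates the order of the steps and the "debug instead of error" behaviour.

```typescript
// All names are hypothetical; the callbacks stand in for Zotero and Cita internals.
interface PdfFallbackDeps {
  /** Returns the item's QID, or undefined if Cita could not find one. */
  fetchQid: () => Promise<string | undefined>;
  /** Zotero's own PDF lookup; resolves to true if a PDF was attached. */
  zoteroFindPdf: () => Promise<boolean>;
  /** Returns the P953 URL for a QID, or undefined if there is no such statement. */
  fetchP953: (qid: string) => Promise<string | undefined>;
  /** Hands a URL to the PDF scraper; resolves to true on success. */
  attachFromUrl: (url: string) => Promise<boolean>;
  /** Debug logger, used instead of user-facing errors. */
  debug: (msg: string) => void;
}

type PdfSource = "zotero" | "wikidata" | "none";

async function findPdfWithWikidataFallback(deps: PdfFallbackDeps): Promise<PdfSource> {
  // Step 2: fetch the QID when the item is created; a miss is only logged.
  const qid = await deps.fetchQid();
  if (qid === undefined) {
    deps.debug("No QID found for this item");
  }
  // Step 3: let Zotero try first, and only fall back to P953 if it fails.
  if (await deps.zoteroFindPdf()) {
    return "zotero";
  }
  if (qid === undefined) {
    return "none";
  }
  const url = await deps.fetchP953(qid);
  if (url === undefined) {
    deps.debug(`${qid} has no P953 statement`);
    return "none";
  }
  return (await deps.attachFromUrl(url)) ? "wikidata" : "none";
}
```

Passing the steps in as callbacks keeps the sketch independent of how Cita and Zotero would actually expose them.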

The semi-automatic way:

  1. the user doesn't have a PDF attached to their Zotero item
  2. the user uses the function to fetch the QID of the item
  3. if, upon fetching the QID, the property P953 is available, Cita starts the Zotero PDF scraper with the P953 URL

Note: this assumes the user has enabled auto-scraping of PDFs via Wikidata in the Cita preferences (the functionality would probably be enabled by default).

The manual way:

  1. the user uses the "Find the PDF" functionality of Zotero in the right-click menu of the library
  2. if Zotero fails to download the PDF, Cita fetches the P953 URL and passes it to the Zotero PDF scraper

Note: some P953 values don't reference a PDF but a webpage with the full text, like this one. In that case, maybe Cita could open the page in the browser (like for QuickStatements), or a snapshot of the webpage could be made. Maybe this is something that could be added directly in the translator for this website, I don't know. Also, does the Zotero PDF scraper need the URL of the PDF directly, or does it use a translator to find the URL of the PDF on a webpage?
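One possible way to tell the two cases apart, sketched under assumptions: the function names are hypothetical, and the rule (trust a Content-Type header if one is known, otherwise look at the file extension) is only a heuristic, not something Zotero or Cita is known to do.

```typescript
// Hypothetical heuristic for deciding what to do with a P953 URL.
type P953Action = "attach-pdf" | "open-or-snapshot";

/** True if the URL path ends in ".pdf" (case-insensitive). */
function looksLikePdfUrl(url: string): boolean {
  return new URL(url).pathname.toLowerCase().endsWith(".pdf");
}

/** True if a Content-Type header value denotes a PDF, ignoring parameters. */
function isPdfContentType(contentType: string): boolean {
  const mediaType = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return mediaType === "application/pdf";
}

/**
 * Prefers the Content-Type when the caller already knows it (for example
 * from a HEAD request), and falls back to the file extension otherwise.
 */
function chooseP953Action(url: string, contentType?: string): P953Action {
  const isPdf =
    contentType !== undefined ? isPdfContentType(contentType) : looksLikePdfUrl(url);
  return isPdf ? "attach-pdf" : "open-or-snapshot";
}
```

A URL without a `.pdf` extension can still serve a PDF, so the extension check alone would miss some cases; that is why the header takes priority when available.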

The zotero-scihub add-on implements similar functionalities.

Futur3r commented 1 year ago

Or, the easy way: when Cita has fetched the QID of a Zotero item, it just replaces the URL value of the item with the P953 one. That way there is no need to modify the existing PDF scraper of Zotero.
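A minimal sketch of this "easy way", assuming an item object with `getField`/`setField` methods like those on Zotero items. The interface and the function name are hypothetical stand-ins so the snippet runs outside Zotero.

```typescript
// Minimal stand-in for the parts of a Zotero item used here (assumption).
interface UrlFieldItem {
  getField(field: "url"): string;
  setField(field: "url", value: string): void;
}

/**
 * Replaces the item's URL with the P953 value.
 * Returns true if the field was changed, false if there was nothing to do.
 */
function applyOpenAccessUrl(item: UrlFieldItem, p953Url: string | undefined): boolean {
  if (p953Url === undefined || p953Url === item.getField("url")) {
    return false;
  }
  item.setField("url", p953Url);
  return true;
}
```

Inside Zotero the change would still have to be saved afterwards (for example with the item's `saveTx()`). Note that this overwrites the original URL, so keeping the old value somewhere may be worth considering.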

Also, an option "Fetch Open Access URL" could be added in the menus.

Dominic-DallOsto commented 1 year ago

This looks like how the attachment is added

https://github.com/ethanwillis/zotero-scihub/blob/ecc63def1bea5cee3e342f832f8c743f4d3b61a0/content/zoteroUtil.ts#L7

Basically we just give a URL to the PDF and Zotero should do the rest.

Adding this as a new function should be easy enough. I'd have to check how easy/hard it is to integrate a new PDF provider into Zotero's "PDF finder".

Is there any way we can quantify the rough number of items (for scholarly articles) that have P953 but Zotero won't already find a PDF for them? Or at least what proportion of scholarly items on WD have P953? Like, is this change likely to find a lot of PDFs that wouldn't already be found?

Futur3r commented 1 year ago

And I think the Zotero function for the PDF scraper is this one.

I started to code an option in the items submenu of the library (the easy way). I am adding the option "Fetch Open Access URLs". It may be redundant, but it is easier for me to code. I'll make a PR.

There are currently 2 564 303 WD items with a P953 statement. The best query would be this one, but it times out. I got 1 351 783 just for scholarly articles that have a P953. The total number of scholarly articles on WD is 38 856 462.
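To put these counts in proportion, here is a quick calculation. It uses only the three numbers quoted above; the variable names are just for illustration.

```typescript
// Counts quoted in this comment (query results at the time of writing).
const itemsWithP953 = 2564303;
const articlesWithP953 = 1351783;
const allArticles = 38856462;

/** Formats part/whole as a percentage with two decimals. */
function percent(part: number, whole: number): string {
  return `${((part / whole) * 100).toFixed(2)}%`;
}

// Share of scholarly articles on WD that have a P953 statement.
const articleCoverage = percent(articlesWithP953, allArticles); // "3.48%"

// Share of all P953 statements that belong to scholarly articles.
const articleShareOfP953 = percent(articlesWithP953, itemsWithP953); // "52.72%"

console.log(articleCoverage, articleShareOfP953);
```

So roughly 3.5% of scholarly articles on WD have a P953 statement. This does not say how many of those Zotero would fail to find a PDF for on its own.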

And the data in WD is constantly increasing and is easy for humans to check and contribute to, so the numbers will only go up. Also, WD is kind of the only way to do this for any scientific work, anywhere on the web. I frequently find articles, book sections, etc. behind paywalls but available on ResearchGate or HAL.

Futur3r commented 1 year ago

I've heard that some years ago, the EU passed a law that authorizes European researchers to publish the manuscript of their papers anywhere they want, 6 months after the date of publication in a journal.

So this functionality can be quite handy.