amchagas / open-hardware-supply

having a closer look on how OSH papers are evolving over time
MIT License

Text API for getting text or downloading papers #1

Closed (matiasandina closed this 5 months ago)

matiasandina commented 4 years ago

Explore here

amchagas commented 4 years ago

Some more options: https://www.biostars.org/p/47992/ and https://github.com/JabRef/jabref, plus this plugin: https://github.com/lehner/LocalCopy

amchagas commented 4 years ago

Another rogue/bazooka approach would be to download all of Sci-Hub and filter by DOI: https://opendata.stackexchange.com/questions/7084/bulk-download-sci-hub-papers

amchagas commented 4 years ago

Here is the link to the package @amgfernandes suggested: https://github.com/billgreenwald/Pubmed-Batch-Download The downside of this one is that it is limited to PubMed, which restricts the number of papers we can get, since PubMed only covers the bio/medical fields.

matiasandina commented 4 years ago

JabRef sounds too good to be true but might be worth a try! Let's do that

> Another rogue/bazooka approach would be to download all of Sci-Hub and filter by DOI: https://opendata.stackexchange.com/questions/7084/bulk-download-sci-hub-papers

I don't think this is the best way. Ultimately, if we come up with a list of papers that we need and can't automatically find, we can try to get them through alternative means (e.g., a direct request to the author).

amchagas commented 4 years ago

I'm super divided about this idea of using Sci-Hub as a resource. I agree that it has important implications, but at the same time, if we use it, we make a contribution to the open-access debate (even if in a convoluted way?)

Also, https://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone illustrates that kind of "we are all doing it, we are just not admitting to it" situation.

amchagas commented 4 years ago

But to be clear: if this brings discomfort, then we should definitely NOT use it.

matiasandina commented 4 years ago

I have no discomfort; it was one of the first ideas that came to me. But would this be something that gets the paper "blocked/inadmissible"?

amchagas commented 4 years ago

https://onlinelibrary.wiley.com/doi/epdf/10.1111/cid.12815 this is obviously behind a paywall... :roll_eyes: I wonder which tool I could use to access it... :grimacing:

amchagas commented 4 years ago

I didn't read them yet, but there are a few papers that seem to debate this idea of using Sci-Hub for systematic reviews, etc.

matiasandina commented 4 years ago

amchagas commented 4 years ago

:rofl: So the previous link was a letter to the editor, commenting on this paper https://onlinelibrary.wiley.com/doi/epdf/10.1111/cid.12706, which seems to have used Sci-Hub.

solstag commented 4 years ago

Ni! Hi folks. I'm really sorry I missed the meeting today, that sucks on my part.

So, full text mining is an inception of nightmares.

  1. First, you need to get the URI for the full-text of the object. Currently the most standardized official way to access full texts is the Crossref API. Other than that, you have specific APIs for some publishers, open-access or not.

  2. Then, you have to download the file. If the paper is not Open Access, this will require being in an authorised network, and even then may require manually clicking through to confirm you have access. You might sometimes be able to use the Unpaywall API to get around some restrictions, such as closed access papers that have a version on arXiv.

  3. Then, you have to be able to process the format that the file is in. Step 1 may give you a choice of formats for step 2. But even if you get here, no two PDFs or HTML pages are the same, so it is not so simple to extract information. Any information displayed in mathematical language or graphics will require special processing if you want to take it into account.

  4. Even in a best-case scenario, the above steps will only work for part of the papers you want to retrieve. Some papers may only exist as part of a collection, such as the annals of events. Others will simply not have a proper paper-level DOI. Some may not be found through Crossref. Some might be there but not provide a link to the full text. And finally, some may give you full text, but in weird formats or text that does not correspond to the paper, such as, again, the annals of conferences.

(And, of course, all that is just about downloading, and comes before considering that the resulting full-text of papers may be quite heterogeneous, some will be conference abstracts, extended abstracts, letters, notes, short articles where the bulk of the methods are in supporting information, full articles tens of pages long, book chapters, and even entire books.)
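To make steps 1 and 2 concrete, here is a minimal sketch (not the project's code) of asking the Crossref REST API for candidate full-text links for one DOI; the DOI and the mailto address are placeholders:

```python
"""Minimal sketch: look up full-text links for a DOI via the Crossref REST API.
The DOI and the mailto address are placeholders; any returned link may still
require an authorised network, or an Unpaywall lookup, to actually download."""
import requests

doi = "10.1371/journal.pbio.1002086"  # placeholder DOI

resp = requests.get(
    f"https://api.crossref.org/works/{doi}",
    headers={"User-Agent": "open-hardware-supply (mailto:someone@example.org)"},
    timeout=30,
)
resp.raise_for_status()
message = resp.json()["message"]

# The 'link' field (when present) lists candidate full-text URLs with their
# content types; publishers are not obliged to fill it in.
for link in message.get("link", []):
    print(link.get("content-type"), link.get("URL"))
```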

In order to illustrate: the ContentMine project, for one, has been building tools to mine full text from multiple sources that provide it in diverse but standardized formats, such as PubMed Central, arXiv, IEEE, etc. One tool is called getpapers. They also had a second tool called norma that would consolidate these papers into a single standardized format, plus a trove of other tools, for example to extract information from plots. In any case, they were mostly concerned with open-access material. Getting through the paywalls across publishers without huge resources is a no-go.

Now, what happens is that most people not doing systematic reviews, where only a relatively small number of papers (a few hundred) are required, do not work with full text. They try to get the best out of whatever metadata is available: citations, abstracts, keywords, authors, institutions, etc.

I think in the present case we are somewhere between the typical size for a systematic review and the size for an actual large-scale analysis. So it is tempting to get the full text. But be aware that it will require compromises and a patchwork of methods for all the different sources. In any case, the best thing is to start with the easy things, like trying to get all the full text we can from the CrossRef data, then evaluate what we could not gather and decide what is worth going after. Another option is to start working without the full text, using the bibliographic data, in particular the abstracts and keywords, to see how far we can go. This would also help us make later choices regarding full-text compromises.

Abraços! And again sorry for skipping the meeting today.

matiasandina commented 4 years ago

Hi, our automatic approach will have caveats, but we are interested in word tabulation for coarse filtering, author-metadata parsing, and maybe using the full text to train an ML classifier to broadly predict the scientific area.

Summarizing the alternatives that I think can work

1 - Bulk download from Sci-Hub (and the different arXivs?), then parse the text from the PDFs.
2 - Use the CrossRef API from an authorized account, as suggested, and download the text directly from Python with the DOI. Save the text into a local file and parse it. If the text is not available but we want that paper, save it for manual download? (A sketch of this option follows below.)
3 - Use the API suggested by @amgfernandes to download from PubMed, then parse the text from the PDFs.
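A minimal sketch of what option 2 could look like once a download URL has been resolved for a DOI (the DOI, URL, and file names below are placeholders):

```python
"""Minimal sketch for option 2: download a resolved full-text file and save it
locally, keeping the DOI in the filename so it can be matched back later.
The DOI, URL, and file names are placeholders, not real project data."""
from pathlib import Path
import requests

doi = "10.1000/example.doi"                     # placeholder
url = "https://example.org/fulltext/paper.pdf"  # link resolved via Crossref/Unpaywall
out_dir = Path("fulltexts")
out_dir.mkdir(exist_ok=True)

resp = requests.get(url, timeout=60)
if resp.ok and resp.headers.get("Content-Type", "").startswith("application/pdf"):
    # Use a filesystem-safe version of the DOI as the filename.
    (out_dir / (doi.replace("/", "_") + ".pdf")).write_bytes(resp.content)
else:
    # Not available: keep it on a list for manual download.
    with open("manual_download.txt", "a") as fh:
        fh.write(doi + "\n")
```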

solstag commented 4 years ago

Ni! Cool. So,

Downloading from SciHub poses a lot of issues and imo should be left as a last resort. People have been sued and threatened for doing data mining like that, and it sends the wrong message that "everything is fine, we can datamine with SciHub, no need for those legal full-text databases". If there is a significant number of papers we can't get from open archives and can't imagine manually downloading, then we can discuss whether to use SciHub or simply ignore them.

I also think we'd be better off hard-splitting the task into three: (1) decide on criteria and produce a list of DOIs/IDs that interest us, (2) get metadata for those, and (3) get the full text. They're quite different tasks, and the best sources for search, metadata, and full text are not the same. It would also allow us to move forward independently with analysis on each front. This would take some restructuring of the repo and perhaps the code as well. I can do the repo if you're all fine with that.

For task 3, I have a hunch that the best way forward is to use Unpaywall to find whatever open-access versions are available for each paper, download them, and then do an assessment of the remaining DOIs/IDs to decide on how to proceed. In any case, it is not useful to plan ahead if we don't know what we'll be missing.
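For reference, a minimal sketch of such an Unpaywall lookup (the DOI and e-mail are placeholders; the API only asks for an e-mail address as identification):

```python
"""Minimal sketch: ask Unpaywall for the best open-access location of a DOI.
The DOI and e-mail address are placeholders."""
import requests

doi = "10.1038/nature12373"  # placeholder DOI
email = "someone@example.org"

resp = requests.get(
    f"https://api.unpaywall.org/v2/{doi}",
    params={"email": email},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

best = record.get("best_oa_location")
if best:
    print("open-access copy:", best.get("url_for_pdf") or best.get("url"))
else:
    print("no open-access copy found; keep for later assessment")
```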

For task 2, as I had discussed with Andre, a mix of Scopus, WoS, CrossRef and {bio,}ArXiv should do the trick. We'll only need some logic to merge those databases and treat duplicates.
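A minimal sketch of that merge/dedup logic, assuming each source has been exported to a CSV with at least a DOI column (file and column names are placeholders):

```python
"""Minimal sketch for merging metadata exports from several sources and
dropping duplicates by normalized DOI. File names and column names are
placeholders for whatever the real exports look like."""
import pandas as pd

sources = {
    "scopus": "scopus.csv",
    "wos": "wos.csv",
    "crossref": "crossref.csv",
    "biorxiv": "biorxiv.csv",
}

frames = []
for name, path in sources.items():
    df = pd.read_csv(path)
    df["source"] = name
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
# Normalize DOIs (case, whitespace, resolver prefix) before deduplicating.
merged["doi_norm"] = (
    merged["doi"].astype(str).str.strip().str.lower()
    .str.replace(r"^https?://(dx\.)?doi\.org/", "", regex=True)
)
merged = merged.drop_duplicates(subset="doi_norm", keep="first")
```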

Finally, for task 1, it seems you're moving along with building a decent list of terms. I also suggested to Andre that it would be interesting for someone to read papers and annotate excerpts that represent information that interests us.

[ ]'s

matiasandina commented 4 years ago

The Sci-Hub comments sound sensible.

There's something I don't fully understand about the 3 splits. In order to get the list of DOIs of interest, don't we need to filter from metadata (title, abstract)?

Unpaywall seems super exciting, and they openly share the code to do the analysis. Of note, see this quote from here:

> In total, 31,159,960 journal articles published between 2008 and 2018 were included in Unpaywall. For 11,633,886 articles, Unpaywall was able to link a DOI to at least one freely available full-text (37 %). This means that around every third scholarly journal article published since 2008 is currently openly available.

I think the Unpaywall R version is by far the easiest (but please challenge this). They also share the analysis they did for the paper here

If I understand correctly, Unpaywall would give us links to available papers that we would then need to download, right? So parsing PDFs is still something we have to do anyway?

Grabbing the full text from a PDF is somewhat straightforward (we have somewhat functional code in the repo). I also think we should split the pipeline so it's more functional; would this parsing of PDFs be the next action?

So the full pipeline looks like:

1) Somehow make a list of DOIs
2) Hit the Unpaywall API for the open links and download
3) Parse the PDFs (see the sketch below)
4) Do the analysis of the actual questions
(4b) Go back and tweak all steps to make it better
5) Write the thing
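For step 3, a minimal sketch of pulling plain text out of the downloaded PDFs, assuming pdfminer.six (any other PDF-to-text library would do) and the placeholder folder names from the earlier sketch:

```python
"""Minimal sketch for step 3: extract plain text from downloaded PDFs.
Assumes pdfminer.six is installed; folder names are placeholders."""
from pathlib import Path

from pdfminer.high_level import extract_text

pdf_dir = Path("fulltexts")
txt_dir = Path("fulltexts_txt")
txt_dir.mkdir(exist_ok=True)

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        text = extract_text(str(pdf_path))
    except Exception as exc:  # malformed PDFs are common; log and move on
        print(f"failed on {pdf_path.name}: {exc}")
        continue
    (txt_dir / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
```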

amchagas commented 4 years ago

Hey! Just to say that I'm following this, but have nothing to add :P and up for changing repo and code as needed!

amgfernandes commented 4 years ago

I will post here some links that we may want to have a look at:

https://api.semanticscholar.org/

https://www.connectedpapers.com/

solstag commented 4 years ago

Ni!

That sounds good, matias. So in 4b we could go back to 2 and check what is missing and whether we can get it by other means, for example closed-access articles, which may be more representative of earlier efforts that we wouldn't want to miss.

About accessing the unpaywall API, there's a python wrapper too:

https://pypi.org/project/unpywall/

Since Python is imo better equipped for the ensuing analysis, it might help to keep our pipeline simpler, but I'm not opposed to using any other language.
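A quick sketch of how the wrapper could be used; the e-mail and DOI are placeholders, and the exact call names are taken from the unpywall docs, so treat them as assumptions rather than verified code:

```python
"""Quick sketch of the unpywall wrapper. The e-mail and DOI are placeholders,
and the call names follow the package docs, so treat them as assumptions."""
from unpywall.utils import UnpywallCredentials
from unpywall import Unpywall

UnpywallCredentials("someone@example.org")  # Unpaywall asks for an e-mail

# Should return a pandas DataFrame with the open-access info for each DOI.
df = Unpywall.doi(dois=["10.1038/nature12373"])
print(df.head())
```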

.~´

solstag commented 4 years ago

Ni! I've just remembered about these folks; it may be useful as a replacement for or addition to Unpaywall, with the advantage that their API provides machine-readable full text (supposedly).

https://core.ac.uk/
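A rough sketch of what querying CORE could look like; the endpoint, parameters, and field names follow my reading of their v3 API docs, so treat them as assumptions, and an API key (free registration) is required:

```python
"""Rough sketch of querying the CORE API for works with full text.
The endpoint, parameters, and field names are assumptions based on the v3
docs; the API key below is a placeholder."""
import requests

API_KEY = "YOUR-CORE-API-KEY"  # placeholder

resp = requests.get(
    "https://api.core.ac.uk/v3/search/works",
    params={"q": "open source hardware", "limit": 10},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()

for work in resp.json().get("results", []):
    # 'fullText' should only be present when CORE has a machine-readable copy.
    print(work.get("doi"), bool(work.get("fullText")))
```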