OpenJournal / central

Universalizing Open-Access Journals & Papers
Creative Commons Zero v1.0 Universal

Indexing OAI-PMH non-compliant repositories #8

Open wetneb opened 8 years ago

wetneb commented 8 years ago

So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.

But unfortunately, many repositories do not provide such an interface. And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record. Reaching out to the repository admins to encourage them to expose their metadata correctly is, from my experience, not effective at all.

If we want to go beyond this, I think we need to crawl!

  • One (conceptually) simple option would be to crawl for PDF files, extract metadata from them and dump this in a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy.
  • The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any idea? The scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options.

How many resources (servers) do we need for this? Where could we get them?

mekarpeles commented 8 years ago

@wetneb thanks for this excellent summary. We can apply some pressure and communicate the importance of this initiative to the Internet Archive; they may be able to provide us with resources (disk/storage space). I can talk to Brewster this Friday about that possibility (a researcher VM with ~5 TB of space, for starters). At the very least, hopefully it will help us develop crawlers, write tutorials for the community to follow, and store documents temporarily. We can also use the openjournal user account to upload files directly to the Internet Archive, and they have an S3-style API for bulk upload. The Internet Archive also does OCR on papers/PDFs that are uploaded, which can be a big mutual win. Re: noisy data, I am not sure how accurate their OCR is for academic (especially math-heavy) works. Perhaps worth exploring as an experiment.
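
To make the bulk-upload part concrete, here's a rough sketch using the internetarchive Python client (the item identifier, file name, and metadata are placeholders, and you'd need IAS3 keys from `ia configure` for it to actually run):

```python
# pip install internetarchive ; credentials come from `ia configure` (IAS3 keys).
from internetarchive import upload

# Hypothetical item identifier, file, and metadata -- adjust per paper.
responses = upload(
    "openjournal-example-paper-0001",
    files=["paper.pdf"],
    metadata={
        "mediatype": "texts",
        "title": "Example paper title",
        "collection": "opensource",  # placeholder collection name
    },
)
print([r.status_code for r in responses])
```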

Regarding crawlers, perhaps we can create a document -- an awesome-list of all crawling initiatives for academic papers, journals, databases, and metadata. I have ideas on how we can leverage libraries like scrapy -- as you allude to, I'm sure there are existing tools. (I imagine creating n source-specific crawlers by reverse-engineering each source's indexing scheme, and having/maintaining a public repository of these crawlers, one per source, will be more successful / more comprehensive than a deep crawl.)
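
For instance, a per-source scrapy spider might look roughly like this (the repository URL and CSS selectors are hypothetical; every real source would need its own):

```python
import scrapy

class ExampleRepositorySpider(scrapy.Spider):
    """Hypothetical per-source spider: one of the n source-specific crawlers."""
    name = "example_repository"
    start_urls = ["https://repository.example.org/recent"]  # placeholder

    def parse(self, response):
        # Selectors are placeholders; each repository needs its own.
        for record in response.css("div.record"):
            yield {
                "title": record.css("h2.title::text").get(),
                "authors": record.css("span.author::text").getall(),
                "pdf_url": record.css("a.pdf::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Something like this can be run with `scrapy runspider spider.py -o records.jl`, which dumps one JSON record per line.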

@pietsch, this is a fairly comprehensive list of publicly accessible sources that use OAI-PMH, right?

cleegiles commented 8 years ago

Please see comments below.

It is important not to reinvent wheels. Some of what we discuss falls into that domain.

On 11/30/15 5:45 PM, Antonin wrote:

So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.

But unfortunately, many repositories do not provide such an interface. And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record. Reaching out to the repository admins to encourage them to expose their metadata correctly is, from my experience, not effective at all.

If we want to go beyond this, I think we need to crawl!

  • One (conceptually) simple option would be to crawl for PDF files, extract metadata from them and dump this in a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy. CiteSeerX is no longer limited to a particular field; it now crawls for all scholarly documents. Please note that some repositories prevent crawling with their robots.txt, except for Googlebot.

Yes, it is noisy, but not that bad, and it is always getting better. Others are doing this and contributing to the extraction algorithms. Please see our previous list of extraction methods. We may have a tutorial at a major conference soon.

  • The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any idea? The scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options.

Our crawling code is available on the CiteSeerX GitHub. We are always crawling, recently in collaboration with Semantic Scholar at AI2. We use Heritrix, an excellent tool.

How many resources (servers) do we need for this? Where could we get them?

This is storage and bandwidth intensive. The problems are:
  • crawling is very time-consuming and needs many coordinated parallel threads.
  • crawling brings back unwanted PDFs, which must be filtered or classified (a toy example of such a filter is sketched below). We have a few papers on this if anyone is interested.
  • a scholarly paper PDF is about one megabyte, so a million papers is roughly a terabyte. However, the other PDFs that come back usually have to be stored as well.

We've found a factor of 3 for all documents we expect to crawl. That is, we crawl roughly 3 times as many PDFs as the number of scholarly-paper PDFs we keep.
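
To illustrate the filtering step mentioned above, here is a toy post-extraction check (this is only a sketch; the classifiers in our papers use richer features and trained models):

```python
def looks_like_scholarly_paper(text: str) -> bool:
    """Toy filter for crawled PDFs, applied after text extraction."""
    lowered = text.lower()
    has_abstract = "abstract" in lowered[:5000]        # appears near the start
    has_references = "references" in lowered or "bibliography" in lowered
    long_enough = len(text.split()) > 1500             # rough floor for a full paper
    return has_abstract and has_references and long_enough
```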

We share our crawl seeds if anyone wants to use them.

Who here crawls?
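
And on the robots.txt caveat above, a quick standard-library check of whether a given host allows a non-Googlebot crawler (host, path, and user agent are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Placeholder host -- swap in a real repository to test.
rp = RobotFileParser("https://repository.example.org/robots.txt")
rp.read()

for agent in ("MyAcademicCrawler", "Googlebot"):
    ok = rp.can_fetch(agent, "https://repository.example.org/papers/123.pdf")
    print(agent, "allowed:", ok)
```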



cleegiles commented 8 years ago

These papers may be of help. One can publish improvements. We would very much like to improve ours. Suggestions most welcomed.

Best

Lee


mekarpeles commented 8 years ago

@cleegiles Thanks for the excellent contribution. Also, for the record, I view OpenJournal as a working group for reducing (exactly as you say) reinvented wheels. So I'm on board! I don't anticipate building anything directly through OpenJournal -- if I contribute to something, it will be an existing project (unless there's a very compelling reason why something new needs to be built, and even then I'd prefer someone else lead it, since I have limited bandwidth).

There are several folks affiliated w/ the Internet Archive I know who are working on crawling efforts. Some aren't comfortable announcing their work yet. @nthmost can likely share her ideas for PubMed crawling. I have experience crawling and am happy to participate, but I do not currently work on a crawler.

@cleegiles I'll try to generate some traffic to this thread by pinging other institutions and seeing if they can weigh in on the status of their crawlers -- thanks for nudging us in that direction.

wetneb commented 8 years ago

@cleegiles Thanks for the clarification, it's great that CiteSeerX is not limited to comp. sci. anymore! Do not get me wrong, what CiteSeerX does is massively useful. But at http://dissem.in we need to match publications with preprints, which is quite hard as soon as the title or the authors differ by a few words. So, metadata quality is critical for us.
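
To give a feel for the problem, even a crude string-similarity check (standard-library difflib here; our actual matching in dissemin is more involved) shows how a few differing words move the score:

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    # Light normalization only; a real pipeline also compares authors, year, etc.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

published = "A Survey of Open Access Repository Crawling Techniques"
preprint = "Survey of Open-Access Repository Crawling Techniques (preprint)"
print(round(title_similarity(published, preprint), 2))
```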

I can imagine that CiteSeerX requires a lot of resources indeed. We cannot afford this for dissemin, which is why I thought using scrapers would be cheaper (in terms of bandwidth, storage, and computing) and could potentially yield cleaner metadata. But this option only works for repositories, not home pages.

I have been playing around with indexing researchgate.net recently (with crawling and scraping), and I have been in touch with Mike Taylor, who has started something along these lines for SSRN.

@mekarpeles Having a researcher VM would be amazing! And I'm really looking forward to hearing more crawling stories, especially from the Internet Archive!

davidar commented 8 years ago

And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record.

Yes, this is an issue we're trying to deal with in ipfs/archives#3 (enriching OAI-PMH metadata with fulltext links).

@cleegiles Is this something that CiteSeerX could help with?

wetneb commented 8 years ago

@davidar that's awesome! I hope you will succeed.

davidar commented 8 years ago

@wetneb well, my plan was basically the same as what you outlined in the OP, so I'm afraid I might be reinventing the wheel? Perhaps this is something we could collaborate on?

wetneb commented 8 years ago

@davidar I would love to! Joining the discussion there then.

pietsch commented 8 years ago

Hi @mekarpeles,

@pietsch, this is a fairly comprehensive list of publicly accessible sources that use OAI-PMH, right?

Yes, these are the 3881 OAI-PMH sources BASE is currently harvesting (in intervals). The other lists I am aware of are the Directory of Open Access Journals (DOAJ) and the official (if incomplete and out-of-date) list of OAI-PMH repositories.
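
If anyone wants to experiment with harvesting one of those sources, a minimal sketch with the Sickle OAI-PMH client looks roughly like this (the endpoint is a placeholder; any repository from the list would do):

```python
# pip install sickle
from sickle import Sickle

# Placeholder endpoint -- substitute any repository from the BASE list.
client = Sickle("https://repository.example.org/oai")
records = client.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for i, record in enumerate(records):
    dc = record.metadata  # Dublin Core fields as lists of strings
    print(dc.get("title"), dc.get("identifier"))
    if i >= 9:  # just peek at the first ten records
        break
```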

cleegiles commented 8 years ago

We do this with our crawler. If the link goes directly to a non-open-access publisher, there is no reason to crawl. We have a blacklist and a whitelist of where we go now, which we can share. It's fairly complete.

Another way is to parse the link to the document. Links that are not directly to PDFs are usually not downloadable. It would be useful to do a sample to see how often this is true, but we've found it to nearly always be the case. A counterexample is where one has to sign in, but anyone can have an account.
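
A toy version of that link heuristic, combining the URL check with a HEAD request for the content type (the URL is a placeholder, and the sign-in counterexample above would still slip through):

```python
import requests

def probably_downloadable_pdf(url: str, timeout: float = 10.0) -> bool:
    """Cheap check: direct .pdf links, confirmed by the Content-Type header."""
    if not url.lower().split("?")[0].endswith(".pdf"):
        return False
    try:
        head = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    return head.ok and "application/pdf" in head.headers.get("Content-Type", "")

print(probably_downloadable_pdf("https://repository.example.org/papers/123.pdf"))
```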


cleegiles commented 8 years ago

The best metadata seems to be the Web of Science (WoS) but it has to be purchased and is not cheap.

This seems to be a good source of metadata, but we have not compared it to ours: http://research.microsoft.com/en-us/projects/mag/ (the Microsoft Academic Graph).

We currently have a project to clean up our metadata and other metadata.


nthmost commented 8 years ago

Hi everybody,

As @mekarpeles mentioned, I've been working on PubMed collection efforts for over a year now. That code lives in the metapub project and is installable via PyPI (pip install metapub).

The primary purpose of the FindIt tool within metapub is to pull fulltext article matter (just PDFs right now) at high identity confidence. That is, if a researcher thinks they are getting PubMed ID #123456, the result of using FindIt should be exactly that article about 99% of the time.

Here's the overview of how it works. Starting from a PubMed ID, FindIt does the following steps on each article:

  • uses the PubMed ID to pull down the PubMed XML for the article
  • uses the PubMedCentral ID, if any, to produce a URL to a PDF
  • if not in PubMedCentral, looks up the journal name within the FindIt machinery to see if we can apply a known “dance” to get a PDF link on the publisher’s website
  • if the journal name is not currently filed in FindIt, reports as “NOFORMAT”

I can explain all of this in detail -- for now I think it suffices to say that my approach is very different from crawling or screen scraping, and this is very much by design.
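
For the curious, basic usage looks something like this (the PMID is just an example, and the exact attribute names are as I remember them from the docs):

```python
# pip install metapub
from metapub import FindIt

src = FindIt("23435529")  # example PubMed ID

if src.url:
    print("PDF:", src.url)
else:
    print("No PDF found:", src.reason)  # e.g. a "NOFORMAT" explanation
```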

I built and deployed this engine in production at a genetic testing/diagnostic company to save scientists time in tracking down the article texts they needed to research and prove the calls they were making on people's genetic test reports.

As a result, FindIt's coverage of the NCBI journal list is heavily skewed towards medical genetics, and its testing has focused on pubmed citations found in HGMD and Clinvar. This gave FindIt a nicely controlled constraint for its success; now it's time to branch out and try to complete its coverage across all PubMed domains.

I recently completed a long-running coverage test in which I iterated over every named NCBI journal from the Entrez list, found 3 to 5 article IDs per journal from different years (if possible), and then ran FindIt over those IDs. (Total PMIDs = ~117k.) I have yet to analyze these results, but will probably do so on the plane back from Hawai'i (where I've been hiding during the evolution of all this discussion).

After completing metapub's coverage of PubMed, I'm interested in starting a project with the same design constraints (i.e. high confidence, next-to-no actual scraping) that covers all journals that have DOIs. There is a lot of machinery in metapub that could apply to a broader swath of disciplines via the use of the CrossRef API and the dx.doi.org redirect.
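
A rough sketch of the kind of DOI-based lookup I mean (the DOI below is a placeholder; this is just the public CrossRef REST API plus the standard dx.doi.org redirect, not metapub code):

```python
import requests

doi = "10.1000/example.doi"  # placeholder DOI

# Bibliographic metadata from the public CrossRef REST API.
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
if resp.ok:
    work = resp.json()["message"]
    print(work.get("title"), work.get("container-title"))

# The dx.doi.org redirect resolves to the publisher's landing page.
landing = requests.get(f"https://dx.doi.org/{doi}", allow_redirects=True, timeout=10)
print("Lands at:", landing.url)
```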

I'm eager to get more involved with all of you!

cleegiles commented 8 years ago

Good thing to do.

If I understand what you are doing, you are interested in full documents, PDFs?

If so, how do you extract the text from the PDF?

Best

Lee


mekarpeles commented 8 years ago

@cleegiles as an aside, the Internet Archive does OCR on any PDFs uploaded to them (I think this was @nthmost's plan; however, I'm interested in hearing your opinions). I'm sure many institutions would benefit from more contributions towards a more proficient library for extracting text from PDFs.

Ideally, in the future, the .tex version of the paper will be available via something like github... But as frustrating as that is, I'll keep it contained as a separate issue :)

davidar commented 8 years ago

Ideally, in the future, the .tex version of the paper will be available via something like github...

That would be amazing; unfortunately, arXiv seems to be the only one distributing TeX sources currently.

jbenet commented 8 years ago

Love what's going on in this thread

@davidar can't wait for a full offline-friendly experience of arxiv with ipfs+TeX.js :)

cleegiles commented 8 years ago

PDF to text extraction for "quality" information is still an open question. I would guess that IA is using PDFBox, a reasonable selection.

It's very important to know which tool is being used for PDF to text extraction. There are many available. There are companies that make a living on their modifications of existing software or on creating their own, e.g. gonitro.com. I would put their proprietary software at the state of the art, in comparison with Google's. Many scientists are very concerned with how data can be extracted from PDFs, since PDFs are the only place some data exists in digitizable form - odd, isn't it?

Most open-source extractors are OK but not of high quality; this is one reason for CiteSeerX's extraction flaws. We use either PDFBox or PDFlib TET (not open source). Google's is extremely good! Not released yet.

AI2, I've been told, is about to release a very good converter. We haven't seen it, but many of the tools they've released so far have been excellent - we use them.

Interestingly, surprisingly few scholarly papers are written in .tex; many are in .doc or .docx, especially in medicine, engineering, and sciences outside of computer science and physics.


nthmost commented 8 years ago

@cleegiles my work's not focused on doing OCR, as the way PDFs are actually produced and used in academic papers is pretty far from standardized; I'd rather do as @mekarpeles has suggested and upload PDFs to the Archive to be OCRed there.

That said, at my last job, we were able to use pdfminer (a Python library) to good effect to turn medical genetics papers (in English) into machine-indexable text. We built indexes over these texts and mapped mentions of important genetics concepts back to their pubmed IDs, so that medical concepts (referenced in the NIH medgen database) could be mapped to pubmed citation evidence.
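
For anyone who wants to reproduce that step today, the maintained pdfminer.six fork exposes a one-call helper (the file path is a placeholder; the pdfminer API we used back then was lower-level than this):

```python
# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("example-paper.pdf")  # placeholder path

# Crude term set for a toy index; the real pipeline did concept tagging too.
terms = {w.strip(".,;:()").lower() for w in text.split() if len(w) > 3}
print(len(terms), "unique terms extracted")
```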

nemobis commented 2 years ago

Sorry for being 6+ years late to the party. ;)

Reaching out to the repository admins to encourage them to expose their metadata correctly

COAR focuses on this part of the puzzle quite a bit. They have various ongoing initiatives like https://www.coar-repositories.org/news-updates/ccsd-and-coar-announce-plans-to-launch-preprint-directory/ .

extract cleaner metadata from HTML pages

https://scholar.archive.org/ is arguably doing just this: all the PDFs and other full text sources the Internet Archive finds with its scanning and crawling efforts get mined for academic works to index. Everyone go contribute! https://github.com/internetarchive/fatcat

mekarpeles commented 2 years ago

@nemobis +1! And to everyone else who continues to tirelessly further the space. I know @bnewbold et al. have leveraged the amazing work of others in the community to build fatcat into another great resource. Proud to watch these efforts mature and grateful for everyone's work.