OpenJournal / central

Universalizing Open-Access Journals & Papers
Creative Commons Zero v1.0 Universal

Script/endpoint to aggregate coverage of sources across aggregators #9

Open mekarpeles opened 8 years ago

mekarpeles commented 8 years ago

BASE, openarchives, and others have a listing of their "sources". I plan to write a script which aggregates all of these into a single list.
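
A minimal sketch of what such an aggregation script might look like, assuming the OpenArchives ListFriends endpoint still serves an XML list of OAI-PMH base URLs and that BASE's public listing page can be scraped for links; both the page structure and the parsing are assumptions, not documented APIs:

```python
# Sketch: merge "source" listings from a couple of aggregators into one list.
import re
import urllib.request
import xml.etree.ElementTree as ET

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def openarchives_sources():
    # ListFriends is assumed to return <BaseURLs><baseURL>...</baseURL>...</BaseURLs>.
    doc = ET.fromstring(fetch("http://www.openarchives.org/Register/ListFriends"))
    return {el.text.strip() for el in doc if el.text}

def base_sources():
    # Crude scrape of the hyperlinks on BASE's source listing page; the page is
    # paginated, so a real script would walk every page.
    html = fetch("http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2")
    return set(re.findall(r'href="(https?://[^"]+)"', html))

if __name__ == "__main__":
    combined = sorted(openarchives_sources() | base_sources())
    print(len(combined), "candidate sources")
```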

Jurnie commented 8 years ago

Save yourself some work. Took me a week to gather and clean them all, back in August :)

mekarpeles commented 8 years ago

@Jurnie how many sources does this include? Is there a list we can enumerate?

BASE has around 85,000,000 documents and ~4,000 sources (we're trying to find the ones which are missing). You can browse the list here: http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2

cc: @pietsch

Jurnie commented 8 years ago

Everything that isn't a journal. Just look at the URL path.

mekarpeles commented 8 years ago

@Jurnie, sorry, which URL path? I am trying to see a list of which sources (institutions) JURN covers, the total number of sources included, and how many documents are available.

Is this the list of sources? http://www.jurn.org/jurn-listoftitles.pdf or this http://www.jurn.org/directory/?

Thanks for your help

Jurnie commented 8 years ago

Ah, instructions required :) OK, forget about JURN - I'm not giving you that. I'm giving you the GRAFT list. Go to the GRAFT link URL, the one that runs the search, and mouse over it. See that URL path, pointing to the HTML source that Google is using to power the on-the-fly CSE? Copy it, load it, then right-click and choose 'View page source'.

Jurnie commented 8 years ago

Here's a group test of GRAFT, running it against the other public repository search tools, albeit on a very hard search, so there aren't many results for any of them.

mekarpeles commented 8 years ago

Is this the link you're talking about? https://cse.google.com/tools/makecse?url=http%3A%2F%2Fwww.jurn.org%2Fgraft%2Findex4.html

Jurnie commented 8 years ago

Nearly. If this were a simple little test of whether I should join your project or not, you wouldn't be doing very well at this point :) Load http://www.jurn.org/graft/index4.html and right-click, 'View page source'. All known repository URLs, A-Z (bar archive.org and a couple of other mega-repositories that would clutter results), up to date and thoroughly cleaned. Enjoy.
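
For anyone following along, a rough sketch of pulling the repository URL patterns out of that page programmatically; the regex and the page's markup are assumptions, so the output would need hand-checking:

```python
# Sketch: download the GRAFT CSE page and list everything that looks like a site URL.
import re
import urllib.request

with urllib.request.urlopen("http://www.jurn.org/graft/index4.html", timeout=30) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Grab URL-ish strings (with or without a scheme), dedupe, and sort.
pattern = r'(?:https?://)?[\w.-]+\.[a-z]{2,}(?:/[^\s"<>]*)?'
candidates = sorted(set(re.findall(pattern, html, flags=re.IGNORECASE)))
print(len(candidates), "candidate repository URL patterns")
```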

mekarpeles commented 8 years ago

@Jurnie thanks, that worked. This is a great list; we appreciate your efforts. It looks like there are just over 4,000 sources here. Would it be helpful for us to check this against BASE to see if there are any missing sources for you to add?
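
One way such a comparison could work, as a sketch: reduce both lists to hostnames and diff the sets. The file names are hypothetical placeholders for one-URL-per-line exports:

```python
# Sketch: find repositories present in one list but missing from the other.
from urllib.parse import urlparse

def hostnames(path):
    # Assumes each line is a full URL including its scheme.
    with open(path) as fh:
        return {urlparse(line.strip()).netloc.lower().removeprefix("www.")
                for line in fh if line.strip()}

graft = hostnames("graft_urls.txt")   # hypothetical export of the GRAFT list
base = hostnames("base_urls.txt")     # hypothetical export of BASE's source URLs
print("In GRAFT but not in BASE:", len(graft - base))
print("In BASE but not in GRAFT:", len(base - graft))
```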

cleegiles commented 8 years ago

Of the 85M, most seem to be metadata records rather than full-text documents.

Does anyone know how many full-text documents there are?

wetneb commented 8 years ago

@cleegiles Detecting which records correspond to full texts is a very interesting challenge (beyond the existing classification based solely on the metadata itself). Could the CiteSeerX crawler do that? Basically using the URLs stored in BASE as a seed list, I guess?

cleegiles commented 8 years ago

It's probably not that hard. The crawler just looks for PDF files (and, hopefully, associated metadata) on those sites and nowhere else. Some sites do prohibit crawling via their robots.txt, however.
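
As an illustration of that kind of polite fetch (a sketch, not the CiteSeerX crawler itself; the user-agent string and example URL are placeholders):

```python
# Sketch: honour robots.txt, then keep a URL only if it actually serves a PDF.
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

AGENT = "oa-pdf-probe"  # placeholder user-agent

def allowed(url):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # be conservative if robots.txt cannot be read
    return rp.can_fetch(AGENT, url)

def is_pdf(url):
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.headers.get("Content-Type", "").startswith("application/pdf")

if __name__ == "__main__":
    url = "https://example.org/paper.pdf"  # placeholder
    if allowed(url) and is_pdf(url):
        print("would fetch", url)
```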

cleegiles commented 8 years ago

If we can get a list of the URLs, we and AI2's Semantic Scholar will crawl for PDFs.

How do we go about getting it?

Best

Lee

wetneb commented 8 years ago

That would be awesome! @cleegiles, I recommend getting in touch officially with BASE via their contact form to request their data. @pietsch, what do you think? I am happy to help with generating the list, with your permission of course.

@cleegiles, I have no idea how your pipeline works, but ideally it would be good if you could keep track of the relation between BASE's metadata and each PDF you download. The reason is that in my experience, BASE's metadata is cleaner than what you can extract from the PDF using heuristics.

BASE already includes CiteSeerX metadata, so of course we need to filter out these records first.
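
A minimal sketch of that bookkeeping, assuming a line-per-record JSON export with hypothetical `identifier`, `collection`, and `url` fields (BASE's real schema may differ):

```python
# Sketch: build a download manifest that ties each PDF URL back to its BASE record,
# skipping records that BASE itself harvested from CiteSeerX.
import csv
import json

def plan_downloads(records_path, manifest_path):
    with open(records_path) as src, open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["base_record_id", "url"])
        for line in src:                      # one JSON record per line (assumed)
            rec = json.loads(line)
            if "citeseerx" in rec.get("collection", "").lower():
                continue                      # avoid re-crawling CiteSeerX's own records
            writer.writerow([rec["identifier"], rec["url"]])

# Usage (hypothetical file names):
# plan_downloads("base_records.jsonl", "download_manifest.csv")
```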

pietsch commented 8 years ago

Hi @cleegiles, if all you need is a list of the URLs in BASE, there is no need to use the contact form. As you can see in https://github.com/ipfs/archives/issues/3, BASE has already released a data dump containing all URLs via IPFS. Unfortunately, all IPFS copies of this dump were destroyed. BASE is preparing a fresh, larger dump right now. It will be available in about a week. You can either wait for it, or I can prepare a list containing URLs only.
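
If a URL-only list is all that's needed, something like this could stream it out of a gzipped XML dump without loading the whole file; the `identifier` element name and single-file layout are assumptions about the dump format:

```python
# Sketch: stream a gzipped XML dump and write out every URL-looking identifier.
import gzip
import xml.etree.ElementTree as ET

def extract_urls(dump_path, out_path):
    with gzip.open(dump_path, "rb") as src, open(out_path, "w") as out:
        for _, elem in ET.iterparse(src, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
            if tag == "identifier" and elem.text and elem.text.startswith("http"):
                out.write(elem.text.strip() + "\n")
            elem.clear()                        # keep memory flat on a 20+ GB dump

# Usage (hypothetical file names):
# extract_urls("base_dump.xml.gz", "base_urls.txt")
```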

cleegiles commented 8 years ago

Does each URL point to a unique document? Does it point directly to a PDF?

How many URLs are there? 85M?

pietsch commented 8 years ago

@cleegiles Each of these URLs belongs to one document. There may be duplicates if authors uploaded a document to several repositories. Many URLs point to an HTML landing page, others point to a PDF document, a few point to other document types. Based on a fresh dump, it should be 87M URLs. /cc @wetneb @davidar

cleegiles commented 8 years ago

Will the size be about 100 GB or less?

I assume it will be compressed?

mekarpeles commented 8 years ago

I think I recall it being <275GB?

cleegiles commented 8 years ago

Uncompressed?

If necessary, we can put it on our Amazon storage.

mekarpeles commented 8 years ago

I believe it's compressed. @pietsch?

pietsch commented 8 years ago

@cleegiles @mekarpeles As you can see in https://github.com/ipfs/archives/issues/3, the previous dump was 23 GB (gzipped XML, 79M records). So the new dump will still be smaller than 30 GB (compressed). Of course, if you just need a list of URLs, then the file will be tiny in comparison.
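
As a rough sanity check (an illustration, not an official figure), scaling the earlier dump linearly by record count lands comfortably under 30 GB:

```python
# Linear scaling of the earlier dump (23 GB gzipped, 79M records) to 87M records.
old_gb, old_records, new_records = 23, 79e6, 87e6
print(f"~{old_gb * new_records / old_records:.1f} GB compressed")  # ~25.3 GB
```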

mekarpeles commented 8 years ago

@cleegiles I re-sent you an invitation to join the Archive Labs Slack channel -- several of us chat there about OpenJournal in the #scholar channel.

cleegiles commented 8 years ago

Do we just ftp or rsync to download?
