mekarpeles opened this issue 8 years ago
Save yourself some work. Took me a week to gather and clean them all, back in August :)
@Jurnie how many sources does this include? Is there a list we can enumerate?
BASE has around 85,000,000 documents and ~4,000 sources (we're trying to find the ones which are missing). You can browse the list here: http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2
cc: @pietsch
Everything that isn't a journal. Just look at the URL path.
@Jurnie, sorry, which URL path? I am trying to see a list of which sources (institutions) JURN has covered, how many sources are included in total, and how many documents are available.
Is this the list of sources? http://www.jurn.org/jurn-listoftitles.pdf or this http://www.jurn.org/directory/?
Thanks for your help
Ah, instructions required :) Ok, forget about JURN - I'm not giving you that. I'm giving you the GRAFT list. Go to the GRAFT link URL, the one that runs the search. MouseOver it. See that URL path, pointing to the HTML source that Google is using to power the on-the-fly CSE? Copy and load it. Right-click, 'View page source'.
Here's a group test of GRAFT, running it against the other public repository search tools. Albeit on a very hard search, so not many results for any of them.
Is this the link you're talking about? https://cse.google.com/tools/makecse?url=http%3A%2F%2Fwww.jurn.org%2Fgraft%2Findex4.html
Nearly. If this were a simple little test, re: if I should join your project or not, you wouldn't be doing very well at this point :) http://www.jurn.org/graft/index4.html and right-click, View source. All known repository URLs A-Z (bar archive.org and a couple of other mega-positories that would clutter results), up-to-date and thoroughly cleaned. Enjoy.
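In case it helps anyone else following that step, here is a rough sketch of doing the extraction with a script instead of by hand. It assumes the repository addresses appear in the page source of http://www.jurn.org/graft/index4.html as full http(s) URLs; if they are listed as bare patterns, the regex would need loosening.

```python
# Sketch: pull the repository URL list out of the GRAFT page source.
import re
import urllib.request

GRAFT_URL = "http://www.jurn.org/graft/index4.html"

with urllib.request.urlopen(GRAFT_URL) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Grab anything in the source that looks like a URL.
candidates = re.findall(r"https?://[^\s\"'<>]+", html)

# De-duplicate while preserving the A-Z order of the page.
seen, repositories = set(), []
for url in candidates:
    if url not in seen:
        seen.add(url)
        repositories.append(url)

with open("graft_repositories.txt", "w") as out:
    out.write("\n".join(repositories))

print(f"Extracted {len(repositories)} repository URLs")
```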
@Jurnie thanks, that worked. This is a great list, thanks for your efforts. It looks like there's just over 4,000 sources here. Would it be helpful for us to check this against BASE to see if there are any missing sources for you to add?
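One low-tech way to do that check would be by hostname: reduce both lists to registered domains and diff the sets. A sketch, assuming we have the GRAFT list from the extraction above and a `base_sources.txt` exported from the BASE sources page (both filenames are hypothetical):

```python
# Sketch: compare two source lists by hostname to find repositories
# present in GRAFT but missing from BASE, and vice versa.
from urllib.parse import urlparse

def hosts(path):
    """Load a file of URLs and return the set of hostnames (www. stripped)."""
    result = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            host = urlparse(line).netloc.lower()
            result.add(host.removeprefix("www."))
    return result

graft = hosts("graft_repositories.txt")   # from the sketch above
base = hosts("base_sources.txt")          # hypothetical export of BASE's source URLs

print("In GRAFT but not BASE:", len(graft - base))
print("In BASE but not GRAFT:", len(base - graft))
```

Matching by hostname will miss sources that share a domain, but it should be good enough for a first pass.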
Of the 85M, most seem to be metadata records rather than full-text documents.
Does anyone know how many full-text documents there are?
@cleegiles Detecting which records correspond to full texts is a very interesting challenge (beyond the existing classification based solely on the metadata itself). Could the CiteSeerX crawler do that? Basically using the URLs stored in BASE as seed list, I guess?
It's probably not that hard. The crawler just looks for pdf files and hopefully associated metadata only on those sites and nowhere else. Some sites prohibit crawling, however, with their robots.txt.
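To make that concrete, here is a minimal sketch of that kind of crawl step (an illustration of the logic only, not how the CiteSeerX crawler actually works): check robots.txt, keep the seed URL if it already serves a PDF, otherwise scan the landing page for .pdf links.

```python
# Sketch: given BASE seed URLs, find PDFs while respecting robots.txt.
import re
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "pdf-seed-crawler-sketch/0.1"  # placeholder agent name

def allowed(url):
    """Check robots.txt before fetching anything from the host."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable: assume allowed (a policy choice)
    return rp.can_fetch(USER_AGENT, url)

def pdf_links(url):
    """Return PDF URLs reachable from a seed: the seed itself if it serves a
    PDF, otherwise any .pdf links found on the landing page."""
    if not allowed(url):
        return []
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        if "application/pdf" in resp.headers.get("Content-Type", ""):
            return [url]
        html = resp.read().decode("utf-8", errors="replace")
    links = re.findall(r'href=["\']([^"\']+\.pdf)["\']', html, re.IGNORECASE)
    return [urljoin(url, link) for link in links]
```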
If we can get a list of the URLs, we and AI2's Semantic Scholar will crawl for PDFs.
How do we go about getting it?
Best
Lee
That would be awesome! @cleegiles, I recommend getting in touch officially with BASE via their contact form to request their data. @pietsch, what do you think? I am happy to help with generating the list, with your permission of course.
@cleegiles, I have no idea how your pipeline works, but ideally it would be good if you could keep track of the relation between BASE's metadata and each PDF you download. The reason is that in my experience, BASE's metadata is cleaner than what you can extract from the PDF using heuristics.
BASE already includes CiteSeerX metadata, so of course we need to filter out these records first.
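In practice, keeping that relation could be as simple as writing one manifest line per download, keyed by BASE's record identifier, and skipping CiteSeerX-sourced records at the same time. A sketch under the assumption that each record exposes an identifier, a collection name, and a URL (the field names below are illustrative, not BASE's actual schema):

```python
# Sketch: download PDFs while recording which BASE record each file came from,
# skipping records that originate from CiteSeerX to avoid re-crawling them.
import csv
import hashlib
import urllib.request

def harvest(records, manifest_path="manifest.csv"):
    """records: iterable of dicts with 'id', 'collection', 'url' keys
    (illustrative field names, not BASE's actual schema)."""
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["base_record_id", "url", "local_file"])
        for rec in records:
            if "citeseerx" in rec["collection"].lower():
                continue  # already covered by CiteSeerX itself
            local = hashlib.sha1(rec["url"].encode()).hexdigest() + ".pdf"
            urllib.request.urlretrieve(rec["url"], local)
            writer.writerow([rec["id"], rec["url"], local])
```

With the manifest in place, the clean BASE metadata can always be joined back onto whatever gets extracted from the PDFs later.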
Hi @cleegiles, if all you need is a list of the URLs in BASE, there is no need to use the contact form. As you can see in https://github.com/ipfs/archives/issues/3, BASE has already released a data dump containing all URLs via IPFS. Unfortunately, all IPFS copies of this dump were destroyed. BASE is preparing a fresh, larger dump right now. It will be available in about a week. You can either wait for it, or I can prepare a list containing URLs only.
Does each URL point to a unique document? Does it point directly to a PDF?
How many URLs are there? 85M?
@cleegiles Each of these URLs belongs to one document. There may be duplicates if authors uploaded a document to several repositories. Many URLs point to an HTML landing page, others point to a PDF document, a few point to other document types. Based on a fresh dump, it should be 87M URLs. /cc @wetneb @davidar
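Because those duplicates live under different URLs, URL-level de-duplication won't catch them. One cheap post-download check is to hash the fetched files and keep a single copy per digest; this only catches byte-identical copies, not different renderings of the same paper, but it is essentially free.

```python
# Sketch: collapse byte-identical PDFs downloaded from different repositories.
import hashlib
from pathlib import Path

def dedupe(pdf_dir):
    seen = {}
    for path in Path(pdf_dir).glob("*.pdf"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()          # drop the duplicate copy
        else:
            seen[digest] = path
    return seen  # digest -> surviving file
```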
Size will be about 100G or less?
I assume it will be compressed?
I think I recall it being <275GB?
Uncompressed?
If necessary, we can put it on our Amazon storage.
I believe it's compressed. @pietsch?
@cleegiles @mekarpeles As you can see in https://github.com/ipfs/archives/issues/3, the previous dump was 23 GB (gzipped XML, 79M records). So the new dump will still be smaller than 30 GB (compressed). Of course, if you just need a list of URLs, then the file will be tiny in comparison.
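Once the new dump is out, pulling just the URLs out of the gzipped XML can be done as a streaming pass, so the whole file never needs to be decompressed to disk. A sketch assuming OAI-style Dublin Core records where the link lives in dc:identifier (the exact element name in the actual dump may differ):

```python
# Sketch: stream a gzipped XML dump and write one URL per line.
import gzip
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"  # assumed element

def extract_urls(dump_path, out_path):
    with gzip.open(dump_path, "rb") as dump, open(out_path, "w") as out:
        # iterparse keeps memory flat even for tens of millions of records
        for _, elem in ET.iterparse(dump, events=("end",)):
            if elem.tag == DC_IDENTIFIER and (elem.text or "").startswith("http"):
                out.write(elem.text.strip() + "\n")
            elem.clear()  # release the element once processed

extract_urls("base_dump.xml.gz", "base_urls.txt")
```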
@cleegiles I re-sent an invitation for you to join the Archive Labs Slack channel -- several of us chat there about OpenJournal in the #scholar channel.
Do we just ftp or rsync to download?
BASE, openarchives, and others have a listing of their "sources". I plan to write a script which aggregates all of these into a single list.
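Roughly, the script would fetch each service's source listing, normalise everything to hostnames, and merge into one de-duplicated file. The per-service fetchers below are stubs, since each real listing needs its own parser (HTML for BASE's sources page, XML for the OpenArchives data-provider registry, and so on):

```python
# Sketch: merge source listings from several aggregators into one list.
from urllib.parse import urlparse

def fetch_base_sources():
    """Placeholder: parse http://www.base-search.net/about/en/about_sources_date_dn.php"""
    return []

def fetch_openarchives_sources():
    """Placeholder: parse the OpenArchives data-provider registry."""
    return []

def fetch_graft_sources():
    """Placeholder: reuse the GRAFT extraction sketch earlier in this thread."""
    return []

def aggregate(out_path="all_sources.txt"):
    merged = set()
    for fetch in (fetch_base_sources, fetch_openarchives_sources, fetch_graft_sources):
        for url in fetch():
            merged.add(urlparse(url).netloc.lower().removeprefix("www."))
    with open(out_path, "w") as f:
        f.write("\n".join(sorted(merged)))

aggregate()
```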