essepuntato / opencitations

OpenCitations provides in RDF accurate citation information harvested from the scholarly literature.
http://opencitations.net
ISC License

Availability of a minimalist DOI citation graph #21

Open dhimmel opened 6 years ago

dhimmel commented 6 years ago

Greetings, a while ago I posed issue https://github.com/essepuntato/opencitations/issues/1 about downloading the OpenCitations network. Great to see http://opencitations.net/download is now available! Congrats on the milestone.

At the moment, I'm looking for a minimalist encoding of the DOI citation network. The most basic format I can think of would be tabular like:

source cited
10.1371/journal.pcbi.1004259 10.1111/j.2041-210X.2010.00012.x
10.1371/journal.pcbi.1004259 10.1002/ana.22609
10.7287/peerj.preprints.3100v1 10.1007/s11192-016-2225-6

The first row indicates that 10.1371/journal.pcbi.1004259 cites 10.1111/j.2041-210X.2010.00012.x.
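
A minimal sketch of how such a table could be read, assuming the columns are tab-separated and the header row is exactly `source` and `cited` (neither is fixed by this proposal). The function name `read_citations` is hypothetical.

```python
import csv
import io

# Sample rows copied from the table above; the tab delimiter is an assumption.
SAMPLE = (
    "source\tcited\n"
    "10.1371/journal.pcbi.1004259\t10.1111/j.2041-210X.2010.00012.x\n"
    "10.1371/journal.pcbi.1004259\t10.1002/ana.22609\n"
    "10.7287/peerj.preprints.3100v1\t10.1007/s11192-016-2225-6\n"
)


def read_citations(handle):
    """Yield (source_doi, cited_doi) pairs from a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # DOIs are case-insensitive, so lowercase them for consistent matching.
        yield row["source"].lower(), row["cited"].lower()


edges = list(read_citations(io.StringIO(SAMPLE)))
for source, cited in edges:
    print(f"{source} cites {cited}")
```

Each pair is one directed edge of the citation graph, so the same loop could feed a graph library or a database import.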

Do the OpenCitations downloads easily expose the DOI citation network? Is this table something you would consider adding to the OpenCitations release pipeline? I suspect many users just care about this information and can forgo lots of complexity.

dhimmel commented 6 years ago

DOI Citation Catalog

I created a repository for processing the OpenCitations figshare datasets: greenelab/opencitations. From the 2017-07-25 release (specifically the corpus_id and corpus_br components), I created a TSV of DOI-to-DOI citations as proposed above. It's available from the file citations-doi.tsv.xz.

Here are the stats we generated for this dataset:

7,574,387 total DOI-to-DOI citations
203,264 DOIs with outgoing DOI citations
3,946,611 DOIs with incoming DOI citations
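
Counts like these could be recomputed from a two-column citation table along the following lines. This is only a sketch, not the code used in greenelab/opencitations: it assumes a tab-separated file with one header row, citing DOI first and cited DOI second, and it assumes duplicate rows should be counted once. The names `citation_stats` and `stats_from_xz` are hypothetical.

```python
import csv
import io
import lzma


def citation_stats(handle):
    """Count distinct citations, citing DOIs, and cited DOIs in a two-column TSV."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    edges = set()
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        edges.add((row[0], row[1]))
    return {
        "citations": len(edges),
        "citing_dois": len({source for source, _ in edges}),
        "cited_dois": len({cited for _, cited in edges}),
    }


def stats_from_xz(path):
    """Same statistics for an xz-compressed TSV, decompressed on the fly."""
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        return citation_stats(handle)


# Small in-memory example using the rows proposed at the top of this issue.
sample = io.StringIO(
    "source\tcited\n"
    "10.1371/journal.pcbi.1004259\t10.1111/j.2041-210X.2010.00012.x\n"
    "10.1371/journal.pcbi.1004259\t10.1002/ana.22609\n"
    "10.7287/peerj.preprints.3100v1\t10.1007/s11192-016-2225-6\n"
)
stats = citation_stats(sample)
print(stats)
```

Holding every edge in a set keeps the sketch short; for a file with millions of rows, a streaming count or a dataframe library may be a better fit.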

I was surprised that references are only available for ~200,000 articles. Why is this number so low? Does Crossref possess references for more articles (which they now return via their API) or is Crossref a downstream user of OpenCitations?

Also, I didn't see the purpose of using Disk ARchive (DAR) on the data exports. The figshare files are already zipped, so what's the purpose of this extra archiving step, which creates a dependency on the antiquated dar program?

essepuntato commented 6 years ago

Hi @dhimmel

Thanks for this. I think it is incredibly useful indeed. I've already tweeted about it on the Twitter OpenCitations account:

https://twitter.com/opencitations/status/900609593998544896

In the coming months, after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository, if you are fine with it, so as to release such information on a monthly basis, as highlighted in this issue. What do you think?

Coming to your questions:

  1. The 200,000 are the citing articles that have been processed so far and that are contained in the PubMed Central Open Access datasets. Crossref does indeed have a larger number of reference lists available now (thanks to I4OC), but we want to wait for the new OpenCitations infrastructure before starting to gather information from there as well. Currently we use the Crossref API to retrieve additional metadata about all the citing/cited articles, in particular: title, subtitle, identifiers (e.g. DOI, ISSN, ISBN, URL, and Crossref member URL), author list, publisher, container resources (issue, volume, journal, book, etc.), publication year, and pages. In addition, we use their API to disambiguate bibliographic resources and agents by means of the identifiers retrieved.

  2. Using DAR as a mechanism for packaging items is very useful for backups, since it also allows us to implement a daily incremental backup. However, I see the issue in terms of accessibility. To this end, we plan to also expose dumps in n-quads format monthly (see #16). To date, we have only experimented with this by publishing the zipped n-quads version of the full corpus from the April 2017 dump on Figshare (https://doi.org/10.6084/m9.figshare.5147068). Once the new infrastructure is up and running, some changes may be possible. I am not sure we will abandon DAR, though, since it works quite well for the incremental backup issue. But this is something that we should discuss in the coming months.

dhimmel commented 6 years ago

after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository

That would be great! greenelab/opencitations is released under CC0, so it can be used anywhere. When initially copying the code over, I'd appreciate it if you set the git commit author to:

--author="Daniel Himmelstein <daniel.himmelstein@gmail.com>"

davidshotton commented 6 years ago

See also Issue 7 about incorrect DOIs.