dhimmel opened 6 years ago
I created a repository for processing the OpenCitations figshare datasets: greenelab/opencitations. From the 2017-07-25 release (specifically the corpus_id and corpus_br components), I created a TSV of DOI-to-DOI citations as proposed above. It's available from the file citations-doi.tsv.xz.
Here are the stats we generated for this dataset:
- 7,574,387 total DOI-to-DOI citations
- 203,264 DOIs with outgoing DOI citations
- 3,946,611 DOIs with incoming DOI citations
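For reference, stats like these can be computed from the TSV with a short script. This is only a sketch: it assumes a two-column tab-separated layout (citing DOI, then cited DOI) with a header row, which may not match the actual file.

```python
import csv
import lzma

def citation_stats(path):
    """Count total citations and distinct citing/cited DOIs in a TSV.

    Assumes two tab-separated columns (citing DOI, cited DOI) with a
    header row, compressed with xz. The column layout is an assumption.
    """
    citing, cited = set(), set()
    total = 0
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row
        for source_doi, target_doi in reader:
            total += 1
            citing.add(source_doi)
            cited.add(target_doi)
    return total, len(citing), len(cited)
```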
I was surprised that references are only available for ~200,000 articles. Why is this number so low? Does Crossref possess references for more articles (which they now return via their API) or is Crossref a downstream user of OpenCitations?
Also, I don't see the purpose of using Disk ARchive on the data exports. The figshare files are already zipped, so what does this extra archiving step accomplish, other than creating a dependency on the antiquated dar program?
Hi @dhimmel
Thanks for this. I think it is incredibly useful indeed. I've already tweeted about it from the OpenCitations Twitter account:
https://twitter.com/opencitations/status/900609593998544896
In the coming months, after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository, if you are fine with it, so as to release this information on a monthly basis, as highlighted in this issue. What do you think?
Coming to your questions:
200,000 is the number of citing articles that have been processed so far; they are the ones contained in the PubMed Central Open Access datasets. Crossref does now contain a larger number of reference lists (thanks to I4OC), but we want to wait for the new OpenCitations infrastructure before starting to gather information from there as well. Currently we use the Crossref API to retrieve additional metadata about all the citing/cited articles, in particular: title, subtitle, identifiers (e.g. DOI, ISSN, ISBN, URL, and Crossref member URL), author list, publisher, container resources (issue, volume, journal, book, etc.), publication year, and pages. In addition, we also use their API to disambiguate bibliographic resources and agents by means of the identifiers retrieved.
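As an illustration (not the actual OpenCitations ingestion code), per-DOI metadata of this kind is available from the public Crossref REST API at `https://api.crossref.org/works/{doi}`; the field selection in `summarize` below is a sketch of extracting a few of the fields just listed.

```python
import json
import urllib.request

def fetch_crossref_message(doi):
    """Fetch the Crossref metadata record ('message') for one DOI."""
    url = f"https://api.crossref.org/works/{doi}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]

def summarize(message):
    """Extract a few of the fields mentioned above from a Crossref record."""
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    return {
        "title": (message.get("title") or [None])[0],
        "ISSN": message.get("ISSN"),
        "publisher": message.get("publisher"),
        "year": date_parts[0][0],
        "authors": [
            f"{author.get('given', '')} {author.get('family', '')}".strip()
            for author in message.get("author", [])
        ],
    }
```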
The use of DAR as a packaging mechanism is very useful for backups, since it also allows us to implement a daily incremental backup. However, I see the issue in terms of accessibility. To this end, we plan to also expose monthly dumps in N-Quads format (see #16). To date, we have only experimented with this by publishing on Figshare (https://doi.org/10.6084/m9.figshare.5147068) the zipped N-Quads version of the full corpus from the April 2017 dump. When the new infrastructure is up and running, some changes may be possible. I am not sure we will abandon DAR, though, since it works quite well for addressing the incremental backup issue. But this is something that we should discuss in the coming months.
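For what it's worth, a citation edge list can be pulled out of such an N-Quads dump with a small filter on the cito:cites predicate. This is a sketch only: it assumes simple quad lines whose subject, predicate, and object are all IRIs, and that citations use the CiTO `cites` predicate.

```python
import re

# Predicate used for citation links in the OpenCitations Corpus (CiTO).
CITES = "<http://purl.org/spar/cito/cites>"
# Matches the subject, predicate, and object IRIs of a simple quad line.
QUAD = re.compile(r"^(<[^>]+>)\s+(<[^>]+>)\s+(<[^>]+>)")

def citation_pairs(lines):
    """Yield (citing, cited) IRI pairs from N-Quads lines using cito:cites."""
    for line in lines:
        match = QUAD.match(line)
        if match and match.group(2) == CITES:
            # Strip the surrounding angle brackets from each IRI.
            yield match.group(1)[1:-1], match.group(3)[1:-1]
```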
> after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository
That would be great! greenelab/opencitations is released under CC0, so it can be used anywhere. When initially copying the code over, I'd appreciate it if you set the git commit author to:
--author="Daniel Himmelstein <daniel.himmelstein@gmail.com>"
See also Issue 7 about incorrect DOIs.
Greetings, a while ago I opened issue https://github.com/essepuntato/opencitations/issues/1 about downloading the OpenCitations network. Great to see that http://opencitations.net/download is now available! Congrats on the milestone.
At the moment, I'm looking for a minimalist encoding of the DOI citation network. The most basic format I can think of would be tabular, like:

10.1371/journal.pcbi.1004259	10.1111/j.2041-210X.2010.00012.x

The first row indicates that 10.1371/journal.pcbi.1004259 cites 10.1111/j.2041-210X.2010.00012.x. Do the OpenCitations downloads easily expose the DOI citation network? Is this table something you would consider adding to the OpenCitations release pipeline? I suspect many users just care about this information and can forgo a lot of complexity.