Sparse data for citation

essepuntato / opencitations

OpenCitations provides in RDF accurate citation information harvested from the scholarly literature.

http://opencitations.net

ISC License

Sparse data for citation #8

Closed moqri closed 7 years ago

moqri commented 7 years ago

I am accessing your citation data through the SPARQL endpoint, and it seems that many journals have only a very limited number of articles with citation information available. Am I missing something?

BTW, I wrote a quick wrapper for your SPARQL endpoint, which might be easier than working with the queries directly:

http://brdb.warrington.ufl.edu/oc/

essepuntato commented 7 years ago

Hi @moqri,

According to today's statistics, we have ingested citations coming from 131,390 articles. Thus, we are indeed still far from full coverage of all the available journals, even if things could change soon. Could I have the SPARQL query you are running in your wrapper, since it seems to be hidden behind the code? Sorry to ask, but I would love to know how you are querying the OCC triplestore.

Currently, the process of ingesting new articles starts from those available in the Open Access Dataset of PubMed Central, while we are preparing the infrastructure to get more citation information from other services. We are currently using Crossref for disambiguating and retrieving metadata about the cited articles. Please refer to the Corpus page on the website if you want a more precise view of how the ingestion workflow works right now.

Thanks again for your interest and work. Have a nice day :-)

moqri commented 7 years ago

Thanks for your quick response, @essepuntato. Your point fully explains why I don't see all the articles for each journal/volume.

I am using a simple Node.js app for the wrapper. I put it here with all the SPARQL queries: https://github.com/moqri/OpenCitations/blob/master/oc/index.js

(I couldn't find an endpoint to your data that returns JSON so I used XML and parsed it manually.)

I would really appreciate it if you could keep us posted here when the whole (14 million?) citation data is ready in RDF. We would love to add/link to your citation data for the collection of articles (107,000 at the moment) in our Business Research Database (http://brdb.warrington.ufl.edu/).

Also, please let me know if I can help with anything Open Science related. I am a big fan and believer :)

Best, Mahdi

gneissone commented 7 years ago

@moqri You can append &format=json to your request. For example: http://opencitations.net/sparql?query=SELECT+*+WHERE+%7B%0A%3Fs+%3Fp+%2210.1097%2Figc.0000000000000609%22+.%0A%3Fs+%3Chttp%3A%2F%2Fwww.essepuntato.it%2F2010%2F06%2Fliteralreification%2FhasLiteralValue%3E+%3Fo+.%0A%09%7D%0A&format=json

I couldn't find it documented, but sure glad it worked!
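
For anyone building the request in code rather than by hand, here is a minimal Python sketch of the same idea. The query text is decoded from the URL above. The helper names `build_sparql_url` and `flatten_bindings` are made up for illustration, and the response layout is an assumption: it presumes the endpoint returns the standard W3C SPARQL JSON results shape (`results.bindings`), which has not been checked against the live service.

```python
from urllib.parse import urlencode

# Endpoint mentioned in this thread.
ENDPOINT = "http://opencitations.net/sparql"

# Query decoded from the example URL above.
QUERY = """SELECT * WHERE {
?s ?p "10.1097/igc.0000000000000609" .
?s <http://www.essepuntato.it/2010/06/literalreification/hasLiteralValue> ?o .
}"""


def build_sparql_url(query, endpoint=ENDPOINT, fmt="json"):
    """Return a GET URL carrying the query plus the format parameter.

    Hypothetical helper: urlencode percent-encodes the query and
    appends format=json as a second parameter.
    """
    return endpoint + "?" + urlencode({"query": query, "format": fmt})


def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of {variable: value} dicts.

    Assumes the W3C SPARQL JSON results layout, where each binding
    cell looks like {"type": ..., "value": ...}.
    """
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


if __name__ == "__main__":
    print(build_sparql_url(QUERY))
```

The returned URL can be fetched with any HTTP client and the decoded JSON passed to `flatten_bindings`, which avoids parsing the XML by hand.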

essepuntato commented 7 years ago

Hi both,

@gneissone, thanks for your answer, you just beat me to it :-)

@moqri, we are using Blazegraph to store all the data. Since all the literals are also indexed, I think you could also use its full-text search feature, which should be faster:

https://wiki.blazegraph.com/wiki/index.php/FullTextSearch
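
As a rough illustration of what such a query could look like, the Python sketch below builds a SPARQL string that uses Blazegraph's `bds:search` predicate. The function names `escape_literal` and `full_text_query` are hypothetical, and whether the public endpoint exposes full-text search is an assumption, so treat this as a starting point rather than a tested recipe.

```python
# Namespace used by Blazegraph's full-text search predicates.
BDS_PREFIX = "PREFIX bds: <http://www.bigdata.com/rdf/search#>"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def full_text_query(term, limit=10):
    """Build a query that matches indexed literals against `term`.

    Hypothetical helper: ?o is bound to matching literals by bds:search,
    then joined back to the triples that carry those literals.
    """
    return (
        BDS_PREFIX + "\n"
        "SELECT ?s ?p ?o WHERE {\n"
        '  ?o bds:search "' + escape_literal(term) + '" .\n'
        "  ?s ?p ?o .\n"
        "}\n"
        "LIMIT " + str(int(limit))
    )


if __name__ == "__main__":
    print(full_text_query("citation"))
```

The resulting string can be sent to the endpoint in the same way as any other SPARQL query.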

Have a nice day :-)

moqri commented 7 years ago

Is there a timeline for how many articles will be added by, say, the end of summer or the end of the year?

It seems to me that at the current speed it might take years to have all the articles with a DOI ingested into your system. Is there any reason or bottleneck that is slowing down your wonderful work?

essepuntato commented 7 years ago

Hi @moqri,

It is quite difficult to give a clear prediction of what you are asking for. Some insights are described in the presentation we have just given at WikiCite 2017 (see https://www.slideshare.net/essepuntato/opencitations), but honestly a clear number is difficult to provide.

The main bottleneck is the current infrastructure, which is a fairly small virtual machine with 4 cores and 12 GB of RAM. However, in October we will upgrade the whole infrastructure, and the ingestion rate should then reach 500,000 new citation links per day. Consider that on the current infrastructure we reach that amount in a month.
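
To put those two rates side by side, here is a small back-of-the-envelope calculation in Python. The 30-day month is an assumption made only for the arithmetic, and the variable names are illustrative.

```python
# Figures quoted above: the same amount, per month now and per day after the upgrade.
CURRENT_LINKS_PER_MONTH = 500_000
PLANNED_LINKS_PER_DAY = 500_000
DAYS_PER_MONTH = 30  # assumption: average month length used for the estimate

# Current daily rate implied by the monthly figure.
current_links_per_day = CURRENT_LINKS_PER_MONTH / DAYS_PER_MONTH

# How many times faster the planned infrastructure would ingest links.
speedup = PLANNED_LINKS_PER_DAY / current_links_per_day

print(f"current rate: about {round(current_links_per_day):,} links/day")
print(f"planned rate: {PLANNED_LINKS_PER_DAY:,} links/day")
print(f"speedup: about {round(speedup)}x")
```

Under that assumption the planned rate works out to roughly thirty times the current one.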