The score value used in #74 is 60. Why? Because it works well (no false positives, few false negatives) after extensive testing. It is an empirical value, so treat it with caution.
citation_list is required by XML crossref for modern (<= 2 years) submissions. It's a list of the references that the publication is citing. We will extract it either from the tex + bib files, or from the PDF.
Resolving Citations (we don’t need no stinkin’ parser) - Crossref
From tex + bib:
Parse the tex file for \cite{a_ref}, then extract a_ref from the bib file. The only minor issue is that we might pick up \cite commands that have been commented out. A workaround would be to first remove all comments from the tex file, see arxiv-latex-cleaner.
From pdf:
This is hard.
GitHub - CeON/CERMINE: Content ExtRactor and MINEr is recommended by Crossref and used in production by OpenAIRE.
It is written in Java: download the .jar from GitHub (tested with 1.13), and put WaveletArticle.pdf in /tmp/pdfs (for example; the script recursively searches for PDFs in the input folder and its subfolders).
This results in the file WaveletArticle.cermxml, which honestly does a better job than anystyle (a parser, see below). The article title is usually in the field article-title, but sometimes the parsing fails and the title ends up in source.
The next step is to use the REST API from Crossref. The API is public, so we can start working on it even before being Crossref members (on schedule). With the output of the cermine parsing, we use a free-text query to the Crossref REST API. See GitHub - CrossRef/rest-api-doc: Documentation for Crossref's REST API
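As a rough sketch of how the cermine output could be turned into strings for that free-text query: the snippet below assumes Python and assumes that each reference in the .cermxml sits inside a ref element (worth checking against a real output file). It takes article-title, falls back to source when the title is missing, and also keeps the whole reference text. The name reference_queries is made up for illustration.

```python
import xml.etree.ElementTree as ET


def _squash(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def reference_queries(cermxml_text):
    """Return one dict per <ref> element: a title guess plus the full text.

    Assumption: references are stored as <ref> elements that may contain
    <article-title> and/or <source> children somewhere below them.
    """
    root = ET.fromstring(cermxml_text)
    queries = []
    for ref in root.iter("ref"):
        # Prefer article-title; fall back to source, where the title
        # sometimes ends up when the parsing goes wrong.
        title_el = ref.find(".//article-title")
        if title_el is None:
            title_el = ref.find(".//source")
        title = None
        if title_el is not None:
            title = _squash("".join(title_el.itertext()))
        # The whole reference as plain text, usable as a free-text query.
        full_text = _squash("".join(ref.itertext()))
        queries.append({"title": title, "text": full_text})
    return queries


# Hypothetical usage (the path is an assumption about where the output lands):
# with open("/tmp/pdfs/WaveletArticle.cermxml", encoding="utf-8") as f:
#     for q in reference_queries(f.read()):
#         print(q["title"], "|", q["text"])
```

Keeping the full reference text next to the title guess leaves room to query with either one, depending on which gives better scores.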
For a general query, use query.bibliographic under the works route (/works). Please notice that some characters need to be escaped in the URL. See HTML URL Encoding Reference for a reference.
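A minimal sketch of the escaping, assuming Python: urlencode from the standard library percent-encodes the reserved characters in the citation text. The helper name build_query_url and the default of 3 rows are choices made for this illustration; rows is the Crossref parameter that limits the number of returned items.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation_text, rows=3):
    """Build a /works URL with an escaped query.bibliographic parameter."""
    params = {"query.bibliographic": citation_text, "rows": rows}
    # urlencode escapes characters such as '&' and ':' and turns spaces
    # into '+', so the citation text cannot break the query string.
    return CROSSREF_WORKS + "?" + urlencode(params)
```

Only the top few items matter for getting a DOI, so a small rows value keeps the response short.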
Example:
This will always give you results(!). Check the score value of each item.
See below a simplified response.
Check the status of the response, then check that the top item (items are ordered by score) has a reasonable score value (TODO: which one?). The objective of this query is to get a DOI.
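These checks could look roughly like the following, assuming Python and assuming the usual shape of a Crossref /works response (a top-level status, and the results under message.items, each with DOI and score). The function name best_doi is hypothetical, and the default threshold of 60 is only the empirical value mentioned for #74, not a recommendation.

```python
import json


def best_doi(response_text, min_score=60.0):
    """Return the DOI of the best-scoring item, or None if nothing is usable."""
    payload = json.loads(response_text)
    # The API reports its own status in the body; anything but "ok" is a failure.
    if payload.get("status") != "ok":
        return None
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    # Items should already be ordered by score; max() does not rely on that.
    top = max(items, key=lambda item: item.get("score", 0.0))
    # The query always returns something, so a low score means "no match".
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")
```

Returning None for a low score makes the "no confident match" case explicit, so those references can be reviewed by hand.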
Then, use the Crossref DOI content negotiation to get that publication's metadata in whatever format you want. See DOI Content Negotiation for options.
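A sketch of such a request, assuming Python and the doi.org resolver as the entry point. The format is chosen through the Accept header. The helper names are made up for this example, and fetch_metadata needs network access.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def content_negotiation_request(doi, accept):
    """Prepare (but do not send) a content negotiation request for a DOI."""
    # Keep '/' unescaped: it separates the DOI prefix from the suffix.
    url = "https://doi.org/" + quote(doi, safe="/")
    return Request(url, headers={"Accept": accept})


def fetch_metadata(doi, accept, timeout=30):
    """Send the request and return the response body as text."""
    request = content_negotiation_request(doi, accept)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical usage, with the Crossref XML media type:
# xml_text = fetch_metadata(doi, "application/vnd.crossref.unixref+xml")
```

Building the request separately from sending it makes the URL and header easy to inspect without touching the network.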
For XML crossref use:
"Accept: application/vnd.crossref.unixref+xml"
Discarded solutions
anystyle (pdf parser)
From SO: Is it possible to extract the bibliography from a PDF file as a .bibtex? - TeX - LaTeX Stack Exchange
The answers point to anystyle (ruby): GitHub - inukshuk/anystyle: Fast and smart citation reference parsing
gem install anystyle-cli rexml
anystyle find article.pdf
This returns JSON. The result seems pretty bad when testing it with my own IJ article: GitHub - phcerdan/InsightJournal-IsotropicWavelets: Template of Technical Report to be submitted to the Insight Journal