Add citation_list to each publication

phcerdan commented 3 years ago

citation_list is required by Crossref XML for recent (<= 2 years old) submissions. It is the list of references that the publication cites.

We will extract it from either the tex + bib files or from the PDF.

Resolving Citations (we don’t need no stinkin’ parser) - Crossref

From tex + bib:

a) Parse the tex file for \cite{a_ref}, then extract a_ref from the bib file. The only minor issue is that we might pick up \cite commands that have been commented out. A workaround is to first remove all comments from the tex file, see arxiv-latex-cleaner.
b) Remove unused (non-cited) entries from the bib file, then generate the citation list from the trimmed bib file. See https://tex.stackexchange.com/questions/43276/unused-bibliography-entries-how-to-check-which-entries-were-not-used, among others. This approach uses bibtool.
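The first half of option (a) can be sketched in a few lines of Python. This is a minimal sketch, not the script used in the repository: the names `strip_comments` and `cited_keys` are hypothetical, and the regular expressions are assumptions that cover common \cite variants but not every LaTeX corner case (for example `\\%` or verbatim environments).

```python
import re

# Assumption: matches \cite, \citep, \citet, starred forms, and optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# A % starts a comment unless it is escaped as \%.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def strip_comments(tex):
    """Drop everything after an unescaped % on each line."""
    return "\n".join(COMMENT_RE.sub("", line) for line in tex.splitlines())


def cited_keys(tex):
    """Return the cited bib keys in order of first appearance, without duplicates."""
    keys = []
    for group in CITE_RE.findall(strip_comments(tex)):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys
```

The resulting keys would then be looked up in the bib file. Stripping comments first is what keeps commented-out \cite commands out of the list.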

From pdf:

This is hard.

CERMINE (GitHub - CeON/CERMINE: Content ExtRactor and MINEr) is recommended by Crossref and used in production by OpenAIRE.

CERMINE is written in Java. Download the .jar from GitHub (tested with 1.13) and put WaveletArticle.pdf in /tmp/pdfs (for example; the tool recursively searches for PDFs in the input folder and its subfolders).

java -cp ~/Downloads/cermine-impl-1.13-jar-with-dependencies.jar \
pl.edu.icm.cermine.ContentExtractor -path /tmp/pdfs

This produces the file WaveletArticle.cermxml, which honestly does a better job than anystyle (a parser, see below). The article title is usually in the article-title field, but extraction sometimes fails and the title ends up in source instead.
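Reading the references back out of the .cermxml file could look like the sketch below. It assumes that each reference is a `ref` element containing a `mixed-citation` element, which is how JATS-style output is usually laid out; verify that against a real WaveletArticle.cermxml before relying on it. The function name `reference_strings` is hypothetical.

```python
import xml.etree.ElementTree as ET


def reference_strings(cermxml_text):
    """Return one dict per reference with its title and flattened free text."""
    root = ET.fromstring(cermxml_text)
    refs = []
    for ref in root.iter("ref"):
        citation = ref.find("mixed-citation")  # assumed element name
        if citation is None:
            continue
        # Prefer article-title; fall back to source, where the title sometimes ends up.
        title = citation.findtext("article-title") or citation.findtext("source")
        # Flatten all nested text and collapse whitespace for a free-text query.
        text = " ".join("".join(citation.itertext()).split())
        refs.append({"title": title.strip() if title else None, "text": text})
    return refs
```

The flattened `text` value is the kind of string that can be sent as a free-text query, while `title` is useful for sanity-checking the match afterwards.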

The next step is to use the Crossref REST API. The API is public, so we can start working on it even before being Crossref members (on schedule). With the output of the CERMINE parsing, we send a free-text query to the Crossref REST API. See GitHub - CrossRef/rest-api-doc: Documentation for Crossref's REST API

For a general query, use the query.bibliographic parameter on the /works route.

Note that some characters need to be escaped in the URL. See HTML URL Encoding Reference for details.

Example:

<https://api.crossref.org/works?query.bibliographic=Carberry%2C+Josiah.+%E2%80%9CToward+a+Unified+Theory+of+High-Energy+Metaphysics%3A+Silly+String+Theory.%E2%80%9D+Journal+of+Psychoceramics+5.11+%282008%29%3A+1-3.#>

This query always returns results(!), even for a poor match, so check the score value of each item.
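The escaping does not need to be done by hand. As a minimal sketch, Python's standard `urllib.parse.urlencode` produces the same kind of URL as the example; the helper name `bibliographic_query_url` is hypothetical, and `rows` is added here only to limit the number of returned items.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def bibliographic_query_url(reference, rows=5):
    """Build a /works URL for a free-text reference string, with URL escaping."""
    params = {"query.bibliographic": reference, "rows": rows}
    # urlencode escapes spaces as '+', and commas, colons, quotes, etc. as %XX.
    return CROSSREF_WORKS + "?" + urlencode(params)
```

For example, `bibliographic_query_url("Journal of Psychoceramics 5.11 (2008): 1-3.", rows=1)` escapes the parentheses and the colon in the same way as the URL shown above.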

See below a simplified response.

{
  "status": "ok",
  "message-type": "work-list",
  "message-version": "1.0.0",
  "message": {
    "facets": {},
    "total-results": 1903215,
    "items": [
      {
        "indexed": {
          "date-parts": [[2021, 5, 12]],
          "date-time": "2021-05-12T03:27:09Z",
          "timestamp": 1620790029774
        },
        "update-to": [
          {
            "updated": {
              "date-parts": [[2018, 1, 1]],
              "date-time": "2018-01-01T00:00:00Z",
              "timestamp": 1514764800000
            },
            "DOI": "10.5555/12345678",
            "type": "corrigendum",
            "label": "Corrigendum"
          }
        ],
        "reference-count": 1,
        "publisher": "Society of Psychoceramics",

        "score": 122.59374,
        "issued": { "date-parts": [[2008, 8, 13]] },
        "references-count": 1,
        "journal-issue": {
          "issue": "11",
          "published-online": { "date-parts": [[2008, 2, 29]] },
          "published-print": { "date-parts": [[2008, 2, 29]] }
        },
        "URL": "http://dx.doi.org/10.5555/12345678",
        "DOI": "10.5555/12345678",
        "type": "journal-article",
      },
      #More 
    ],
    "items-per-page": 20,
    "query": { "start-index": 0, "search-terms": null }
  }
}

Check status, then check that the top item, ordered by score, has a reasonable score value (TODO: which threshold?). The objective of this query is to get a DOI.

Then use the Crossref DOI content negotiation to get that publication's metadata in whatever format you want. See DOI Content Negotiation for options.
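These checks can be sketched as a small function over the parsed JSON response. The function name `best_doi` and the `min_score` parameter are hypothetical; the threshold is left to the caller because it is an empirical choice (a later comment in this thread reports using 60).

```python
def best_doi(response, min_score):
    """Return the DOI of the highest-scoring item, or None if nothing is trustworthy."""
    if response.get("status") != "ok":
        return None
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    # Do not rely on the ordering of the response; pick the maximum score explicitly.
    top = max(items, key=lambda item: item.get("score", 0.0))
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")
```

Returning None for a low score matters because the query returns items even when the reference has no real match in Crossref.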

curl -LH "Accept: application/x-bibtex" http://dx.doi.org/10.5555/12345678

@article{Carberry_2008,
        doi = {10.5555/12345678},
        url = {https://doi.org/10.5555%2F12345678},
        year = 2008,
        month = {aug},
        publisher = {Society of Psychoceramics},
        volume = {5},
        number = {11},
        pages = {1--3},
        author = {Josiah Carberry},
        title = {Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory},
        journal = {Journal of Psychoceramics}
}

For Crossref XML, use: "Accept: application/vnd.crossref.unixref+xml"
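The same content negotiation can be done from Python with the standard library. This is a sketch under assumptions: the helper names `negotiation_request` and `fetch` are hypothetical, and it targets https://doi.org/ rather than the dx.doi.org host used in the curl command.

```python
import urllib.request
from urllib.parse import quote


def negotiation_request(doi, accept="application/x-bibtex"):
    """Build a request asking the DOI resolver for a specific metadata format."""
    url = "https://doi.org/" + quote(doi)  # quote keeps the '/' inside the DOI
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch(doi, accept="application/x-bibtex"):
    """Send the request (redirects are followed, like curl -L) and return the body."""
    with urllib.request.urlopen(negotiation_request(doi, accept)) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch("10.5555/12345678", "application/vnd.crossref.unixref+xml")` would request the Crossref XML form instead of BibTeX.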

Discarded solutions

anystyle (pdf parser)

From SO: Is it possible to extract the bibliography from a PDF file as a .bibtex? - TeX - LaTeX Stack Exchange

The answers point to anystyle (Ruby): GitHub - inukshuk/anystyle: Fast and smart citation reference parsing

gem install anystyle-cli rexml
anystyle find article.pdf

This returns JSON. The results are pretty bad when tested with my own IJ article: GitHub - phcerdan/InsightJournal-IsotropicWavelets: Template of Technical Report to be submitted to the Insight Journal

phcerdan commented 3 years ago

See script from: https://github.com/phcerdan/insightjournal-dev/commit/6e62ea609fefb49b4606e3177f05c046d9bb8f93

phcerdan commented 3 years ago

The score value used in #74 is 60. Why? Because it works well (no false positives, few false negatives) after extensive testing. It is an empirical value, so treat it with caution.
