elifesciences / elife-crossref-feed

code to support uploading info to crossref on PAW articles

Criteria for including an <unstructured_citation> tag #116

Closed gnott closed 7 years ago

gnott commented 7 years ago

The current logic for adding the <unstructured_citation> tag is based on eLife citations that explicitly mention the publication-type, e.g.

<element-citation publication-type="patent">

Some additional logic that could be added around this could be based on the citation's values themselves. For example, if the citation has a uri or a comment, then include an <unstructured_citation> tag, since these particular details are not allowed elsewhere in the Crossref schema.
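
As a rough sketch of the combined criteria described above (not the code in this repository), the check could look like the following. The function name, the dict-based citation and the set of publication types are all hypothetical; only "patent" comes from the example above.

```python
# Hypothetical set of publication types that always get an unstructured citation.
# Only "patent" is taken from the example in this issue; extend as needed.
UNSTRUCTURED_PUBLICATION_TYPES = {"patent"}


def needs_unstructured_citation(citation):
    """Return True if the citation should get an <unstructured_citation> tag.

    `citation` is assumed to be a plain dict of parsed JATS values,
    which is a simplification for illustration only.
    """
    # Existing criterion: the publication-type is explicitly listed
    if citation.get("publication_type") in UNSTRUCTURED_PUBLICATION_TYPES:
        return True
    # Proposed criterion: a uri or comment has nowhere else to go in the deposit
    return bool(citation.get("uri") or citation.get("comment"))
```

For example, `needs_unstructured_citation({"publication_type": "journal", "uri": "https://www.pymol.org/"})` would return `True` because of the uri alone.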

@Melissa37 do you have any thoughts you could add with respect to this and non-eLife examples?

gnott commented 7 years ago

An example is https://github.com/elifesciences/elife-crossref-feed/issues/88. If there is a comment, should we then convert any citation to unstructured_citation?

Melissa37 commented 7 years ago

I think the main criterion is that if there is a DOI (unless it is a data citation), the citation should be retained as a structured reference, irrespective of what you can glean or have to ignore from the JATS XML.

Then, if there are items such as comments and URLs, add them in the unstructured part, as you have in the tests, e.g.:

```xml
<citation key="12">
    <volume_title>PyMol</volume_title>
    <author>DeLano</author>
    <cYear>2002</cYear>
    <article_title>The PyMol Molecular Graphics System</article_title>
    <unstructured_citation>DeLano W. 2002. The PyMol Molecular Graphics System.
        Schrödinger LLC. PyMol. Version 1.7.4.
        https://www.pymol.org/.</unstructured_citation>
</citation>
```

I did not realise you could mix and match structured information with unstructured information like that!
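
For illustration, a mixed citation like the PyMol one could be assembled with Python's standard `xml.etree.ElementTree`. This is only a sketch: `build_citation` is a hypothetical helper, not a function from this repository, and the child tags are simply written in the order they are passed in.

```python
from xml.etree import ElementTree as ET


def build_citation(key, structured, unstructured=None):
    """Build a <citation> element holding structured tags and, optionally,
    an <unstructured_citation> tag alongside them.

    `structured` is a list of (tag name, text) pairs.
    """
    citation = ET.Element("citation", {"key": key})
    for tag, value in structured:
        ET.SubElement(citation, tag).text = value
    if unstructured:
        ET.SubElement(citation, "unstructured_citation").text = unstructured
    return citation


citation = build_citation(
    "12",
    [
        ("volume_title", "PyMol"),
        ("author", "DeLano"),
        ("cYear", "2002"),
        ("article_title", "The PyMol Molecular Graphics System"),
    ],
    "DeLano W. 2002. The PyMol Molecular Graphics System. "
    "Schrödinger LLC. PyMol. Version 1.7.4. https://www.pymol.org/.",
)
print(ET.tostring(citation, encoding="unicode"))
```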

Melissa37 commented 7 years ago

This is the response from Crossref:

If you include both structured data and an unstructured citation, and the structured data is thorough enough to be parsed, then the unstructured citation will be totally ignored. If the structured data is missing some crucial element (e.g. no journal title), then the system will process the unstructured citation instead.

However, our system can only account for the quantity of structured citation data, not its quality. The problematic scenario is one where you've included enough structured citation data so that the structured citation is parsed, but it's poor quality or inaccurate metadata, (e.g. if you've spelled the journal title or author's name incorrectly; or included an incorrect page number, etc.) so the system will not be able to find a DOI match for that citation. We won't go on to try the unstructured citation in that case.

So, what that boils down to is: it's best to send both structured and unstructured citations unless you find that the metadata in your structured citations tends to be inaccurate or poorly formatted in such a way that it's preventing citation matches with the cited articles' DOIs. In that case, sending just the unstructured citations is preferable.

Melissa37 commented 7 years ago

My response to Crossref:

That's really helpful. All our references are crosschecked against the Crossref API for a DOI, so we "should" be picking up any crossref DOIs and supplying them in our metadata.

We also crosscheck PubMed API.

The issue is where DOIs are not registered by Crossref, or the content type is not a journal and the metadata might not be properly checked via your API.

Is your system checking just Crossref DOIs or other providers too?

I think the metadata in our structured citations tends to be pretty good, but of course it is improved a lot by using the PubMed and Crossref APIs - chicken and egg scenario!

gnott commented 7 years ago

A note for a possible todo in the code is to create a configuration setting for unstructured_citation format.

If it is set to "hybrid" or to True (if we call the setting hybrid unstructured citation), then the Crossref deposit will include both the individual citation tags and the unstructured_citation tag (if applicable).

If the configuration value is set to False, then it will only include the unstructured_citation tag (when applicable) and not the other citation tags.

That will make the output flexible and configurable for other publishers and depending on the best practice for citation formats in the Crossref schema.
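
A minimal sketch of how such a setting might behave, assuming a boolean named `hybrid` (a hypothetical name) and a hypothetical `add_citation` helper. One further assumption not settled in this thread: when the setting is False and no unstructured citation applies, the sketch falls back to the structured tags so the citation is not left empty.

```python
from xml.etree import ElementTree as ET


def add_citation(parent, key, structured, unstructured=None, hybrid=True):
    """Append a <citation> to `parent` according to the `hybrid` setting.

    hybrid=True:  structured tags plus <unstructured_citation> when applicable.
    hybrid=False: only <unstructured_citation> when applicable.
    Assumption: with no unstructured text, the structured tags are always used.
    """
    citation = ET.SubElement(parent, "citation", {"key": key})
    if hybrid or not unstructured:
        for tag, value in structured:
            ET.SubElement(citation, tag).text = value
    if unstructured:
        ET.SubElement(citation, "unstructured_citation").text = unstructured
    return citation


def child_tags(element):
    """Return the tag names of the direct children, in document order."""
    return [child.tag for child in element]


citation_list = ET.Element("citation_list")
fields = [("author", "DeLano"), ("cYear", "2002")]
text = "DeLano W. 2002. The PyMol Molecular Graphics System."

both = add_citation(citation_list, "1", fields, text, hybrid=True)
only_text = add_citation(citation_list, "2", fields, text, hybrid=False)
fallback = add_citation(citation_list, "3", fields, None, hybrid=False)
```

Keeping the switch in one place like this would let another publisher change the deposit format without touching the citation parsing.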

Melissa37 commented 7 years ago

So Crossref only check their internal DOI system:

Our system only checks for citation matches among Crossref DOIs. We can match non-journal content (books, conferences, etc.), though journal articles do make up the bulk of our metadata records.

Let's discuss what approach we take on our next call.

M

gnott commented 7 years ago

I think since we will include structured tags and unstructured_citation tags together for each citation, we can probably close this for now. We can change the logic later if we find the approach to be unsatisfactory.