Sending Clinical trial metadata to Crossref

Melissa37 commented 4 years ago

Problem / Motivation

Production wants to send clinical trial data to Crossref so that this information is available to users of the Crossref API. This also helps meet the need to track clinical trials.

Proposed solution

eLife adds Crossmark metadata to our deposits:

Linked clinical trials

Dependency - updated structured abstracts: https://github.com/elifesciences/issues/issues/4622

Crossref documentation: https://www.crossref.org/education/crossmark/linked-clinical-trials/

These fields should be included within the custom metadata section of the Crossmark deposit:

<clinicaltrial_data>
  <doi>10.5555/12345678</doi>
  <ct:program>
    <ct:clinical-trial-number registry="10.18810/isrctn" type="results">ISRCTN1234</ct:clinical-trial-number>
    <ct:clinical-trial-number registry="10.18810/isrctn" type="results">ISRCTN9999</ct:clinical-trial-number>
  </ct:program>
</clinicaltrial_data>

Generally our articles seem to only link to one clinical trial, but multiple can be added. I assume the DOI listed is the DOI of this paper.
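As a rough illustration of adding one or more trials, here is a minimal Python sketch that builds a <ct:program> block inside <custom_metadata> using only the standard library. The function name build_ct_program and the tuple layout are hypothetical and are not taken from the elifecrossref library; the namespace URI is the one Crossref uses for clinicaltrials.xsd.

```python
import xml.etree.ElementTree as ET

# Namespace for the Crossref clinical trials schema
CT_NS = "http://www.crossref.org/clinicaltrials.xsd"
ET.register_namespace("ct", CT_NS)


def build_ct_program(trials):
    """Build a ct:program element (hypothetical helper).

    trials is a list of (registry_doi, trial_number, trial_type) tuples,
    where trial_type may be None because the type attribute is optional.
    """
    program = ET.Element("{%s}program" % CT_NS)
    for registry_doi, trial_number, trial_type in trials:
        tag = ET.SubElement(program, "{%s}clinical-trial-number" % CT_NS)
        tag.set("registry", registry_doi)
        if trial_type:
            tag.set("type", trial_type)
        tag.text = trial_number
    return program


# Values copied from the Crossref sample above
custom_metadata = ET.Element("custom_metadata")
custom_metadata.append(build_ct_program([
    ("10.18810/isrctn", "ISRCTN1234", "results"),
    ("10.18810/isrctn", "ISRCTN9999", "results"),
]))
xml_string = ET.tostring(custom_metadata, encoding="unicode")
print(xml_string)
```

An article with a single trial would pass a one-item list; the same loop covers both cases.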

From Crossref re the need for a DOI

As we advise for users supplying Crossmark data simply as a workaround to make their free-to-read content visible in our API, you could just use the DOI that's being updated in the tags as well as in the . The latter is more of a 'hack' than the former, but since we're in the midst of changing approaches to this kind of update metadata, it's still a fine option. If you want to register a DOI for your updates policies that's great, but we don't want that extra step to discourage anyone from sending in the updates metadata.

<crossmark>
  <crossmark_policy>10.5555/crossmark_policy</crossmark_policy>
  <crossmark_domains>
    <crossmark_domain>
      <domain>psychoceramics.labs.crossref.org</domain>
    </crossmark_domain>
  </crossmark_domains>
  <crossmark_domain_exclusive>true</crossmark_domain_exclusive>

Clarification needed and assumptions

Deposit Crossmark metadata: Crossmark metadata can be deposited as part of your regular Crossref metadata deposit, and can also be deposited as stand-alone data to populate backfiles. For Crossmark-only deposits, see the schema and schema documentation relating to resource-only deposits: https://www.crossref.org/education/metadata-stewardship/maintaining-your-metadata/adding-metadata-to-an-existing-record/

This does not seem to support our use case. Would we be better off just re-depositing everything?

Question: the sample above sets <crossmark_domain_exclusive> to 'true'. What does this refer to? Should eLife set it to true or false?

Tasks

[x] Add or check that the clinicaltrials.xsd schema is being added as an XML namespace - Yes, we already do that now
[x] Parse clinical trials data from the XML into an Article object
[x] How do we / can we add clinical trials data without Crossmark enabled? (I think so) Answer: No, it looks like they are only added into the Crossmark section of the deposit
[x] If Crossmark is enabled, then clinical trials data goes inside the <custom_metadata> tag (I believe) Answer: Correct.
[x] Create a sample Crossref deposit that includes clinical trial data and test it for validity against the Crossref schema
[x] Code to parse http://api.crossref.org/works/10.18810/registries/transform/application/vnd.crossref.unixsd+xml registry file format and convert into a source-id to doi map
[x] Use http://doi.org/10.18810/registries as the URI of the registries XML, as it is probably the safest
[x] Test scenario for parsing the XML registry file, but only use an abbreviated or faked registry XML, so it is not confused with real data that can be used in generating Crossref deposits - a fresh registry XML file should be downloaded from Crossref for real situations
[x] Test case for clinical trial of crossref-doi type, instead of registry-name type
[x] Tests for pre-results getting converted to preResults
[x] Code to add clinical trials data to the Crossref deposit XML
[x] Do we need to set any additional library configuration or preferences to enable this? Do we want to allow users to specify whether clinical trials are deposited or not, or will the default be to always deposit the value if it is available? Answer: For now it will deposit any clinical trials it finds by default if Crossmark program is enabled
[x] PR and merge the code change
[x] Deploy to the prod environment
[ ] Try it out, if we have article XML that is suitable to set the clinical trial data attribute

Technical notes

Here are some of my notes and thoughts, for discussion:

I think structured abstracts could possibly be added to the article data structure used by the Crossref generation library without involving integration with other data schemas

Clinical trial data would be added as a new property of an Article object, and then we can include that in Crossref deposits

Crossmark related:

Code in the old, archived, Crossref generation library:

For defining the Crossmark policy and domain: https://github.com/elifesciences/elife-poa-xml-generation/blob/develop/generateCrossrefXml.py#L34-L35

Old code that added Crossmark XML to a Crossref deposit, but it was never used for real I think: https://github.com/elifesciences/elife-poa-xml-generation/blob/develop/generateCrossrefXml.py#L219-L236

Perhaps not all articles would need to be deposited with Crossmark data, but my guess is if we want to register a Correction, for example, the article that is being corrected would need to be deposited with Crossmark, and then the correction article as well afterward

XML and testing

For clinical trials support, need to add the XML schema prefix to the Crossref XML deposit, e.g. xmlns:ct="http://www.crossref.org/clinicaltrials.xsd"

Add additional settings for Crossmark into the elifecrossref library .cfg file to turn on/off Crossmark deposits, specify the Crossmark domain and Crossmark policy DOI

For Crossmark, test exam

User interface / Wireframes

Melissa37 commented 4 years ago

@gnott This is the clinical trial ticket

gnott commented 4 years ago

Note to self: I had some comments about clinical trials in https://github.com/elifesciences/elife-crossref-feed/issues/145#issuecomment-623795752, about adding data to the Article() object and about adding the Crossref clinical trials schema to the generated deposit XML.

gnott commented 4 years ago

From https://www.crossref.org/education/crossmark/linked-clinical-trials/,

The relationship of the publication to the clinical trial (optional) This field is optional but encouraged. The three allowed elements are “pre-results”, “results” and “post-results”, indicating which stage of the trial the publication is reporting on.

@Melissa37, would we ever have this value in the XML, or anticipate it would be known whether a clinical trial had a status of these types?

I would also like to think the Crossref deposit library should consider making this an option to specify, even if eLife is not using these values.

gnott commented 4 years ago

I'm reading the JATS4R recommendation; the @content-type attribute of the <related-object> tag looks to hold this data, so when I configure parsing article XML and populating the clinical trials data of an Article, I will add a sample with that level of detail.
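A minimal sketch of that parsing step, using only the Python standard library. The sample XML is hypothetical and hand-written for illustration; the attribute names follow my reading of the JATS4R clinical trials recommendation and should be checked against it. The function parse_clinical_trials and the dict keys are assumptions, not the actual parser or Article object API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; not taken from a real eLife article
SAMPLE_JATS = """<abstract xmlns:xlink="http://www.w3.org/1999/xlink">
<sec><title>Clinical trial number</title>
<p><related-object content-type="pre-results"
 source-type="clinical-trials-registry"
 source-id="ClinicalTrials.gov" source-id-type="registry-name"
 document-id="NCT02836002" document-id-type="clinical-trial-number"
 xlink:href="https://clinicaltrials.gov/show/NCT02836002">NCT02836002</related-object></p>
</sec></abstract>"""


def parse_clinical_trials(xml_text):
    """Collect clinical trial attributes from related-object tags."""
    root = ET.fromstring(xml_text)
    trials = []
    for tag in root.iter("related-object"):
        # Skip related-object tags that are not clinical trials
        if tag.get("document-id-type") != "clinical-trial-number":
            continue
        trials.append({
            "source_id": tag.get("source-id"),
            "source_id_type": tag.get("source-id-type"),
            "document_id": tag.get("document-id"),
            # content-type is optional, so this may be None
            "content_type": tag.get("content-type"),
        })
    return trials


trials = parse_clinical_trials(SAMPLE_JATS)
print(trials)
```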

gnott commented 4 years ago

I have a valid, manually composed deposit; the XML contains this:

<custom_metadata>
...
  <ct:program>
    <ct:clinical-trial-number registry="10.18810/clinical-trials-gov">NCT02836002</ct:clinical-trial-number>
  </ct:program>
</custom_metadata>

I see now there will be a little more scope than I expected, because we need to match up the registry name from the article XML with the list of registries Crossref maintains at http://api.crossref.org/works/10.18810/registries/transform/application/vnd.crossref.unixsd+xml in order to get the DOI of the registry.

To do the matching, I think I'll add in some logic into the Crossref library to parse the registry XML file, use an example file for testing purposes, and when the Crossref library is incorporated into a workflow, we can download a fresh copy of the registry XML prior to populating the clinical trial data for the article, if the article has any clinical trials. I want to avoid saving a copy of the registry XML as it is today into the project, because it will eventually be out-of-date, and we should always rely on the live registry file when generating real Crossref deposits.
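The matching logic could look something like the sketch below. The registry XML here is abbreviated and faked on purpose: the element layout (dataset, titles, doi_data) is an assumption and the real Crossref file uses XML namespaces and more nesting, so the lookups would need adjusting. The function name registry_doi_map is hypothetical, and the second registry and its DOI are invented for testing only.

```python
import xml.etree.ElementTree as ET

# Abbreviated, faked registry XML for testing; structure is assumed
FAKE_REGISTRY_XML = """<registries>
  <dataset>
    <titles>
      <title>ClinicalTrials.gov</title>
      <subtitle>ClinicalTrials.gov</subtitle>
    </titles>
    <doi_data><doi>10.18810/clinical-trials-gov</doi></doi_data>
  </dataset>
  <dataset>
    <titles>
      <title>Example Registry (fake)</title>
      <subtitle>EX-REG</subtitle>
    </titles>
    <doi_data><doi>10.18810/fake-example-registry</doi></doi_data>
  </dataset>
</registries>"""


def registry_doi_map(xml_text):
    """Map each registry name (title and subtitle) to its DOI."""
    root = ET.fromstring(xml_text)
    doi_map = {}
    for dataset in root.iter("dataset"):
        doi = dataset.findtext("doi_data/doi")
        if not doi:
            continue
        for path in ("titles/title", "titles/subtitle"):
            name = dataset.findtext(path)
            if name:
                doi_map[name] = doi
    return doi_map


doi_map = registry_doi_map(FAKE_REGISTRY_XML)
print(doi_map)
```

Keying on both the title and the subtitle keeps the sketch neutral about which of the two the article XML will carry.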

gnott commented 4 years ago

Making a note too that if I changed in my sample registry="10.18810/clinical-trials-gov" to registry="10.18810/foo", it is not rejected immediately by the Crossref XML validity checker. I don't know what the Crossref ingestion queue would do if the DOI doesn't match the registry they maintain. We'll assume for now that only the registry names we can match to Crossref's registry are the ones we will include in the Crossref deposit.
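That assumption could be sketched as a simple filter: trials whose registry name has no match in the registry map are left out of the deposit. The function name match_registries and the dict keys are hypothetical, and the map below is a hard-coded stand-in for one built from the live registry file.

```python
def match_registries(trials, doi_map):
    """Keep only trials whose registry name is in the registry map.

    Each kept trial gains a registry_doi value; unmatched trials are
    dropped so they never reach the Crossref deposit.
    """
    matched = []
    for trial in trials:
        registry_doi = doi_map.get(trial["source_id"])
        if registry_doi is None:
            # Unknown registry name: skip rather than guess a DOI
            continue
        matched.append({**trial, "registry_doi": registry_doi})
    return matched


# Stand-in map; real values should come from a freshly downloaded registry
doi_map = {"ClinicalTrials.gov": "10.18810/clinical-trials-gov"}
trials = [
    {"source_id": "ClinicalTrials.gov", "document_id": "NCT02836002"},
    {"source_id": "Unknown Registry", "document_id": "XYZ0001"},
]
matched = match_registries(trials, doi_map)
print(matched)
```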

Melissa37 commented 4 years ago

> The relationship of the publication to the clinical trial (optional) This field is optional but encouraged. The three allowed elements are “pre-results”, “results” and “post-results”, indicating which stage of the trial the publication is reporting on.
>
> @Melissa37, would we ever have this value in the XML, or anticipate it would be known whether a clinical trial had a status of these types?
>
> I would also like to think the Crossref deposit library should consider making this an option to specify, even if eLife is not using these values.

I remember when this was all discussed on the Crossref working group implementing this - it was all medical journals

We've only just started looking into Medicine and the starting point was getting abstracts to match what other medical journals are doing.

@mariajoaoguerreiro might have a view on whether we'll be recording this in the future but for now it's not something eLife can do.

> I'm reading the JATS4R recommendation; the @content-type attribute of the <related-object> tag looks to hold this data, so when I configure parsing article XML and populating the clinical trials data of an Article, I will add a sample with that level of detail.

Cool, makes sense to future proof for eLife but make it work for those already doing this

> I see now there will be a little more scope than I expected, because we need to match up the registry name from the article XML with the list of registries Crossref maintains at http://api.crossref.org/works/10.18810/registries/transform/application/vnd.crossref.unixsd+xml in order to get the DOI of the registry.
>
> To do the matching, I think I'll add in some logic into the Crossref library to parse the registry XML file, use an example file for testing purposes, and when the Crossref library is incorporated into a workflow, we can download a fresh copy of the registry XML prior to populating the clinical trial data for the article, if the article has any clinical trials. I want to avoid saving a copy of the registry XML as it is today into the project, because it will eventually be out-of-date, and we should always rely on the live registry file when generating real Crossref deposits.

Ah, good point, I had forgotten about that. @FAtherden-eLife could you correspond with @gnott on this so we get some Schematron validation in place too?

> Making a note too that if I changed in my sample registry="10.18810/clinical-trials-gov" to registry="10.18810/foo", it is not rejected immediately by the Crossref XML validity checker. I don't know what the Crossref ingestion queue would do if the DOI doesn't match the registry they maintain. We'll assume for now that only the registry names we can match to Crossref's registry are the ones we will include in the Crossref deposit.

Yeah, makes sense, but what if they update that list? Should I check where they are notifying people of new releases? For instance the Open Funder Registry gets new irregular releases that we update in our systems.

mariajoaoguerreiro commented 4 years ago

@Melissa37 Yes, I'd agree with you.

gnott commented 4 years ago

> ... new releases?

The registry XML has this value <crm-item name="last-update" type="date">2020-04-07T11:31:23Z</crm-item> which might be helpful to detect new versions, but as for how or whether Crossref notifies people about a new release I could not say.
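One way to use that value, sketched with the Python standard library: read the last-update timestamp and compare it with the one seen on the previous download. The function name registry_last_update and the previously_seen value are hypothetical, and the sample wraps the crm-item in a made-up parent element; the lookup compares local tag names so it should tolerate a namespace on the real file, though that is untested here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The crm-item line is from the registry XML; the wrapper tag is made up
SAMPLE = (
    '<crm-items><crm-item name="last-update" type="date">'
    "2020-04-07T11:31:23Z</crm-item></crm-items>"
)


def registry_last_update(xml_text):
    """Return the last-update value as an aware datetime, or None."""
    root = ET.fromstring(xml_text)
    for item in root.iter():
        # Compare the local name so a namespace prefix does not matter
        local_name = item.tag.rsplit("}", 1)[-1]
        if local_name == "crm-item" and item.get("name") == "last-update":
            parsed = datetime.strptime(item.text.strip(), "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


# Hypothetical timestamp remembered from an earlier download
previously_seen = datetime(2020, 1, 1, tzinfo=timezone.utc)
latest = registry_last_update(SAMPLE)
is_newer = latest is not None and latest > previously_seen
print(latest, is_newer)
```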

gnott commented 4 years ago

A question for @FAtherden-eLife: if you look at the registry XML file, for the one eLife example I have, which uses ClinicalTrials.gov as the registry name, that value is used as both the <title> and <subtitle> for that registry.

If you were to add a clinical trial for one of the other registries, would you be using the <title> or <subtitle> in the article XML (which is what I'd use to match and find the DOI for that registry)?

For example, in the <related-object> tag, would you have source-id="EU Clinical Trials Register" or source-id="EU-CTR" for that registry?

fred-atherden commented 4 years ago

@gnott, my position would be that we should be using the subtitle for the source-id attribute value, so source-id="EU-CTR" would be correct/expected.

We can control the list of allowed source-id values based on that XML file, via Schematron, so that no others should come through from production.

gnott commented 4 years ago

I got to a point yesterday where I was a little stuck on processing the @content-type attribute, because JATS4R may recommend a value like pre-results but Crossref schema accepts the value preResults. I've just realised, whichever is chosen for the article XML, potentially validated by Schematron, it won't matter to me as long as I make sure the value translation supports both values: if preResults, use preResults, if pre-results use preResults in the Crossref deposit.
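The translation could be as small as the sketch below. The names are hypothetical, and postResults is assumed by analogy with preResults; it should be confirmed against the Crossref clinical trials schema before relying on it.

```python
# Values the Crossref deposit accepts (postResults is an assumption)
CROSSREF_TYPES = {"preResults", "results", "postResults"}

# Hyphenated JATS4R-style values mapped to the Crossref spelling
JATS4R_TO_CROSSREF = {
    "pre-results": "preResults",
    "results": "results",
    "post-results": "postResults",
}


def crossref_trial_type(content_type):
    """Translate a content-type value into the Crossref type value.

    Accepts either spelling; returns None for missing or unknown values
    so the optional type attribute can simply be omitted.
    """
    if content_type in CROSSREF_TYPES:
        return content_type
    return JATS4R_TO_CROSSREF.get(content_type)


print(crossref_trial_type("pre-results"))
print(crossref_trial_type("preResults"))
```

Returning None for unknown values means a bad content-type never produces an invalid type attribute in the deposit.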

Melissa37 commented 4 years ago

> I got to a point yesterday where I was a little stuck on processing the @content-type attribute, because JATS4R may recommend a value like pre-results but Crossref schema accepts the value preResults. I've just realised, whichever is chosen for the article XML, potentially validated by Schematron, it won't matter to me as long as I make sure the value translation supports both values: if preResults, use preResults, if pre-results use preResults in the Crossref deposit.

Yeah, JATS4R has attribute guidance and it differs from how Crossref works, so some mapping would have to happen.

This is for the benefit of all publishers using our tool though, right? As we don't have this level of detail!

gnott commented 4 years ago

Yes, I want to add the @content-type attribute to a test scenario sample just so it is covered; it is simple to add, even if not used (yet) in eLife XML.

gnott commented 4 years ago

New issue https://github.com/elifesciences/issues/issues/5830 opened as a reminder to test this out or check the results when clinical trials data is available for eLife articles.

elifesciences / elife-crossref-feed