Adding dataset information

Melissa37 commented 7 years ago

In future, it would be good to take anything listed in the Major datasets generated section:

<sec sec-type="datasets" id="s7">
                <title>Major datasets</title>
                <p>The following datasets were generated:</p>
                <p>
                    <related-object content-type="generated-dataset"
                        source-id="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73147"
                        source-id-type="uri" id="dataset1">
                        <collab collab-type="author">Mishra A</collab>
                        <collab collab-type="author">Pisco AO</collab>
                        <collab collab-type="author">Watt FM</collab>
                        <year>2017</year>
                        <source>A protein phosphatase network controls the temporal and spatial
                            dynamics of differentiation commitment in human epidermis</source>
                        <ext-link ext-link-type="uri"
                            xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73147"
                            >https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73147</ext-link>
                        <comment>Publicly available at the NCBI Gene Expression Omnibus (accession
                            no. GSE73147)</comment>
                    </related-object>
                </p>
                <p>
                    <related-object content-type="generated-dataset"
                        source-id="http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD003281"
                        source-id-type="uri" id="dataset2">
                        <collab collab-type="author">Mishra A</collab>
                        <collab collab-type="author">Pisco AO</collab>
                        <collab collab-type="author">Watt FM</collab>
                        <year>2017</year>
                        <source>A protein phosphatase network controls the temporal and spatial
                            dynamics of differentiation commitment in human epidermis</source>
                        <ext-link ext-link-type="uri"
                            xlink:href="http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD003281"
                            >http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD003281</ext-link>
                        <comment>Publicly available at ProteomeXchange (accession no.
                            PXD003281)</comment>
                    </related-object>
                </p>
            </sec>

and convert it to a PubMed <Object>, for example:

<Object Type="NCBI:geo">
<Param Name="id">GSE73147</Param>
</Object>

However, in this section I don't think we are storing the information in a way that can be parsed for this data. WDYT @gnott? This links to the ongoing work on Data Availability Statements.
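A minimal sketch of that conversion, assuming the accession can be read from the `acc` query parameter of the GEO URL stored in `source-id`. The function name `geo_object_from_uri` is hypothetical and not part of any eLife library; it only uses the Python standard library.

```python
from urllib.parse import urlparse, parse_qs
from xml.etree import ElementTree as ET


def geo_object_from_uri(uri):
    """Build a PubMed <Object> element for a GEO dataset URI, or return None."""
    parsed = urlparse(uri)
    # Assumption: GEO links look like .../geo/query/acc.cgi?acc=GSE73147
    if parsed.netloc != "www.ncbi.nlm.nih.gov" or not parsed.path.startswith("/geo/"):
        return None
    accession = parse_qs(parsed.query).get("acc", [None])[0]
    if not accession:
        return None
    obj = ET.Element("Object", {"Type": "NCBI:geo"})
    param = ET.SubElement(obj, "Param", {"Name": "id"})
    param.text = accession
    return obj


uri = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73147"
print(ET.tostring(geo_object_from_uri(uri), encoding="unicode"))
```

A URI that is not a GEO link, such as the ProteomeXchange one above, returns None rather than a guessed Type.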

Plan I have as at Feb 1, 2019:

[x] Parse the assigning-authority value from datasets in the article XML as a new value in the datasets JSON output named assigningAuthority
[x] Also may want to enhance the parser to extract dataId values from the pub-id tag when appropriate (currently it only takes them from object-id "art-access-id" type tags)
[x] Add a new property to the elife-article Dataset object, something like assigning_authority
[x] Parse from the datasets JSON the assigningAuthority value to populate Dataset objects assigning_authority value when parsing from XML to eLife article objects
[x] In elife-pubmed-xml-generation, add <Object> tags for the datasets, including either the DOI value or accession id / dataId value, as is specified in the original article XML
[x] Use the most appropriate value for the <Object> <Param Name="type"> value in the PubMed deposit for where the dataset is located

Melissa37 commented 7 years ago

On further reading of the PubMed information online, we have clarified that this is the correct way to add data citations to our PubMed deliveries:

My question: Regarding submitting datasets, I am a bit confused by the documentation. Here an object type value of "Dataset" is provided, but here Dataset repository names are provided as the Dataset object type value.

Using the first example, I'd assume the tagging should be:

<Object Type="Dataset">
<Param Name="type">Dryad</Param>
<Param Name="id">10.5061/dryad.2f050</Param>
</Object>

But for the second the tagging should be:

<Object Type="Dryad">
<Param Name="id">10.5061/dryad.2f050</Param>
</Object>
</ObjectList>

PubMed response: The first example that you cite from the help would be used to create a linking pair of citations. We commonly use this structure to link comments and corrections to their original article, and the other linking pairs have been grouped in the same category. If you had a PubMed citation to an article, and a second PubMed citation describing a dataset related to the original article, you could create a link between the two. The XML would look like:

<Object Type="dataset">
<Param Name="type">pmid</Param>
<Param Name="id">25264877</Param>
</Object>

This would link the article citation to the dataset citation. (You could also make the link using the dataset citation's DOI rather than the PMID.)

The second example is the XML that you would submit to create an external link to the dataset. …

<Object Type="Dryad">
<Param Name="id">10.5061/dryad.2f050</Param>
</Object>

So, it seems we can only submit datasets from their list for Object, and just using the one example above, I have found a gap: proteomecentral.proteomexchange (PXD) is not on it. The list:

Keyword, Comment, Dataset, Erratum, Originalreport, Partialretraction, Patientsummary, Reprint, Republished, Retraction, Update, ANZCTR, BioProject, ClinicalTrials.gov, CRiS, CTRI, ChiCTR, DRKS, Dryad, EudraCT, Figshare, GDB, IRCT, ISRCTN, JapicCTI, JMACCT, JPRN, NTR, Omim, PACTR, PDB, PIR, RPCEC, ReBec, SLCTR, SwissProt, TCTR, UMINCTR, UniMES, UniParc, UniProtKB, UniRef, NCBI:dbgap, NCBI:dbvar, NCBI:genbank, NCBI:genome, NCBI:gensat, NCBI:geo, NCBI:homologene, NCBI:nucleotide, NCBI:popset, NCBI:protein, NCBI:pubchem-bioassay, NCBI:pubchem-compound, NCBI:pubchem-substance, NCBI:refseq, NCBI:snp, NCBI:sra, NCBI:structure, NCBI:taxonomy, NCBI:unigene, NCBI:unists
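A rough way to check a candidate Type value against that list before building a deposit. The values are copied from the list in this comment. The helper name `is_accepted_object_type` and the "ProteomeXchange" candidate are hypothetical, and exact, case-sensitive matching is an assumption, since the documentation quoted here does not say how PubMed compares values.

```python
# Values copied from the list in this comment.
# Assumption: matching is exact and case-sensitive.
PUBMED_OBJECT_TYPES = frozenset(
    """
    Keyword Comment Dataset Erratum Originalreport Partialretraction
    Patientsummary Reprint Republished Retraction Update ANZCTR BioProject
    ClinicalTrials.gov CRiS CTRI ChiCTR DRKS Dryad EudraCT Figshare GDB IRCT
    ISRCTN JapicCTI JMACCT JPRN NTR Omim PACTR PDB PIR RPCEC ReBec SLCTR
    SwissProt TCTR UMINCTR UniMES UniParc UniProtKB UniRef NCBI:dbgap
    NCBI:dbvar NCBI:genbank NCBI:genome NCBI:gensat NCBI:geo NCBI:homologene
    NCBI:nucleotide NCBI:popset NCBI:protein NCBI:pubchem-bioassay
    NCBI:pubchem-compound NCBI:pubchem-substance NCBI:refseq NCBI:snp NCBI:sra
    NCBI:structure NCBI:taxonomy NCBI:unigene NCBI:unists
    """.split()
)


def is_accepted_object_type(object_type):
    """Return True if object_type appears in the controlled list."""
    return object_type in PUBMED_OBJECT_TYPES


# "ProteomeXchange" is a hypothetical Type name used only to show the gap.
for candidate in ("NCBI:geo", "Dryad", "ProteomeXchange"):
    print(candidate, is_accepted_object_type(candidate))
```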

Melissa37 commented 6 years ago

PubMed: Yes, you are correct, this is a controlled list of allowable values for the secondary source ID list. We are cautious in expanding the list because we are responsible for vetting, reviewing, monitoring and maintaining any links from PubMed. In this case, where the journal participates fully in PMC, the links are available from the full text there.

Melissa37 commented 6 years ago

This task is awaiting a change in structure output from EJP.

Melissa37 commented 5 years ago

We're ready for this @gnott and @Melissa37

gnott commented 5 years ago

What I see in the kitchen sink XML and a recent published article XML is an assigning authority for the dataset identifier, examples in the kitchen sink being

<pub-id assigning-authority="Dryad" pub-id-type="doi">10.5061/dryad.kj1f3v4</pub-id>

and

<pub-id assigning-authority="NCBI" pub-id-type="accession" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48760">GSE48760</pub-id>

The parser is not yet picking up the assigning-authority="NCBI" value, and that will be a first place to start I think.
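As an illustration of what picking up that value could look like, here is a sketch using only the Python standard library. It is not the elife-tools implementation. The function name `dataset_from_citation` and the `doi` and `uri` keys are assumptions; `assigningAuthority` and `dataId` are the names used in the plan in the first comment.

```python
from xml.etree import ElementTree as ET

# Sample input based on the kitchen sink example above.
XML = """<element-citation xmlns:xlink="http://www.w3.org/1999/xlink" publication-type="data">
  <pub-id assigning-authority="NCBI" pub-id-type="accession" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48760">GSE48760</pub-id>
</element-citation>"""

# ElementTree exposes namespaced attributes in {namespace}name form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def dataset_from_citation(citation):
    """Collect identifier details from the first <pub-id> of a data citation."""
    pub_id = citation.find("pub-id")
    if pub_id is None:
        return {}
    dataset = {"assigningAuthority": pub_id.get("assigning-authority")}
    if pub_id.get("pub-id-type") == "doi":
        dataset["doi"] = pub_id.text
    else:
        # Assumption: any non-DOI pub-id is treated as an accession / dataId.
        dataset["dataId"] = pub_id.text
    if pub_id.get(XLINK_HREF):
        dataset["uri"] = pub_id.get(XLINK_HREF)
    return dataset


print(dataset_from_citation(ET.fromstring(XML)))
```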

The eLife article object Dataset is being populated by the dataset JSON data the XML parser makes available. I don't think there is a dataset property in the API schema to fit the assigning-authority value (I want to mention to @thewilkybarkid, in case it has some implications).

The simplest way for me to include it in PubMed outputs is to just add a new property to the JSON output the parser is creating, and this new property should just be ignored by the API schema parser and will cause no harm.

The tentative plan I have so far is: [moved checkbox list to the first comment]

gnott commented 5 years ago

First two checkbox steps are in PR https://github.com/elifesciences/elife-tools/pull/297 awaiting review.

gnott commented 5 years ago

Next two checkboxes are completed in PR https://github.com/elifesciences/elife-article/pull/50.

gnott commented 5 years ago

Drafting the logic in the https://github.com/elifesciences/elife-pubmed-xml-generation library, I have some details and questions.

Current status: Using the latest kitchen sink XML, https://github.com/elifesciences/XML-mapping/blob/master/elife-00666.xml, as the example, here is the new XML included in the Pubmed deposit for the datasets:

<Object Type="Dryad">
    <Param Name="id">10.5061/dryad.kj1f3v4</Param>
</Object>
<Object Type="NCBI">
    <Param Name="id">GSE48760</Param>
</Object>

This includes only doi and accession_id values of the Dataset object.

I tried including the uri value of the third dataset which has the assigning_authority of "other" in the XML, however since other is not a value Pubmed will accept, I've omitted including any plain uri yet, not having a good example use case. Fortunately, if Pubmed does not recognise the value in the <Object> tag's Type attribute, it just ignores it and does not cause any errors on their deposit validation tool.
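The current behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the code in elife-pubmed-xml-generation: the `Dataset` dataclass and the `dataset_objects` function are hypothetical stand-ins, and the uri given for the third dataset is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional
from xml.etree import ElementTree as ET


@dataclass
class Dataset:
    # Hypothetical stand-in for the elife-article Dataset object.
    assigning_authority: Optional[str] = None
    doi: Optional[str] = None
    accession_id: Optional[str] = None
    uri: Optional[str] = None


def dataset_objects(datasets):
    """Build one <Object> per dataset that has a DOI or accession id."""
    objects = []
    for dataset in datasets:
        identifier = dataset.doi or dataset.accession_id
        # Datasets with only a plain uri are omitted, as described above.
        if not dataset.assigning_authority or not identifier:
            continue
        obj = ET.Element("Object", {"Type": dataset.assigning_authority})
        ET.SubElement(obj, "Param", {"Name": "id"}).text = identifier
        objects.append(obj)
    return objects


datasets = [
    Dataset(assigning_authority="Dryad", doi="10.5061/dryad.kj1f3v4"),
    Dataset(assigning_authority="NCBI", accession_id="GSE48760"),
    Dataset(assigning_authority="other", uri="https://example.org/dataset"),
]
for obj in dataset_objects(datasets):
    print(ET.tostring(obj, encoding="unicode"))
```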

Issue/question 1: Regarding <Object Type="NCBI">, it is not on the accepted value list specified by Pubmed. If I change it to <Object Type="NCBI:geo"> then Pubmed does display it. @Melissa37, will you be providing more specific NCBI assigning authority values in the XML? If not, do you know how we can determine and include the :geo portion of the Type name?

Issue/question 2: The example with <Object Type="Dryad"> seems to work correctly, and PubMed displays the dataset link.

Do you have examples of datasets that would be Figshare type, and will those also have a DOI value? Are there other examples of dataset <pub-id> tags you could share that are possible to add to the Pubmed deposits?

Melissa37 commented 5 years ago

@gnott I will contact PubMed about issue 1 and cc you in. I'd rather they accept NCBI than add further tasks to our production process, but sorry for not being thorough enough in my investigations at set up!

@FAtherden-eLife would you be able to find any examples that Graham is after in the recent archive?

M

fred-atherden commented 5 years ago

@gnott, yes the figshare citations have dois

Example

<element-citation xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xlink="http://www.w3.org/1999/xlink" id="dataset1" publication-type="data" specific-use="isSupplementedBy">
  <person-group person-group-type="author">
    <name>
      <surname>Kazunori</surname>
      <given-names>Yoshizawa</given-names>
    </name>
    <name>
      <surname>Yoshitaka</surname>
      <given-names>Kamimura</given-names>
    </name>
    <name>
      <surname>Rodrigo</surname>
      <given-names>L Ferreira</given-names>
    </name>
    <name>
      <surname>Charles</surname>
      <given-names>Lienhard</given-names>
    </name>
    <name>
      <surname>Alexander</surname>
      <given-names>Blanke</given-names>
    </name>
  </person-group>
  <year iso-8601-date="2018">2018</year>
  <data-title>Biological Switching Valve</data-title>
  <source>Figshare</source>
  <pub-id assigning-authority="figshare" pub-id-type="doi">10.6084/m9.figshare.6741857</pub-id>
</element-citation>

gnott commented 5 years ago

Thanks @FAtherden-eLife, it looks like DOI value will work for Figshare, although I might need to capitalise the F - I will test it out.

Are you able to find any additional assigning-authority values in datasets we might be able to specify to PubMed?

gnott commented 5 years ago

Tested assigning-authority="figshare" and it works now. Previously, I think only Figshare with a capital F worked; now lowercase is OK too.

gnott commented 5 years ago

Thanks for the additional examples in the Google sheet, @Melissa37. From the start we can support these, based on the ones you are potentially using now (looking at the uri to choose the specific NCBI assigning authority):

NCBI:geo
NCBI:dbgap
NCBI:nucleotide
NCBI:sra

gnott commented 5 years ago

If you want to add additional NCBI:xxxx values in the future, we'll need to expand the examples of uri to assigning authority mappings.
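A sketch of how such a uri to assigning authority mapping might be expressed. Only the GEO URL pattern appears in this thread. The other path prefixes are assumptions about NCBI URLs, and the names `NCBI_PATH_PREFIXES` and `pubmed_object_type` are hypothetical rather than taken from elife-pubmed-xml-generation.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI path prefixes to PubMed Object Type values.
# Only the GEO pattern is shown in this thread; the others are assumed.
NCBI_PATH_PREFIXES = (
    ("/geo/", "NCBI:geo"),
    ("/projects/gap/", "NCBI:dbgap"),
    ("/nuccore/", "NCBI:nucleotide"),
    ("/sra", "NCBI:sra"),
)


def pubmed_object_type(assigning_authority, uri):
    """Refine a generic "NCBI" assigning authority using the dataset URI."""
    if assigning_authority != "NCBI":
        return assigning_authority
    parsed = urlparse(uri or "")
    if not parsed.netloc.endswith("ncbi.nlm.nih.gov"):
        return None
    for prefix, object_type in NCBI_PATH_PREFIXES:
        if parsed.path.startswith(prefix):
            return object_type
    # No known pattern matched: return None rather than guess a Type.
    return None


print(pubmed_object_type(
    "NCBI", "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48760"
))
```

Supporting another NCBI:xxxx value would then mean adding one more prefix entry, once an example uri is available to confirm the pattern.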

gnott commented 5 years ago

Code is merged into the elife-pubmed-xml-generation project, and I will go through the steps to get it deployed for eLife.

gnott commented 5 years ago

Example deposited last week showing datasets on https://www.ncbi.nlm.nih.gov/pubmed/30735131

@Melissa37 do you think we completed this issue, or is there more to do before we close it?

Melissa37 commented 5 years ago

That's great, thanks! We can close.

elifesciences / elife-pubmed-feed

Adding dataset information #61