elifesciences / elife-tools

Python library for parsing eLife article XML data.
MIT License

Add doi to pub_history event data parsed #304

Closed gnott closed 5 years ago

gnott commented 5 years ago

Re: issue https://github.com/elifesciences/issues/issues/4284, there's a new XML sample that includes a DOI. I added the DOI to the pub_history() output and added a new XML sample for the latest XML.

@lsh-0, you might want to review this. I think you're the next likely user of this data.

coveralls commented 5 years ago


Coverage increased (+0.001%) to 99.559% when pulling d0585b0cdbdafb291fb48b1088c3d96acecb084b on pub-history-doi into fb26f4156cfca73e448bcd431d2cbb396f2034a3 on develop.

lsh-0 commented 5 years ago

I've worked with systems where the DOI isn't the preferred article ID or is one of multiple. It got messy in the inevitable upgrade to support multiple IDs.

I would suggest including all data we have available and assuming we'll see multiple article-id elements:

    OrderedDict([
        ('event_type', 'preprint-publication'),
        ('event_desc', 'This article was originally published as a <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1101/118356">preprint</ext-link> on bioRxiv.'),
        ('event_desc_html', 'This article was originally published as a <a href="https://doi.org/10.1101/118356">preprint</a> on bioRxiv.'),
        ('uri', 'https://doi.org/10.1101/118356'),
        ('uri_text', 'preprint'),
        ('id_list', [
            OrderedDict([
                ('type', 'doi'),
                ('value', '10.1101/118356'),
                ('assigning-authority', 'crossref'),
            ]),
        ]),
        ('day', '24'),
        ('month', '03'),
        ('year', '2017'),
        ('date', date_struct(2017, 3, 24)),
        ('iso-8601-date', '2017-03-24'),
    ])

or something similar.
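As a rough illustration of the id_list idea above, here is a minimal sketch that collects every article-id child of an event into a list. It is not the elife-tools implementation: the function name `event_id_list`, the sample XML, and the `pub-id-type` and `assigning-authority` attribute names are assumptions made for this example, and it uses the standard library parser rather than whatever the library uses internally.

```python
from collections import OrderedDict
import xml.etree.ElementTree as ET

# Hypothetical sample: the attribute names and the second ID are assumptions
# for illustration, not copied from real eLife XML.
SAMPLE_EVENT = """
<event>
  <article-id pub-id-type="doi" assigning-authority="crossref">10.1101/118356</article-id>
  <article-id pub-id-type="other">example-id-1</article-id>
</event>
""".strip()


def event_id_list(event):
    """Return one OrderedDict per <article-id> child of an <event> element."""
    id_list = []
    for tag in event.findall("article-id"):
        item = OrderedDict()
        item["type"] = tag.get("pub-id-type")
        item["value"] = (tag.text or "").strip()
        # Only record the assigning authority when the attribute is present.
        authority = tag.get("assigning-authority")
        if authority is not None:
            item["assigning-authority"] = authority
        id_list.append(item)
    return id_list


id_list = event_id_list(ET.fromstring(SAMPLE_EVENT))
```

Returning a list even when there is only one ID means a consumer would not need to change if a second article-id ever appears.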

lsh-0 commented 5 years ago

From the main ticket, ages ago:

Also, I don't know whether this would happen, but what if someone put a preprint in many locations, and they would all be version 0? This would not work.

which would mean that each location the article lived at prior to publication with eLife would have its own ID, and not necessarily a DOI issued by Crossref

gnott commented 5 years ago

I think in practice an <event> in <pub-history> will not often have multiple <article-id> tags, but I'm happy to change the id data into a list as you described it, thanks! The output of this function is not used yet, so it is an easy time to do it.

We can ignore the other values we get from <article-id> in the article, sub-article, or citations, for now, until we also want to expose an id_list for those data structures.

gnott commented 5 years ago

which would mean that each location the article lived at prior to publication with eLife would have its own ID, and not necessarily a DOI issued by Crossref

I think if an article has multiple preprints or multiple versions, each will get its own <event> tag inside the <pub-history> tag. How these are stored or displayed on journal remains unknown to me, though, which is where I think the concept of a version 0 originated.
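To make the one-event-per-preprint point concrete, here is a small sketch that walks a pub-history element and emits one record per event. The sample XML, the second event, and the function name `pub_history_events` are assumptions for illustration only. The real pub_history() output carries more fields than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical pub-history with two preprint events; the second has no DOI.
SAMPLE_HISTORY = """
<pub-history>
  <event>
    <event-desc>Preprint posted on server A.</event-desc>
    <article-id pub-id-type="doi">10.1101/118356</article-id>
  </event>
  <event>
    <event-desc>Preprint posted on server B.</event-desc>
  </event>
</pub-history>
""".strip()


def pub_history_events(pub_history):
    """Return one dict per <event>, each with its own (possibly empty) id_list."""
    events = []
    for event in pub_history.findall("event"):
        events.append({
            "event_desc": event.findtext("event-desc", default=""),
            "id_list": [
                {"type": tag.get("pub-id-type"), "value": (tag.text or "").strip()}
                for tag in event.findall("article-id")
            ],
        })
    return events


events = pub_history_events(ET.fromstring(SAMPLE_HISTORY))
```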

gnott commented 5 years ago

Also, I suspect not every preprint will have a DOI; some may have only a URI. I believe the recent XML change reflects how bioRxiv specifically assigns DOIs to articles, so eLife XML can additionally specify their location by DOI (as well as by the URI in the <event-desc>).
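A consumer of this data might handle the DOI-or-URI case along these lines. This is only a sketch under assumptions: the function name `preferred_location` is hypothetical, the event dicts are hand-written stand-ins for pub_history() output, and the example.org URI is a placeholder.

```python
def preferred_location(event):
    """Prefer a DOI link built from id_list; fall back to the event's URI."""
    for item in event.get("id_list", []):
        if item.get("type") == "doi" and item.get("value"):
            return "https://doi.org/" + item["value"]
    # No DOI recorded: use whatever URI the event description pointed at.
    return event.get("uri")


# Hand-written stand-ins for parsed events (assumed shapes, not real output).
with_doi = {
    "uri": "https://doi.org/10.1101/118356",
    "id_list": [{"type": "doi", "value": "10.1101/118356"}],
}
uri_only = {"uri": "https://example.org/preprints/123", "id_list": []}
```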

lsh-0 commented 5 years ago

How these are stored or displayed on journal remains unknown to me, though, which is where I think the concept of a version 0 originated.

It was going to be part of the article's publication history, except with slightly more detail than each item currently has.

gnott commented 5 years ago

@lsh-0 do you have any additional comments before this PR is merged? I think I addressed the possible multiple id values that an element may have.

lsh-0 commented 5 years ago

If the multiple ID values are addressed, it should be good to go.

gnott commented 5 years ago

👍