elifesciences / elife-pubmed-feed

code to support uploading feeds to pubmed for POA articles and VOR articles
1 stars 4 forks source link

PubMed Group author members not being supplied correctly #79

Closed fred-atherden closed 3 years ago

fred-atherden commented 3 years ago

Problem / Motivation

We've noticed that group authors members are not being supplied correctly to PubMed.

Take 60675 as an example. The 45th Author (Oxford University Hospitals Staff Testing Group) is a group with 108 members.

In PubMed, the group members ~aren't enumerated~ are included at the end of the author list. According to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Can_collaborator_names_be, each member should be tagged using <IndividualName> in a <Group> element.

~In elife-pubmed-60675-20200821114508.xml this information appears to be missing:~ In elife-pubmed-60675-20200911154510 the group members are suffixed to the end of the author list:

<Author>
                <FirstName>Adam JR</FirstName>
                <LastName>Watson</LastName>
                <Affiliation>University of Oxford, Oxford, United Kingdom</Affiliation>
            </Author>
            <Author>
                <FirstName>Adan</FirstName>
                <LastName>Taylor</LastName>
                <Affiliation>University of Oxford, Oxford, United Kingdom</Affiliation>
            </Author>
            <Author>
                <FirstName>Alan</FirstName>
                <LastName>Chetwynd</LastName>
                <Affiliation>University of Oxford, Oxford, United Kingdom</Affiliation>
            </Author>

so perhaps some regression/change has occurred since #28

fred-atherden commented 3 years ago

Could this be due to an XML change? It looks like from around 2017 there was a change in they way we captured group authors in eLife XML.

Take for example 10.7554/eLife.26030, which has 'Collaborators' enumerated in PubMed. Group members are outlined in a third contrib-group in the XML.

Whereas 10.7554/eLife.25801 is captured in the method (group members are captured as a child of the relevant collab in the first contrib-group) and these aren't enumerated correctly in PubMed.

If my theory is right, that would mean that once a fix is in place, the following articles need resupplying to PubMed:

10.7554/eLife.25801 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.28721 (for Swiss HIV Cohort Study)
10.7554/eLife.29747 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.30274 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.34364 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.37754 (for The New York Genome Center ALS Consortium)
10.7554/eLife.39856 (for eQTLGen Consortium)
10.7554/eLife.39944 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.42463 (for STOP-HCV Consortium)
10.7554/eLife.43511 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.45120 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.45426 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.51019 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.51593 (for Tulsa 1000 Investigators)
10.7554/eLife.54558 (for IAVI Protocol C Investigators)
10.7554/eLife.55684 (for MRC AIMS Consortium)
10.7554/eLife.56651 (for Reproducibility Project: Cancer Biology)
10.7554/eLife.57390 (for HipSci Consortium)
10.7554/eLife.58699 (for CMMID COVID-19 Working Group)
10.7554/eLife.58728 (for The CITIID-NIHR COVID-19 BioResource Collaboration)
10.7554/eLife.59391 (for The CITIID-NIHR COVID-19 BioResource Collaboration)
10.7554/eLife.60352 (for GDSC Screening Team)
10.7554/eLife.60675 (for Oxford University Hospitals Staff Testing Group)

Whereas these ones would be fine:

10.7554/eLife.02935
10.7554/eLife.04024
10.7554/eLife.04034
10.7554/eLife.04037
10.7554/eLife.04105
10.7554/eLife.04180
10.7554/eLife.04186
10.7554/eLife.04363
10.7554/eLife.04586
10.7554/eLife.04796
10.7554/eLife.06416
10.7554/eLife.06434
10.7554/eLife.06847
10.7554/eLife.06938
10.7554/eLife.06959
10.7554/eLife.07072
10.7554/eLife.07301
10.7554/eLife.07383
10.7554/eLife.07420
10.7554/eLife.08245
10.7554/eLife.08648
10.7554/eLife.08714
10.7554/eLife.08997
10.7554/eLife.09462
10.7554/eLife.09976
10.7554/eLife.10012
10.7554/eLife.10860
10.7554/eLife.11414
10.7554/eLife.11566
10.7554/eLife.11999
10.7554/eLife.12217
10.7554/eLife.12470
10.7554/eLife.12626
10.7554/eLife.13410
10.7554/eLife.13620
10.7554/eLife.14258
10.7554/eLife.15266
10.7554/eLife.17044
10.7554/eLife.17584
10.7554/eLife.18173
10.7554/eLife.21253
10.7554/eLife.21634
10.7554/eLife.25306
10.7554/eLife.26030

[Edit: removing bug label and adding enhancement]

gnott commented 3 years ago

I'm looking at details and I don't have the full picture yet, but a couple notes. The elife-pubmed-60675-20200911154510.xml batch file is for the VoR which do include the author names, but they're not deposited as a GroupList.

I think it probably does relate to XML changes when the group author ID value disappeared, and in its place the group authors were nested inside the <collab> tag in the XML. I'm seeing some logic in the XML parser and PubMed XML generator to process it, and a little more testing is required to determine why it isn't functioning as expected in this example.

gnott commented 3 years ago

A clarification in the above, the elife-pubmed-60675-20200821114508.xml deposit was for the v1 PoA XML, and the members of the group are not present in the article XML, so they're not to be found in that PubMed deposit file. The elife-pubmed-60675-20200911154510.xml is for the VoR XML.

fred-atherden commented 3 years ago

Ah yes, I see now that the members are suffixed to the end of the author list.

I've edited the description above.

gnott commented 3 years ago

A possible difference is by comparing with the eLife kitchen sink XML test fixture for the PubMed generation project, https://github.com/elifesciences/elife-pubmed-xml-generation/blob/develop/tests/test_data/elife-00666.xml, the members of a group have a simple <contrib> tag.

In https://github.com/elifesciences/elife-article-xml/blob/master/articles/elife-60675-v2.xml, the tags are <contrib contrib-type="author"> for members of the group, and I think the parser is ignoring their nestedness, instead treating them as authors of the article.

gnott commented 3 years ago

Some very early results, I'm getting some encouragingi output for this example article with a few small modifications.

It looks like the XML parser is setting group_author_key values for all the contributors in a good way, when the contributors are nested inside a <collab> tag, regardless of the @contrib-type attribute value.

There is some logic here relying on a value I put into the pubmed.cfg configuration file named group_author_contrib_types (https://github.com/elifesciences/elife-pubmed-xml-generation/blob/develop/pubmed.cfg#L13). That's where, using the default values, it ignores considering the contrib-type="author" contributors as potentially being part of a group.

I want to review that functionality with fresh eyes later to see check if we can just rely on whether the contributor has a group_author_key value or not is good enough to consider group or non-group inclusion.

Here's what the PubMed generation output looks like with the potential code change when I upload it to the PubMed citation validation page it looks promising!

image

Melissa37 commented 3 years ago

Fabulous, thank you! M

gnott commented 3 years ago

Oops, I replied to the wrong issue today (https://github.com/elifesciences/elife-pubmed-feed/issues/28#issuecomment-765737558), I'll duplicate it here:

Sitting down today, I resolve some errors with existing test cases and trying to confirm the strategy was sound I was following. The code change is in PR https://github.com/elifesciences/elife-pubmed-xml-generation/pull/44.

It is an improvement, I think, to remove the group_author_contrib_types configuration file value, because it was not working and was going to cause more trouble than instead relying on contributor group_author_key attributes that we are already setting when parsing the article XML.

I feel confident with the test coverage and the impact of the change, it's ready to merge and deploy, which I will move to do today. We can adjust or rollback the week of January 24th if we observe any errors.

gnott commented 3 years ago

I resupplied eLife aritcle 60675 to PubMed right now as a test, and I think the generated file looks as expected. This may appear in PubMed next week, or maybe you'll get an error report from PubMed if it cannot be processed, hopefully the former.

@FAtherden-eLife, if you would like to check its PubMed page again in the next few days, it might look better. Then we can resupply other group author articles to PubMed if it worked.

fred-atherden commented 3 years ago

Thanks @gnott, checking today, it looks good. So if we can proceed with re-sending the other affected group author articles, that would be great.

gnott commented 3 years ago

These remaining articles are queued to be on their way to PubMed again in the hour or less:

10.7554/eLife.25801
10.7554/eLife.28721
10.7554/eLife.29747
10.7554/eLife.30274
10.7554/eLife.34364
10.7554/eLife.37754
10.7554/eLife.39856
10.7554/eLife.39944
10.7554/eLife.42463
10.7554/eLife.43511
10.7554/eLife.45120
10.7554/eLife.45426
10.7554/eLife.51019
10.7554/eLife.51593
10.7554/eLife.54558
10.7554/eLife.55684
10.7554/eLife.56651
10.7554/eLife.57390
10.7554/eLife.58699
10.7554/eLife.58728
10.7554/eLife.59391
10.7554/eLife.60352
fred-atherden commented 3 years ago

Great - thanks G!

fred-atherden commented 3 years ago

Hi @gnott, we've received an error report from PubMed for 55684:

File elife-pubmed-55684-20210125184531.xml is not valid. element Group: validity error : Element Group content does not follow the DTD, expecting (GroupName? , IndividualName+), got (GroupName) /AuthorList>MRC AIMS Consortium

Looks like it's expecting group members. This one doesn't actually have any so it didn't need to be resupplied (my mistake - sorry), but nevertheless we should ensure the logic is correct in the PubMed generation project.

Looks like in cases where there's a group, but no individuals it needs to be tagged like this:

<Author>
   <CollectiveName>MRC AIMS Consortium</CollectiveName>
</Author>

We've also had failures for the following:

Since we're updating these do they need the Replaces tag? It does seem strange that some of them (e.g. 25801 seem to have been processed fine though, while others haven't).

gnott commented 3 years ago

Thanks Fred. For the first, 55684 must be a case not yet covered in the test scenarios where there's a <collab> which has no group members. The recent code change is clearly adding <GroupList> to the PubMed deposit when it should not. I'll add a test case and solve the logical cause.

gnott commented 3 years ago

For the other articles, it looks like according to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.How_do_I_interpret_the_Pub it looks like we cannot alter them by a resupply:

"Article matches PMID = , which is not in " [PubMed – as supplied by publisher] " status. This message means the data is already in PubMed and cannot be modified by a Replacement file."

Over to you please @FAtherden-eLife whether you think these could be edited in the PubMed self-management system.

gnott commented 3 years ago

Next bug fix is deployed, and I re-supplied article 55684 to PubMed now. I hope it results in no failure and a correct PubMed page.

fred-atherden commented 3 years ago

Many thanks @gnott. 55684 looks good. I've manually edited the records for those article that weren't in [PubMed – as supplied by publisher] status, so think we can consider this closed.

gnott commented 3 years ago

Thanks for fielding the error reports and fixing the stubborn articles Fred, we got it done!