Send in-situ correction for each article version in Crossmark data

gnott commented 4 years ago

Originating from issue https://github.com/elifesciences/elife-crossref-feed/issues/145, I think it will be clearer to split off details concerning the in-situ Crossmark data here, @Melissa37.

Repeating some of the details from the original issue here:

https://www.crossref.org/education/crossmark/crossmark-registering-updates/

New versions From Crossref:

Example 3: in-situ correction When a member does not issue a separate update/correction/retraction notice and instead just makes the change to the document (without changing its DOI either), this is called an in-situ update. In-situ updates or corrections are not recommended because they tend to obscure the scholarly record. How do you tell what the differences are between what you downloaded and the update? How do you differentiate them when citing them (remember, we are only talking about “significant updates” here)? However, some members need to support in-situ updates, and this is how they can be supported.

<updates>
<update type="correction" label="Correction" date="2012-05-12">10.5555/12345681</update>
</updates>
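As a rough illustration of the tag above, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `build_updates` is hypothetical and is not part of the elifecrossref library; it only shows how the DOI, type, label and date fit together.

```python
from xml.etree import ElementTree as ET


def build_updates(doi, update_type, label, date):
    """Return an <updates> element containing a single <update> tag."""
    updates = ET.Element("updates")
    update = ET.SubElement(updates, "update")
    update.set("type", update_type)
    update.set("label", label)
    update.set("date", date)
    # For an in-situ correction the DOI is that of the article itself
    update.text = doi
    return updates


element = build_updates("10.5555/12345681", "correction", "Correction", "2012-05-12")
print(ET.tostring(element, encoding="unicode"))
```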

A start on a Definition of Done list:

[x] Q. How does article version history in an article XML affect in-situ updates? Answer: We can ignore this for now, because version history in the XML is still to arrive later in the future for eLife
[x] Q. If there is no version history in an article XML file, would we gather the previous version dates from an external source, which in the case of eLife would be from Lax data? Answer: Yes, we can try getting it from Lax for now, maybe in future it could come from Observer data or maybe BigQuery data? The source of data could vary, but it will still go into the same Article object data property
[ ] Test a sample of a correction and in-situ corrections of the article it corrected (where the date of the "v2" of the article corrected will be the same date as the correction article)
[ ] Send data about article versions to Crossref using Crossmark program tags
[ ] Should this always be enabled if the config file has crossmark: true, and the article version data to be included, or should it be a separate config setting whether in-situ corrections are to be included in the Crossref deposits?
[ ] Do the code changes
[ ] Add test scenarios

gnott commented 4 years ago

In looking at this a little more closely today, the Crossref example of the <update type="correction" ... tag is straightforward; it looks just like the tag we are now adding when depositing a correction article, except the DOI of a correction article is the DOI it is correcting, and an in-situ correction would be the same DOI as the article (correcting itself).

The simple Crossref generation logic starts with an article XML file. This XML may or may not include data about the article's version history, the dates of the previous versions of an article. At least in older (and probably current) eLife article XML, it does not include a list of the previous article versions in the XML. In future it may, but it is not always there. Did you intend at all @Melissa37 for eLife's case that the version corrections of articles for eLife would only be deposited after the article history is present in the article XML?

Another possible source of version data, again in eLife's case, is the Lax datastore. There is also the ability to be flexible in using the Crossref deposit generation code as a step-wise process. The article data can initially be populated from an article XML file, then data on that Article object can be altered or amended. When a DepositCrossref workflow is run for eLife, we could gather the previous versions and dates of each version from Lax and add those to the Article object, prior to generating the Crossref deposit XML as the final step.

The Article does not yet have a property to store article history (I think) but we can add that for visibility and completeness.

@Melissa37 if what I'm describing here is clear enough, am I along the right track and understanding how eLife might deposit in-situ corrections?

Each time a new article version is deposited to Crossref, we'd populate the full history of previous versions. Currently we seem to only need the date of each version for Crossref's purposes.

One interesting situation may arise from a formal correction, although it fits within these rules. For example, say an article is published, then corrected. The correction article would result in a Crossref deposit including a correction that points back to the article it is correcting. The corrected article (a version 2 of that article, I would presume) would result in a Crossref deposit that includes a correction to itself; the date of that correction would be the date the version 2 was published. In this way, a formal correction would result in two Crossref correction deposits: one formal one, and one in-situ one.

Melissa37 commented 4 years ago

Did you intend at all @Melissa37 for eLife's case that the version corrections of articles for eLife would only be deposited after the article history is present in the article XML?

No, I was planning to base it on Lax or observer data. Ultimately I would like articles to contain their historical version info, but I don't think this will happen for another year.

@Melissa37 if what I'm describing here is clear enough, am I along the right track and understanding how eLife might deposit in-situ corrections?

This sounds perfect, thank you.

Each time a new article version is deposited to Crossref, we'd populate the full history of previous versions. Currently we seem to only need the date of each version for Crossref's purposes.

So when we start publishing history event dates we would have the potential to update the archive if we ever thought it was worthwhile doing it. Nice :-)

One interesting situation may arise from a formal correction, although it fits within these rules. For example, say an article is published, then corrected. The correction article would result in a Crossref deposit including a correction that points back to the article it is correcting. The corrected article (a version 2 of that article, I would presume) would result in a Crossref deposit that includes a correction to itself; the date of that correction would be the date the version 2 was published. In this way, a formal correction would result in two Crossref correction deposits: one formal one, and one in-situ one.

This is a very good point. We version the article if it has an official correction/erratum notice attached to it. Can you think of a way to circumvent this so it only has the formal and not the in situ one as well? They are generally both published at the same time, so we could put a hold on depositing in situ corrections (say 24 hour window) to Crossref and then do a check to see whether an official correction was done in that 24 or previous 24 hr window. WDYT?
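The 24-hour hold could be sketched roughly as below. This is only an illustration of the idea, with a hypothetical function name and made-up dates; where the correction dates would actually come from is not settled here.

```python
from datetime import datetime, timedelta


def formal_correction_in_window(version_published, correction_dates, hours=24):
    """Return True if any formal correction falls within +/- `hours` of the version."""
    window = timedelta(hours=hours)
    return any(
        abs(correction_date - version_published) <= window
        for correction_date in correction_dates
    )


# Illustrative dates only
v2_published = datetime(2020, 3, 10, 9, 0)
corrections = [datetime(2020, 3, 10, 15, 30)]
# If True, the in-situ deposit for v2 would be skipped in favour of the formal one
skip_in_situ = formal_correction_in_window(v2_published, corrections)
```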

gnott commented 4 years ago

Thanks @Melissa37! These answers are enough for me to continue with making data structures for article history and populating them with data from Lax for a start.

I think the correction + version resulting in two Crossref deposits is not too easy to resolve at this point. What happens when a v3 of the article is published? Then the Crossref deposit will not be accompanied by the correction article when it is deposited. Does Crossref de-dupe Crossmark correction "updates-to" data in their systems if two correction records have the same date? I think maybe one way to try it out is to find a suitable article to use as a test and to deposit the correction and the in-situ data for it and see what the result is at Crossref. We can probably reverse that if we don't like the result.

gnott commented 4 years ago

Today I tried a quick example for the in-situ updates data.

For the article's version history data, there are at least a couple of ways to structure it, I think, to solve a potential complication, which is: I assume we would never include the version 1 of an article as a Crossmark correction.

Possible way one is to rely on eLife's versioning convention, which is that all versions are numeric and start at 1. If this is safe, then we can base it on this, where we'd not include the version 1 in Crossref Crossmark deposits.

Possible way two, is in the article's version history, we push the logic a little further upstream and during parsing the article XML, we add an attribute to each version to indicate which type of version it is. Borrowing from issue https://github.com/elifesciences/issues/issues/3463 (if it is still applicable), we could have VoR and CVoR (Corrected Version of Record), for example. eLife could also continue to use PoA.

Describing this, I think I may have hit on something for eLife's situation, and another question for you @Melissa37 for clarification. If v1 is a PoA, v2 is VoR, and v3 is a VoR version, then would we only want to report the v3 to Crossref as an in-situ correction? In this situation, I think it would be safer and more straightforward to label each article version in the version history with some labels so we can produce a good Crossref deposit.
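To compare the two possible ways, here is a small sketch. The function names and the dictionary keys are hypothetical, and it assumes the v3 in the example above would carry a CVoR label, which has not been decided.

```python
def by_version_number(history):
    """Way one: drop version 1, relying on versions being integers starting at 1."""
    return [event for event in history if event["version"] > 1]


def by_version_type(history, reportable_types=("CVoR",)):
    """Way two: keep only versions whose label is in reportable_types."""
    return [event for event in history if event["version_type"] in reportable_types]


# Assumed labels: v3 is marked CVoR here purely for illustration
history = [
    {"version": 1, "version_type": "PoA"},
    {"version": 2, "version_type": "VoR"},
    {"version": 3, "version_type": "CVoR"},
]

way_one = [event["version"] for event in by_version_number(history)]
way_two = [event["version"] for event in by_version_type(history)]
```

With this sample data, way one would report v2 and v3, while way two would report only v3.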

Melissa37 commented 4 years ago

For the article's version history data, there are at least a couple of ways to structure it, I think, to solve a potential complication, which is: I assume we would never include the version 1 of an article as a Crossmark correction.

Can I just clarify what you mean by this? Are you saying Crossref does not consider the difference between a PoA and a VoR as a new version and we can ignore PoA 1 or PoA versions from this?

I think that is correct right now, but I personally would like to change Crossref's thinking on this! I think the difference between a PoA and VoR has a LOT of difference, but I guess I would think that as I deal with production and see all the value we add ;-)

Possible way two, is in the article's version history, we push the logic a little further upstream and during parsing the article XML, we add an attribute to each version to indicate which type of version it is. Borrowing from issue elifesciences/issues#3463 (if it is still applicable), we could have VoR and CVoR (Corrected Version of Record), for example. eLife could also continue to use PoA.

This is interesting and I like the sound of this. The only problem with this is that for each new version (if there are multiple versions) you would lose the added details you have as it comes from Exeter and production again? I still want to introduce this into production as we'd also give a reason for the change, which I feel should be in the XML - currently this is stored in Hypothesis commenting. However, we could use what work you do to provide some validation - ie in future what you would add if missing from the XML or different could result in a rejection?

Melissa37 commented 4 years ago

Describing this, I think I may have hit on something for eLife's situation, and another question for you @Melissa37 for clarification. If v1 is a PoA, v2 is VoR, and v3 is a VoR version, then would we only want to report the v3 to Crossref as an in-situ correction? In this situation, I think it would be safer and more straightforward to label each article version in the version history with some labels so we can produce a good Crossref deposit.

I think I answered the question above. But I am curious about

safer and more straightforward to label each article version in the version history with some labels so we can produce a good Crossref deposit.

what is adding these labels and to where? In the future they will be in the XML coming from production, but until then...

gnott commented 4 years ago

Regarding the PoA to VoR being a correction or not, I may have mis-remembered discussions about how a PoA is the same article in a different format, which was not considered to be a correction. I may have this wrong. If you consider PoA to VoR as a correction, then we can most certainly include that as a correction in Crossref Crossmark data.

Melissa37 commented 4 years ago

I consider it a new version that warrants an update in Crossref's CrossMark widget, but I don't consider it a correction. So, in essence, you can ignore me :-)

gnott commented 4 years ago

For the version history data to support Crossref Crossmark, it would be specified and stored in the elife-article Python objects, which is the data that populates Crossref, PubMed and PoA XML generation Python libraries. Since it only affects these parts, it would be separate from any other schemas.

Each version history event would probably have:

version
version_type (PoA, VoR, CVoR, etc.)
date

at a minimum, and these are attached to an Article that already has a DOI.

Optionally, the version history event could store a "comment" or additional details, but there is no regular source for those details right now, nor does Crossref Crossmark need those.

The version event data would come from Lax and be slotted in place as part of a Crossref deposit workflow.
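A minimal sketch of what that could look like follows. The `Article` class here is a simplified stand-in, not the real elife-article class, and `VersionEvent`, `add_version_history` and the record keys are all hypothetical; the real shape of the Lax data is not shown in this thread.

```python
import datetime
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionEvent:
    version: int
    version_type: str  # e.g. "PoA", "VoR", "CVoR"
    date: datetime.date
    comment: Optional[str] = None  # optional; not needed by Crossmark


@dataclass
class Article:
    # Simplified stand-in for the real Article object
    doi: str
    version_history: List[VersionEvent] = field(default_factory=list)


def add_version_history(article, records):
    """Attach version events to the article, sorted by version number."""
    for record in sorted(records, key=lambda item: item["version"]):
        article.version_history.append(
            VersionEvent(
                version=record["version"],
                version_type=record["type"],
                date=datetime.date.fromisoformat(record["date"]),
            )
        )
    return article


# Hypothetical records standing in for data gathered from Lax
records = [
    {"version": 2, "type": "VoR", "date": "2020-02-01"},
    {"version": 1, "type": "PoA", "date": "2020-01-10"},
]
article = add_version_history(Article(doi="10.5555/12345681"), records)
```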

gnott commented 4 years ago

You brought up a good point, thanks @Melissa37, because we are not limited to "correction". These are the values allowed from the Crossref schema (https://www.crossref.org/schemas/common4.4.1.xsd)

addendum
clarification
correction
corrigendum
erratum
expression_of_concern
new_edition
new_version
partial_retraction
removal
retraction
withdrawal

So we could report the VoR as a new_version and the CVoR as a correction, or maybe it should be a new_version... ?
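One way this could be expressed is as a lookup from version type to update type, checked against the allowed values listed above. The mapping itself is an assumption for illustration, not a decision, and the names are hypothetical.

```python
ALLOWED_UPDATE_TYPES = {
    "addendum", "clarification", "correction", "corrigendum", "erratum",
    "expression_of_concern", "new_edition", "new_version",
    "partial_retraction", "removal", "retraction", "withdrawal",
}

# Assumed mapping; which type a CVoR should use is still an open question
UPDATE_TYPE_BY_VERSION_TYPE = {
    "VoR": "new_version",
    "CVoR": "correction",
}


def update_type_for(version_type):
    """Return the Crossmark update type for a version label, or None to skip it."""
    update_type = UPDATE_TYPE_BY_VERSION_TYPE.get(version_type)
    if update_type is not None and update_type not in ALLOWED_UPDATE_TYPES:
        raise ValueError("%s is not an allowed update type" % update_type)
    return update_type
```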

Melissa37 commented 4 years ago

aha!! New version all the way for PoA to VoR and for new VoRs that do not have an official correction associated with them?

Guess this changes things?!

gnott commented 4 years ago

It helps, although we might discuss on a next call?

I think the challenge would be that sometimes a new version is a correction: if there was a formal correction published, it would be associated with a particular version. If the dates of the correction and the article version match, we may be able to collate all the version history event types.
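The date matching could look something like the sketch below, assuming an exact same-day match is enough. The function name and the dates are made up for illustration.

```python
import datetime


def collate_update_types(version_dates, correction_dates):
    """Label a version "correction" if a formal correction shares its date, else "new_version"."""
    correction_days = set(correction_dates)
    return {
        version: "correction" if day in correction_days else "new_version"
        for version, day in version_dates.items()
    }


# Illustrative dates only: a formal correction was published the same day as v3
version_dates = {
    2: datetime.date(2020, 2, 1),
    3: datetime.date(2020, 6, 15),
}
correction_dates = [datetime.date(2020, 6, 15)]
update_types = collate_update_types(version_dates, correction_dates)
```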

Melissa37 commented 4 years ago

Fab, on the agenda! :-)

Melissa37 commented 4 years ago

Pending JATS4R versioning history recommendation

gnott commented 4 years ago

I had a bit of code in development until this was blocked by looking for version history data.

To act as a reminder of what I had done, which will mostly get reversed now, in elifecrossref/crossmark.py, I expanded the criteria for do_updates() to include whether the article object had version_history:

def do_updates(poa_article):
    """decide if crossmark updates tag can be added"""
    return bool(
        (
            poa_article.article_type in UPDATES_ARTICLE_TYPES and
            poa_article.related_articles and
            poa_article.related_articles[0].xlink_href
            )
        or
        (
            hasattr(poa_article, 'version_history') and
            poa_article.version_history
            )
        )

then in set_updates(),

...
    if hasattr(poa_article, 'version_history') and poa_article.version_history:
        for previous_version in poa_article.version_history:
            set_update(
...

I will merge in the new set_update() function that I had split out to make things cleaner, which is not blocked by version history data.

I will also remove the test scenario I created for in-situ correction, because it is pretty simple and not very elaborately done yet.

gnott commented 2 years ago

I had a fresh look at this issue, ready to discuss next time @FAtherden-eLife.

I think perhaps it would be a good idea to create a new issue and add to it the questions and tasks going forward, and then we can close this older discussion.

There's currently support for depositing correction and retraction articles. Can all the other updates we want to deposit be new_version type?

I believe we wanted to get the version history from a non-article-XML source, either Lax or data hub.

A potential wrinkle we may want to consider is whether the version history data is better deposited to Crossref in the post-publication tasks. Now we have Pending Publication DOI logic enabled, can we deposit full Crossref metadata post-publication always now? It might be a good time to review at which point in the publication workflow Crossref deposits happen.

fred-atherden commented 2 years ago

Thanks @gnott, sounds good - let's discuss in elifesciences/issues#7177.

elifesciences / elife-crossref-feed

Send in-situ correction for each article version in Crossmark data #147