Handling of erroneous identifier data

emanuil-tolev commented 8 years ago

This can take two forms I've identified so far:

There are zeroes in an identifier column.

PMCID,  PMID, DOI, Article title
123456, 0,    10.1, Test Article
      , 0,    10.2, Test Article 2
98765,  0,        ,

These would all be identified as

123456, 0,    10.1, Test Article

since we issue an OR query to the cache, which requires only one of the identifiers to match. That is usually a reasonable assumption in the world of publishing, provided the data is correct.
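To make the collapse concrete, here is a minimal sketch of OR-style matching over the example rows above. The in-memory `cachedRecords` array and the `findByAnyIdentifier` helper are hypothetical stand-ins; the real cache and its query API are not shown in this issue.

```javascript
// Hypothetical in-memory stand-in for the cache (the real cache is not shown here).
const cachedRecords = [
  { pmcid: '123456', pmid: '0', doi: '10.1', title: 'Test Article' },
];

// OR semantics: a row matches a cached record if ANY one non-blank identifier is equal.
function findByAnyIdentifier(cache, row) {
  return cache.find((rec) =>
    ['pmcid', 'pmid', 'doi'].some(
      (key) => row[key] !== undefined && row[key] !== '' && row[key] === rec[key]
    )
  );
}

// The three rows from the example spreadsheet, with blanks as empty strings.
const rows = [
  { pmcid: '123456', pmid: '0', doi: '10.1', title: 'Test Article' },
  { pmcid: '', pmid: '0', doi: '10.2', title: 'Test Article 2' },
  { pmcid: '98765', pmid: '0', doi: '', title: '' },
];

// Every row carries pmid '0', so each one resolves to the same cached record.
const matchedTitles = rows.map((row) => findByAnyIdentifier(cachedRecords, row).title);
```

In this sketch the second and third rows match only through the shared PMID of 0, even though their PMCID or DOI point elsewhere.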

IMO we should simply treat 0s as if they were blanks for the purposes of the cache lookup. At the time of writing, the condition guarding the PMCID lookup, as an example, is idents.pmcid !== undefined && idents.pmcid !== null && idents.pmcid.length > 0. I think && idents.pmcid !== '0' can be added safely to prevent this particular problem.

This is a convenience feature tied to particular user workflows and to how those users understand publishing: they enter "0" to mean that an article has no PMID or PMCID. Ultimately there is no ambiguity here, so the fix is straightforward.
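The proposed check could be sketched roughly as follows. The helper name `hasUsableIdentifier` and the exact shape of the `idents` object are assumptions made for illustration; only the condition itself comes from the check quoted above.

```javascript
// Hypothetical helper: the existing check plus the proposed '0' exclusion.
function hasUsableIdentifier(value) {
  return (
    value !== undefined &&
    value !== null &&
    value.length > 0 &&
    value !== '0'
  );
}

// Assumed shape of the identifiers parsed from one input row.
const idents = { pmcid: '0', pmid: '0', doi: '10.2' };

// Only identifiers that pass the check would take part in the cache lookup.
const lookupKeys = ['pmcid', 'pmid', 'doi'].filter((key) =>
  hasUsableIdentifier(idents[key])
);
// lookupKeys is ['doi']
```

Note that this sketch compares against the exact string '0', so values such as '00' or ' 0' would still pass; whether that matters depends on how the spreadsheet cells are parsed.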

There are mistakes in the data

PMCID,  PMID, DOI, Article title
123456, 10,    10.1, Test Article
      , 20,    10.2, Test Article 2
98765,  10,        , Oopsie

In this case, the Oopsie row will be identified as the Test Article row, since the PMID is the same. This is ambiguous.

Currently what seems to happen is that the article title is overwritten on output (so Oopsie becomes Test Article). IMO this is good behaviour: if all the compliance information related to Test Article but the title still said Oopsie, the two rows would look (to a human) like distinct records even though the information would all be about Test Article. The overwriting makes it clear that it's all about Test Article.
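As an illustration of the observed effect only, the overwrite can be pictured like this. The `buildOutputRow` function is hypothetical and is not taken from Lantern's code; it just reproduces the title replacement described above.

```javascript
// Hypothetical sketch: the matched record's title replaces the input row's title on output.
function buildOutputRow(inputRow, matchedRecord) {
  return { ...inputRow, title: matchedRecord.title };
}

// The record that the shared PMID of 10 resolves to.
const matched = { pmcid: '123456', pmid: '10', doi: '10.1', title: 'Test Article' };

// The third row of the example, whose PMID collides with the first row.
const oopsieRow = { pmcid: '98765', pmid: '10', doi: '', title: 'Oopsie' };

const outputRow = buildOutputRow(oopsieRow, matched);
// outputRow.title is 'Test Article'; the other input fields are left untouched
```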

I don't currently think we should take any action here, but FYI for both of you, since this is probably one of the most important areas where we could encounter erroneous data. I am also discussing cases like this with Wellcome, so they might ultimately have a different point of view on whether Lantern's behaviour needs changing here.

emanuil-tolev commented 8 years ago

Currently what seems to happen is that the article title is overwritten on output (so Oopsie becomes Test Article). IMO this is good behaviour

Live job showcasing this (compare "download original" to "download results" versions to see the effect in action): https://compliance.cottagelabs.com/#urLMywKJkTtdTtDto

emanuil-tolev commented 8 years ago

Deployed

CottageLabs / LanternPM

Handling of erroneous identifier data #78