Writing out PMCIDs on articles failed to be found in EPMC

emanuil-tolev commented 7 years ago

This could be related to the EPMC throttling problem we've been having. The problem articles had a note saying they were not in EPMC, but they actually are in EPMC.

Here is a correct full result set for the original file they sent you via email: https://compliance.cottagelabs.com/#swBsAjZfsaw9HpkvH . Send this to the user asking.

All articles are fine, as you can see. I did not change anything on the live system, so it's definitely an intermittent situation. There is probably an issue with how we handle error states - we might well be stripping the PMC prefix from identifiers on output in certain situations. If you do "download original" you will see that the PMC prefix is stripped from all of them - I'd guess the same code is responsible for removing it from rows 4 and 11 of the problematic results sheet.

The attached two files, both "the original" (one the actual original emailed by the user, the other one downloaded from Lantern) show the difference clearly.

The PMCIDs have not actually been changed, the number part is the same - it's a formatting question.

Original files with descriptive names to tie them to the above:

WT2015_original_files.zip

Problematic results file with rows 4 and 11 formatted strangely, upsetting the user:

WT2015_problematic_result_rows_4_and_11_wrong_PMCID_format.zip

markmacgillivray commented 7 years ago

Yes, I think you have discovered what the user already told us. The question is, what are we expected to format for wellcome output - PMC IDs with or without the "PMC" prefix, and if the user has already provided one, should we be overwriting them or not anyway?

So note, this is not the same problem as whether or not epmc is down and whether or not we fail to find something we should find.

emanuil-tolev commented 7 years ago

> So note, this is not the same problem as whether or not epmc is down and whether or not we fail to find something we should find.

No, it's not, but it seems to produce different results for the records where it failed (the others are prefixed with PMC - despite all rows being prefixed with PMC in the upload).

> The question is, what are we expected to format for wellcome output - PMC IDs with or without the "PMC" prefix, and if the user has already provided one, should we be overwriting them or not anyway?

Yes, and that is quite a complex question to answer unfortunately... I'd say for the first part, most people seem to expect it with the prefix - maybe as a way to distinguish it more quickly from the PMID.

The overwriting-or-not is the harder question. We should ensure it is clear what IDs the information on the right is related to. If it takes overwriting to do that (e.g. we found a different PMCID from the uploaded one when we went through the PMID or something), then we should overwrite.

The inconsistent formatting made it look like a bit of a problem in this case, so on the other hand if there is actually no need to overwrite, then we should not.
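To make the overwrite-or-not rule concrete, here is a rough sketch in Python. It is not taken from the Lantern codebase: `choose_output_pmcid` and its note wording are hypothetical, and it assumes a failed EPMC lookup is represented as `None`. It only overwrites when the lookup returned a PMCID whose number differs from the uploaded one, and it always writes the ID out with the prefix so the formatting stays consistent across rows.

```python
def _digits(value):
    """Return only the numeric part of an identifier ('' for None)."""
    return "".join(ch for ch in (value or "") if ch.isdigit())


def choose_output_pmcid(uploaded, found):
    """Decide which PMCID to write to the results sheet.

    uploaded: PMCID from the user's sheet, with or without "PMC" (or None).
    found:    PMCID returned by the lookup, or None if the lookup failed.
    Returns (pmcid_or_None, processing_note_or_None).
    Hypothetical helper: names and note text are illustrative only.
    """
    up, fd = _digits(uploaded), _digits(found)

    if not fd:
        # Lookup failed or returned nothing: keep the user's ID,
        # but still format it consistently with the prefix.
        return ("PMC" + up if up else None), None

    if not up:
        # The user gave no PMCID, so there is nothing to overwrite.
        return "PMC" + fd, "PMCID added from lookup."

    if up == fd:
        # Same number: no overwrite needed, only consistent formatting.
        return "PMC" + up, None

    # Different number: overwrite, and say so in the processing notes.
    note = f"Uploaded PMCID PMC{up} replaced with PMC{fd} found via lookup."
    return "PMC" + fd, note
```

With this rule, a row whose lookup failed comes out in the same format as every other row, which is the inconsistency that caused the confusion here.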

emanuil-tolev commented 7 years ago

I don't personally feel this is particularly urgent at all btw.

markmacgillivray commented 7 years ago

I believe PMC IDs should indeed be preceded by PMC. So if we don't do that when we do choose to overwrite, isn't that a simple question of what wellcome required? And similarly, whether or not we overwrite what they provided with what we find would seem to be a question of requirements too.

If we don't know which way they require, then we can have no answer. So either we failed to achieve their requirement during our testing, or they didn't yet have a requirement for this. In that case, we'd have to say which way we expect it to work, and stipulate that is how it works.

That's why I thought this issue would have been relevant to the previous testing rounds you did with wellcome - to know how they expect it, and to know if we are supposed to know how they expect it.

If it's not urgent, then what should @richard-jones say to the customer, and when should they expect to hear back from us with an answer?

On 22 Dec 2016 19:39, "Emanuil Tolev" notifications@github.com wrote:

I don't personally feel this is particularly urgent at all btw.


richard-jones commented 7 years ago

Just picking this up again, as I consider how to respond with an update to the support request.

I am certain that PMC ids must have the prefix "PMC" - this is not an optional part of the identifier; without it the id is not a pmcid. So, whatever happens, our output should always include it.

In fact, I'm not sure how we would get into the situation of having a PMCID without the PMC prefix (it would be like having a doi without the 10. prefix), so it would be worth looking into that.

It's possible that we're allowing users to enter PMCIDs without that prefix in the PMCID column, in which case we just need to determine what the appropriate behaviour is. In the previous version of the system, I believe it added the prefix and updated the processing notes explaining the normalisation, so we should just continue to do that if we are not already.
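For reference, the normalisation described above could look something like the following. This is a minimal sketch in Python, not the actual Lantern code: the function name `normalise_pmcid`, the accepted input shapes (digits with an optional, case-insensitive "PMC" prefix) and the note text are all assumptions made for illustration.

```python
import re

# Assumed input shape: optional "PMC" prefix (any case), then digits.
_PMCID_RE = re.compile(r"^\s*(?:PMC)?\s*(\d+)\s*$", re.IGNORECASE)


def normalise_pmcid(raw):
    """Return (pmcid, note) for a value from the PMCID column.

    pmcid is always "PMC" + digits, or None if raw is not recognisable.
    note is a processing note describing any change, or None if the
    value was already in canonical form.
    Hypothetical helper: not part of the Lantern codebase.
    """
    match = _PMCID_RE.match(raw or "")
    if match is None:
        return None, f"Value {raw!r} is not a recognisable PMCID."

    pmcid = "PMC" + match.group(1)
    if raw.strip() == pmcid:
        return pmcid, None  # already canonical, nothing to report
    return pmcid, f"PMCID normalised from {raw!r} to {pmcid!r}."
```

Applying the same function to every row on output, whether or not the EPMC lookup succeeded, would keep the exported identifiers consistent.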

markmacgillivray commented 7 years ago

Users can provide PMCIDs without "PMC" prefix, but we always add it. We also always show it in exported results, even if we could not find the item in EPMC due to failure/limiting at their end, so now the export would show "PMC" prefixes on all PMCIDs even if the user did not provide them.

CottageLabs / LanternPM

Writing out PMCIDs on articles failed to be found in EPMC #130