CottageLabs / LanternPM

Lantern meta repository for product management

Author manuscript is false, but EPMC shows them as being true #112

Closed richard-jones closed 8 years ago

richard-jones commented 8 years ago

Found a number of articles which were listed as author manuscript? = FALSE but which are clearly author manuscripts if you look on Europe PMC. Examples:

PMC3051421 PMC3086759 PMC3116142

@emanuil-tolev to review and assign to @markmacgillivray if it needs fixing

emanuil-tolev commented 8 years ago

These come out as TRUE when I put them in a test sheet. I will feed back to Cecy; I don't know why she got FALSE.

richard-jones commented 8 years ago

Could this be an intermittent problem?

emanuil-tolev commented 8 years ago

Hmm. Perhaps EPMC failing would cause this. We need to make sure we put "unknown" in such situations. I asked Cecy for a link to her old job where she got False.

Regardless of her response, I will run the test sheet locally, then disable EPMC by editing my local /etc/hosts and see what I get. Ideally we want "unknown" and a suitable note. The code currently has a few empty catch {} blocks though (i.e. it swallows exceptions), which are in place because HTTP errors used to crash the jobs, so if this is the problem it might need slightly more serious looking into.
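
A minimal sketch of the "unknown plus a note" behaviour described above, written in Python purely for illustration (the Lantern code itself is not shown in this thread). The function name, the URL pattern and the text-matching rule are all assumptions; the point is only that a failed fetch yields "unknown" rather than falling through to FALSE.

```python
from typing import Callable, Optional, Tuple

UNKNOWN = "unknown"


def author_manuscript_status(
    fetch_page: Callable[[str], str], pmcid: str
) -> Tuple[str, Optional[str]]:
    """Return (status, note); status is "TRUE", "FALSE" or "unknown".

    fetch_page is a hypothetical callable that returns the page body
    or raises an OSError subclass (connection error, timeout, ...).
    """
    # Assumed URL pattern, used here only for illustration.
    url = f"https://europepmc.org/articles/{pmcid}"
    try:
        html = fetch_page(url)
    except OSError as exc:
        # Do not swallow the error: report that we could not tell.
        note = f"Could not retrieve {url} for author manuscript check: {exc}"
        return UNKNOWN, note
    # Placeholder detection rule, not the real one.
    is_author_manuscript = "author manuscript" in html.lower()
    return ("TRUE" if is_author_manuscript else "FALSE"), None
```

With this shape, FALSE is only ever reported when the page was actually fetched and inspected.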

markmacgillivray commented 8 years ago

It is possible that an EPMC failure could be the cause. But then I'd expect other info to be missing too, so it depends on what we can see in the job she ran, if we can identify it. Also, the job would need to be fairly recent, because before the error handling was added an EPMC failure would just knock the job over and we'd get no result at all. Another possibility, if she ran this a while ago, is a cached result that changed after we introduced new functionality. To date, EPMC is the only remote source we use that I've personally seen direct evidence of being down, so yeah, it's worth looking into that.

emanuil-tolev commented 8 years ago

EPMC could fail for one request but not another (i.e. an intermittent failure, rather than a "blanket" failure for X minutes). We'll see which one it is. We should probably add a bit of information to the processing notes anyway, saying specifically that "we could not retrieve X" because of an error on the data source's side.
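
To illustrate the per-request case, here is a hedged Python sketch in which each lookup fails or succeeds independently and every failure adds its own processing note. The name run_lookups and the field names are hypothetical and are not taken from the Lantern code.

```python
from typing import Callable, Dict, List, Tuple


def run_lookups(
    lookups: Dict[str, Callable[[], str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Run each lookup on its own; one failure does not affect the others."""
    results: Dict[str, str] = {}
    notes: List[str] = []
    for field, lookup in lookups.items():
        try:
            results[field] = lookup()
        except OSError as exc:
            # Record "unknown" for this field only, and say why.
            results[field] = "unknown"
            notes.append(
                f"We could not retrieve {field} because of an error "
                f"on the data source's side: {exc}"
            )
    return results, notes
```

A row processed this way can still be largely complete while one column reads "unknown", and the notes say which source was at fault.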

markmacgillivray commented 8 years ago

Yes, but I mean that if it failed for one request, there would be a bunch of info missing in the result of that process, unless it failed ONLY for the author manuscript request within that process.

emanuil-tolev commented 8 years ago

Heh, turns out it depends on which bit of EPMC fails. If europepmc.org fails (I directed it to 10.10.10.10, a sure connection timeout on my current network), then it takes a long time to process, but eventually does somehow find out what the PMID and DOI are. Then it uses ebi.ac.uk very successfully for Grist and CORE lookups. In fact the results look pretty complete, with the exception that author manuscript? is FALSE but should clearly be unknown.

Also, there is a note in the log for "Error while fetching [europepmc.org url] for academic licence check".

emanuil-tolev commented 8 years ago

(In contrast to the ebi.ac.uk endpoint, europepmc.org serves the JSON REST API, the HTML pages used for author manuscript and licence detection, and the EPMC fulltext XML.)

emanuil-tolev commented 8 years ago

@markmacgillivray if you'd like to take a look at https://github.com/CottageLabs/api/pull/12, that's a stab at resolving this issue by allowing "unknown" in the Author Manuscript column and adding appropriate notes to the processing notes column.