Have changed in dev but not deployed to live. If an exact "title" match (quoted) returns exactly 1 result, then confidence is 0.9. If it does not return exactly 1 result, a title match (no quotes) is performed, and if that returns exactly 1 result then confidence is 0.8. @richard-jones is this good enough? If so I will close the issue and deploy to live.
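For illustration only, here is a minimal sketch of that cascade (the `epmc_search` helper and the return shape are assumptions, not the actual Lantern code):

```python
# Hypothetical sketch of the dev-branch behaviour described above.
# epmc_search is an assumed helper that takes a query string and
# returns a list of candidate records from EPMC.

def match_by_title(title, epmc_search):
    # Step 1: exact (quoted) title match.
    exact_hits = epmc_search('TITLE:"%s"' % title)
    if len(exact_hits) == 1:
        return exact_hits[0], 0.9

    # Step 2: fall back to an unquoted (keyword) title match.
    fuzzy_hits = epmc_search('TITLE:%s' % title)
    if len(fuzzy_hits) == 1:
        return fuzzy_hits[0], 0.8

    # Anything else (0 or multiple hits at both steps) is no match.
    return None, 0.0
```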
Yes, that sounds the same as the previous behaviour, though I think we set the confidence lower (0.7 IIRC).
When you've pushed this, could you assign to @emanuil-tolev just to review the documentation and make sure it matches up?
EPMC confidence matches in old system:
We don't have to follow this old scoring. The (often inaccurate) results I've seen on inexact matches from EPMC do make me think we want to set it to 0.6 or 0.7 to reflect the quality of the results.
Could you also prod me when this is rolled out to live, so I can respond to the original email from Wellcome.
Changed to use 1, 0.9 and 0.7
@richard-jones this is on live now, closing
thanks - I just need to double check docs
Actually the code now exactly matches the docs, which always did imply the old-style behaviour (confidence 1, 0.9 or 0.7).
A bit more feedback:
Something doesn't seem right. I can see that in the Correct Article Confidence column we have 0.9, 0.7 and 0, which is good. However, all the rows where it says 0.7 actually have no article data associated with them. If no article/article IDs are found, then I think the Correct Article Confidence score should be 0.
Could you look into this, and make it so that if no article is found the Confidence Score is 0, and if an article is found via the title keyword search (i.e. TITLE: with no quote marks) it is 0.7?
Sounds pretty simple: it's just a result of the 1 / 0.9 / 0.7 cascade, which should produce a 0 if there are no results at all. I'll make a PR as soon as I've redone my other PR, which will hopefully be shortly.
@markmacgillivray this is very odd. The quoted job https://compliance.cottagelabs.com/#vnLgjhGX69mBuK8Np has both articles with 0 confidence and articles with 0.7. But we haven't got identifiers for either the 0s or the 0.7s - they all have notes like this:
Unable to locate article in EPMC.
Unable to obtain DOI, PMID or PMCID for this article. Compliance information may be severely limited.
Not attempting Crossref or CORE lookups - do not have DOI for article.
Not attempting Grist API grant lookups since no grants data was obtained from EUPMC.
Not checking ahead of print status on pubmed. We don't have the article's PMID.
Not attempting to add any data from Sherpa Romeo - don't have a journal ISSN to use for lookup.
Unable to retrieve licence data via article publisher splash page (used to be OAG) - cannot obtain a suitable URL to run the licence detection on.
None of them have a PMCID, PMID or DOI. If it rings a bell as to why, let me know; otherwise I'll keep digging - it seems quite odd to have essentially the same totally negative result and two different lookup confidence scores.
@emanuil-tolev @markmacgillivray - is this fixed?
I don't know - looks like the issue is waiting for further input from @emanuil-tolev
@emanuil-tolev are you going to do this?
I have looked at the code: confidence starts at 0, so for things where we find nothing more, it stays at 0. However, if we get as far as doing a non-exact title match against EPMC and we find something that COULD be the article, the confidence is set to 0.7. If we find something in EPMC that somehow does not have a PMCID, then we could have a record with a confidence of 0.7 and yet no PMCID. Does this matter?
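As a rough sketch of the behaviour being described (field names and the `matched_by` flag are hypothetical, not taken from the Lantern codebase):

```python
# Illustrative only: confidence assignment as described in this thread.
# `record` is an assumed dict-like EPMC result, or None if nothing was found.

def assign_confidence(record):
    confidence = 0.0                  # default: nothing found, stays at 0

    if record is None:
        return confidence

    if record.get("matched_by") == "identifier":
        confidence = 1.0              # resolved via PMCID / PMID / DOI
    elif record.get("matched_by") == "exact_title":
        confidence = 0.9              # single hit from the quoted title search
    elif record.get("matched_by") == "fuzzy_title":
        confidence = 0.7              # single hit from the unquoted title search

    # The question raised above: a fuzzy-title hit that carries no PMCID,
    # PMID or DOI still ends up with 0.7, even though nothing further can
    # be looked up for it.
    return confidence
```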
then we could have a record with a confidence of 0.7 and yet no PMCID. Does this matter?
Yeah, I reckon that might matter, and any similar cases (if there are any). The original feedback above https://github.com/CottageLabs/LanternPM/issues/106#issuecomment-242399920 states exactly that - the article should appear as a 0 to the user if we have not actually been able to get article data.
It's a bit odd otherwise: a 0.7 with no identifiers... so what are we 0.7 confident in?
If it is possible for EPMC to hold records that have no PMCID, PMID or DOI, then a non-exact title match may return some sort of record with a similar title but none of those identifiers. So we would be 0.7 confident that the record EPMC returned is correct, but we would have no useful information to do anything further with.
I have downloaded the compliance job and will test with the values they are referring to. I am already doing work on the lantern backend anyway, and want to get these issues sorted. Will reassign to myself for further investigation.
Selecting one that had a title match of 0.7 but no match in EUPMC, and running a fuzzy match on EUPMC, returns a result with 4 items, none of which are right, so it should result in a match of 0.
Running the title as the only item in a spreadsheet does indeed return 0.
Ran a subset of 100 of the ~600 in the problematic example job, including the above title, and got results back as expected - it did not match the above title, set confidence to 0, and the only title that did get a 0.7 was one that we DID find in EUPMC.
So it seems this problem no longer exists as the API currently stands.
In the event that none of the identifiers provided allow us to uniquely resolve a record in EPMC, we will fall back to a best-guess matching approach:
(1) An exact title (substring) match, using http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=TITLE:"[title]"
(2) A fuzzy title match, using http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=TITLE:[title]
If we receive exactly one result for either of these, we will take it as the correct item. If we match successfully with (1) we will record a higher confidence in the accuracy of the identification than if we match with (2).
Note that (1) is in fact an exact substring match, so if the text in quotes appears as a substring in multiple titles, it will return multiple results. Since we will be querying on full titles, this should not matter most of the time, but if it fails we will fall back to (2).
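For reference, a small sketch of how those two lookups might be issued; the docs above show the path-style URL, while this sketch uses the equivalent query-parameter form with a JSON response, which is an assumption about the service rather than Lantern's actual client code:

```python
import requests  # assumed HTTP client, not necessarily what Lantern uses

EPMC_SEARCH = "http://www.ebi.ac.uk/europepmc/webservices/rest/search"

def title_lookup(title):
    # Try (1) the exact quoted-title match, then (2) the fuzzy match.
    for query, confidence in (('TITLE:"%s"' % title, 0.9),
                              ('TITLE:%s' % title, 0.7)):
        resp = requests.get(EPMC_SEARCH, params={"query": query, "format": "json"})
        resp.raise_for_status()
        hits = resp.json().get("resultList", {}).get("result", [])
        if len(hits) == 1:            # only accept an unambiguous single hit
            return hits[0], confidence
    return None, 0.0                  # no unambiguous match found
```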