Title matching in EPMC should start with exact match and fall back to fuzzy

richard-jones commented 8 years ago

In the event that none of the identifiers provided allow us to uniquely resolve a record in EPMC, we will fall back to a best-guess matching approach:

1. An exact title (substring) match, using http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=TITLE:"[title]"
2. A fuzzy title match, using http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=TITLE:[title]

If we receive exactly one result for either of these, we will take it as the correct item. If we match successfully with (1) we will record a higher confidence in the accuracy of the identification than if we match with (2).

Note that (1) is in fact an exact substring match, so if the text in quotes appears as a substring in multiple titles, it will return multiple results. Since we will be querying on full titles, this should not matter most of the time, but if it fails we will fall back to (2).
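As a rough sketch of the fallback described above (not the actual Lantern code): the function below tries the quoted query first and only runs the unquoted one if the quoted query does not return exactly one result. The `search` callable and the `match_by_title` name are hypothetical; `search` stands in for whatever performs the EPMC request and returns a list of result records.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical: a function that takes an EPMC query string and
# returns the list of matching records (each a dict).
SearchFn = Callable[[str], List[dict]]


def match_by_title(title: str, search: SearchFn) -> Tuple[Optional[dict], str]:
    """Return (record, method), where method is "exact", "fuzzy" or "none"."""
    # (1) Quoted query: exact substring match on the title.
    exact = search(f'TITLE:"{title}"')
    if len(exact) == 1:
        return exact[0], "exact"

    # (2) Unquoted query: fuzzy match, only tried when (1) was not unique.
    fuzzy = search(f"TITLE:{title}")
    if len(fuzzy) == 1:
        return fuzzy[0], "fuzzy"

    # Zero or multiple results from both queries: no identification.
    return None, "none"
```

Passing the search function in as a parameter keeps the cascade testable without calling EPMC.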

markmacgillivray commented 8 years ago

Have changed in dev but not deployed to live. If 1 result from exact "title" match is found, then confidence is 0.9. If 1 result is not returned, a title match (no quotes) is performed, and if that returns 1 result then confidence is 0.8. @richard-jones is this good enough? If so I will close the issue and deploy to live.

richard-jones commented 8 years ago

Yes, that sounds the same as the previous behaviour, though I think we set the confidence lower (0.7 IIRC).

When you've pushed this, could you assign to @emanuil-tolev just to review the documentation and make sure it matches up?

emanuil-tolev commented 8 years ago

EPMC confidence matches in old system:

By any ID (PMCID, PMID, DOI) = 1
By exact substring title match (quoted query) = 0.9
By approximate match (unquoted query) = 0.7
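For reference, the old scoring listed above could be written as a simple lookup. This is only an illustration: the method names (`"id"`, `"exact_title"`, `"fuzzy_title"`) and the `confidence_for` helper are made up for this sketch, and the default of 0 for "nothing found" is an assumption.

```python
# Old-system confidence per match method (values from the list above).
OLD_CONFIDENCE = {
    "id": 1.0,           # matched by PMCID, PMID or DOI
    "exact_title": 0.9,  # quoted title query
    "fuzzy_title": 0.7,  # unquoted title query
}


def confidence_for(method: str) -> float:
    """Look up the confidence for a match method; unknown or no match gives 0."""
    return OLD_CONFIDENCE.get(method, 0.0)
```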

We don't have to follow this old scoring. The inaccurate results I've seen on inexact matches from EPMC do make me think we want to set it to 0.6 or 0.7 to reflect the quality of the results.

richard-jones commented 8 years ago

Could you also prod me when this is rolled out to live, so I can respond to the original email from Wellcome.

markmacgillivray commented 8 years ago

Changed to use 1, 0.9 and 0.7

markmacgillivray commented 8 years ago

@richard-jones this is on live now, closing

emanuil-tolev commented 8 years ago

thanks - I just need to double check docs

emanuil-tolev commented 8 years ago

Actually the code now exactly matches the docs, which always did imply the old-style behaviour (confidence 1, 0.9 or 0.7).

emanuil-tolev commented 8 years ago

A bit more feedback:

something doesn't seem right. I can see that in the Correct Article Confidence column we have 0.9, 0.7 and 0 which is good. However, all the rows where it says 0.7 actually have no article data associated with them. If no article/article IDs are found, then I think the Correct Article Confidence score should be 0.

Could you look into this, and make it so that if no article is found the Confidence Score is 0, and if an article is found via the title keyword search (i.e. TITLE:no quote marks) it is 0.7?

Job https://compliance.cottagelabs.com/#vnLgjhGX69mBuK8Np

Sounds pretty simple, just a result of the 1 - 0.9 - 0.7 cascade, should have a 0 if no results at all. I'll make a PR as soon as I've redone my other PR, which will hopefully be shortly.

emanuil-tolev commented 8 years ago

@markmacgillivray this is very odd. The quoted job https://compliance.cottagelabs.com/#vnLgjhGX69mBuK8Np has both articles with 0 confidence and articles with 0.7. But we haven't got identifiers for either the 0s or the 0.7s - they all have notes like this:

Unable to locate article in EPMC.
Unable to obtain DOI, PMID or PMCID for this article. Compliance information may be severely limited.
Not attempting Crossref or CORE lookups - do not have DOI for article.
Not attempting Grist API grant lookups since no grants data was obtained from EUPMC.
Not checking ahead of print status on pubmed. We don't have the article's PMID.
Not attempting to add any data from Sherpa Romeo - don't have a journal ISSN to use for lookup.
Unable to retrieve licence data via article publisher splash page (used to be OAG) - cannot obtain a suitable URL to run the licence detection on.

None of them have PMCID, PMID or DOI. If it rings a bell as to potentially why, let me know, otherwise I'll keep digging - seems quite odd to have essentially the same totally negative result and two different lookup confidence scores.

richard-jones commented 7 years ago

@emanuil-tolev @markmacgillivray - is this fixed?

markmacgillivray commented 7 years ago

I don't know - looks like the issue is waiting for further input from @emanuil-tolev

markmacgillivray commented 7 years ago

@emanuil-tolev are you going to do this?

I have looked at the code: confidence starts at 0, so for things where we find nothing more, it stays at 0. However, if we get as far as doing a non-exact title match against EPMC and we find something that COULD be the article, the confidence is set to 0.7. If we are finding something in EPMC but it somehow does not have a PMCID, then we could have a record with a confidence of 0.7 and yet no PMCID. Does this matter?

emanuil-tolev commented 7 years ago

> then we could have a record with a confidence of 0.7 and yet no PMCID. Does this matter?

Yeah, I reckon that might matter, and any similar cases (if there are any). The original feedback above https://github.com/CottageLabs/LanternPM/issues/106#issuecomment-242399920 states exactly that - the article should appear as a 0 to the user if we have not actually been able to get article data.

It's a bit odd otherwise, a 0.7 with no identifiers .. so what are we 0.7 confident in?

markmacgillivray commented 7 years ago

If it is possible for EPMC to have records in it that do not have PMC IDs or PMIDs or DOIs, then a non-exact title match may return back some sort of record with a similar title but no PMC IDs or PMIDs or DOIs. So we would be 0.7 confident that the record EPMC returned is correct, but we would have no useful information to do anything further with.
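One possible way to handle that case, sketched under assumptions rather than taken from the Lantern code: treat a matched record with no usable identifier as "not found" and report 0 to the user. The `effective_confidence` helper is hypothetical, and the field names `pmcid`, `pmid` and `doi` are assumed to be the keys on the record returned by EPMC.

```python
from typing import Optional

# Assumed identifier keys on an EPMC result record.
ID_FIELDS = ("pmcid", "pmid", "doi")


def effective_confidence(record: Optional[dict], confidence: float) -> float:
    """Return the confidence to show the user.

    If no record was found, or the record carries none of the
    identifiers we need for further lookups, report 0.
    """
    if record is None:
        return 0.0
    if not any(record.get(field) for field in ID_FIELDS):
        return 0.0
    return confidence
```

With this guard, a fuzzy title match that yields a record without identifiers would show 0 rather than 0.7.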

markmacgillivray commented 7 years ago

I have downloaded the compliance job and will test with the values they are referring to. I am already doing work on the lantern backend anyway, and want to get these issues sorted. Will reassign to myself for further investigation.

markmacgillivray commented 7 years ago

I selected one that had a title match confidence of 0.7 but no match in eupmc. Running a fuzzy match on eupmc for that title returns 4 items, none of which are right, so it should result in a confidence of 0.

https://dev.api.cottagelabs.com/use/europepmc/search/TITLE:A%20review%20of%20frontostriatal%20and%20frontocortical%20brain%20abnormalities%20in%20children

Running the title as the only item in a spreadsheet does indeed return 0.

Ran a subset of 100 of the ~600 in the example problematic job, including the above title, and got results back as expected - did not match the above title, set confidence to 0, and the only title that did get a 0.7 was one that we DID find in eupmc.

So it seems this problem no longer exists as the API currently stands.
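For anyone wanting to reproduce the check, here is a small sketch of how the query part of the URL above can be built from a title. The `title_query` helper is hypothetical; it only builds the `TITLE:...` path segment and does not call any API.

```python
from urllib.parse import quote


def title_query(title: str, exact: bool = False) -> str:
    """Build the percent-encoded TITLE query for a title.

    exact=True wraps the title in double quotes (exact substring match);
    exact=False leaves it unquoted (fuzzy match).
    """
    term = f'"{title}"' if exact else title
    # safe="" so that every reserved character in the title is encoded.
    return "TITLE:" + quote(term, safe="")


title = ("A review of frontostriatal and frontocortical "
         "brain abnormalities in children")
print(title_query(title))              # fuzzy form, as in the URL above
print(title_query(title, exact=True))  # quoted form, quotes encoded as %22
```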
