Substring drug names separated by space leads to false positives (e.g. 'Acetyl Norfentanyl' and 'Norfentanyl')

Knoxort commented 5 months ago

We seem to be running into an issue: if one drug name is a substring of another, specifically when the longer name adds a prefix word separated by a space, then both drugs are identified. For example, consider the field below. Our search terms included "Norfentanyl" and "Acetyl Norfentanyl". Norfentanyl doesn't exist on its own in this field, but it was still marked as found with a similarity score of 1. This makes sense, since it is surrounded by tokens that would presumably be discarded, but I would consider this a false positive. Is this a valid reading of the situation? If so, is there any way to address this in the tool, or any logic we could use to address this?

If we were searching for "Fentanyl" as well as any fentanyl analogs, this may happen a lot more. Perhaps if I could access a count of how many times the term appears in the field, this would help to see whether Norfentanyl appeared on its own or as part of a larger phrase.
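To make the counting idea concrete, here is a minimal sketch of how such a count could be computed outside the tool. It is not part of drug-extraction: the function names are hypothetical, matching is exact and case-insensitive (no fuzzy matching), and it assumes search terms start and end with a letter or digit so that `\b` word boundaries behave as expected.

```python
import re


def spans(text: str, term: str) -> list[tuple[int, int]]:
    """Character spans of whole-word, case-insensitive occurrences of `term`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def count_occurrences(text: str, term: str, longer_terms: list[str]) -> tuple[int, int]:
    """Return (total, standalone) counts for `term`.

    An occurrence is "standalone" when it does not lie inside an occurrence
    of any of the longer search terms.
    """
    covering = [s for longer in longer_terms for s in spans(text, longer)]
    total = spans(text, term)
    standalone = [
        (start, end)
        for start, end in total
        if not any(c0 <= start and end <= c1 for c0, c1 in covering)
    ]
    return len(total), len(standalone)


# Made-up field text, for illustration only.
print(count_occurrences("Acetyl Norfentanyl", "Norfentanyl", ["Acetyl Norfentanyl"]))
```

A standalone count of 0 alongside a total of 1 or more would suggest the shorter term only appeared as part of a longer phrase.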

nanthony007 commented 4 months ago

It's important to remember the tool is, first and foremost, a string similarity tool. It has no concept of drugs. Thus when you search for "Norfentanyl", it looks at all the unique unigrams in the text and matches the term against the "Norfentanyl" unigram. It does not care that the word was preceded by "Acetyl". Likewise, this is why "acetyl fentanyl" will only match once, on the bigram.
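A toy illustration of that behaviour, as a sketch only: this is not the tool's code, it uses exact matching instead of a similarity score, and `ngrams` and `exact_hits` are hypothetical names. Each search term is compared only against n-grams with the same word count, so the unigram "norfentanyl" is found regardless of what precedes it.

```python
def ngrams(text: str, n: int) -> set[str]:
    """Return the unique n-word windows of `text`, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def exact_hits(text: str, terms: list[str]) -> list[str]:
    """Return the search terms that equal some n-gram of the same word count."""
    hits = []
    for term in terms:
        n = len(term.split())  # a two-word term is only compared to bigrams
        if term.lower() in ngrams(text, n):
            hits.append(term)
    return hits


# Made-up field text, for illustration only.
field = "acetyl norfentanyl detected"
print(exact_hits(field, ["Norfentanyl", "Acetyl Norfentanyl"]))
```

Both terms come back as hits, because nothing in the unigram pass knows that "norfentanyl" was also consumed by the bigram "acetyl norfentanyl".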

I think your issue brings up two points that are similar but distinct.

1. Search terms of N+1 or more words that contain another, N-word search term as a substring (e.g. acetyl fentanyl and fentanyl).
2. Search terms of the same N-gram length where one is a substring of the other.

The latter case can be solved with the tool by post-processing and filtering on the similarity score or edit distance. A canonical example would be methamphetamine and amphetamine. Assuming "amphetamine" is in the search list, the tool could detect both of these. Usually this case does not occur, because the longer term is too dissimilar to the substring to be detected by the tool, and even if it is detected, it can be removed by post-processing on the metrics.
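As a rough sketch of that kind of post-processing: the row layout below (`search_term` and `matched_text` keys) is hypothetical and not the tool's actual output schema, the normalised Levenshtein similarity is one possible metric and not necessarily the one the tool reports, and the 0.9 threshold is arbitrary.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def keep_close_matches(rows: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop rows whose matched text is too dissimilar to the search term."""
    return [
        row for row in rows
        if similarity(row["search_term"].lower(), row["matched_text"].lower()) >= threshold
    ]


# Made-up rows, for illustration only.
rows = [
    {"search_term": "amphetamine", "matched_text": "amphetamine"},
    {"search_term": "amphetamine", "matched_text": "methamphetamine"},
]
print(keep_close_matches(rows))
```

Here "methamphetamine" is four insertions away from "amphetamine", which gives a similarity of about 0.73, so that row falls below the threshold and is dropped while the exact match is kept.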

The former case is more intricate. Because the tool goes by the current N-gram size, there is no way for it to know that "acetyl fentanyl" and "fentanyl" are both in the search term list. So in this context the tool is functioning correctly. However, I see it is producing duplicate results. Unfortunately the tool doesn't have a concept of "negative" discoveries or "location", so we don't have a way to identify that those subsequent matches are substrings.

Let me put some thought into how to tackle this.

Knoxort commented 4 months ago

What you're saying makes sense. I would say that I only meant to address the N+1-gram point in this issue. I think it's also totally valid to say that this case is beyond the scope of the tool. I brought it up because, based on my limited knowledge of pharmacology, I thought that search terms being subsets of other search terms may have been common enough that the case may have been discussed already, whether by making changes to the tool or with peripheral tools, and I didn't want to reinvent the wheel. I didn't see anything similar in issues, so I figured I'd check.

I had an idea or two, but I'm a bit wary that they're...naive. I may bring them up at the next meeting/after Arjun returns so we can discuss.

nanthony007 commented 4 months ago

Yeah let's discuss at next meeting. I'm definitely interested in this and looking to build another variant of the tool that is more flexible (allowing PDFs, etc) and maybe this could be used there.

I'm tempted to agree, at least for now, that it is beyond the scope of this tool. But I think we can find some pre/post steps that allow for better inference, such as running the tool multiple times or processing the row-level output.
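One way the "multiple passes" idea might look, as a sketch under stated assumptions: `find_spans` is a hypothetical exact-match stand-in for a run of the tool (the real tool does fuzzy matching and does not report locations), and the masking step is a wrapper around it, not an existing feature. Terms are tried from most words to fewest, and any text claimed by a longer term is blanked out before the shorter terms are searched.

```python
import re


def find_spans(text: str, term: str) -> list[tuple[int, int]]:
    """Stand-in matcher: whole-word, case-insensitive exact occurrences."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def longest_first_matches(text: str, terms: list[str]) -> dict[str, int]:
    """Match terms from most words to fewest, masking text already claimed."""
    ordered = sorted(terms, key=lambda t: len(t.split()), reverse=True)
    remaining = text
    found: dict[str, int] = {}
    for term in ordered:
        hits = find_spans(remaining, term)
        if hits:
            found[term] = len(hits)
        for start, end in hits:
            # Same-length mask keeps the other spans' offsets valid.
            remaining = remaining[:start] + "#" * (end - start) + remaining[end:]
    return found


# Made-up field text, for illustration only.
terms = ["Norfentanyl", "Acetyl Norfentanyl", "Fentanyl"]
print(longest_first_matches("Acetyl Norfentanyl and Fentanyl", terms))
```

With this ordering, "Norfentanyl" is only reported when it occurs somewhere other than inside an already matched "Acetyl Norfentanyl".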

Knoxort commented 4 months ago

Gotcha. We'll talk today and then resolve the issue or whatnot from here.

UK-IPOP / drug-extraction

Substring drug names separated by space leads to false positives (e.g. 'Acetyl Norfentanyl' and 'Norfentanyl') #84