elizagrames / litsearchr

litsearchr is an R package to partially automate search term selection for systematic reviews using keyword co-occurrence networks. In addition to identifying search terms, it can write Boolean searches and translate them into over 50 languages.
https://elizagrames.github.io/litsearchr

hyphens in check_recall() #47

Open luketudge opened 3 years ago

luketudge commented 3 years ago

Commit 23032fc8bcbf9e79f810c758fde3a52931b941f2 changes the behavior of check_recall() so that it removes punctuation and ignores case, which improves matching in most cases.

But this seems to introduce a new problem when a target title and a retrieved title differ only in using a space instead of a hyphen, which I guess could be a reasonably common discrepancy in some fields. If the target title contains a hyphen near the beginning of the string where the true match contains a space, removing the hyphen shifts every subsequent character in the target by one position, so the target no longer aligns with the true match. The target can then score a better match against a very different title that happens to align with a small part of it once hyphens are removed.

Maybe clearer with an example:

target <- c("Black-backed woodpecker research and the hyphen controversy: A review.")
titles <- c("Black backed woodpecker research and the hyphen controversy: A review.",
            "Irrelevant but same-length titles in the hyphen controversy: A review.")
check_recall(target, titles)
     Title                                                                   
[1,] "Black-backed woodpecker research and the hyphen controversy: A review."
     Best_Match                                                               Similarity         
[1,] "Irrelevant but same-length titles in the hyphen controversy: A review." "0.492537313432836"
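To make the mis-alignment itself visible, here is a minimal base R sketch. I have not checked how check_recall() computes similarity internally, so positional_agreement() is a hypothetical helper of my own (not part of litsearchr) that assumes a plain position-by-position comparison, purely for illustration.

```r
# Hypothetical helper (NOT part of litsearchr): fraction of character
# positions at which two strings agree, compared over the shorter length.
positional_agreement <- function(a, b) {
  a_chars <- strsplit(a, "", fixed = TRUE)[[1]]
  b_chars <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- min(length(a_chars), length(b_chars))
  mean(a_chars[seq_len(n)] == b_chars[seq_len(n)])
}

target     <- "black-backed woodpecker research"
true_match <- "black backed woodpecker research"

# Deleting the hyphen shifts every later character one position to the left.
stripped <- gsub("-", "", target, fixed = TRUE)

# With the hyphen removed, agreement collapses after the first word.
agreement_stripped <- positional_agreement(stripped, true_match)

# With the hyphen kept, only the hyphen position itself disagrees.
agreement_kept <- positional_agreement(target, true_match)

print(c(stripped = agreement_stripped, kept = agreement_kept))
```

Under that assumed comparison, a single deleted character near the start is enough to make almost every later position disagree, which is the effect described above.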

Perhaps the better thing to do here is to just leave hyphens untouched by using litsearchr::remove_punctuation() with preserve_punctuation = c("-") instead of tm::removePunctuation()?
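A rough sketch of what I mean, in base R only. normalise_keep_hyphens() is a hypothetical stand-in that I wrote for this illustration; it is not litsearchr::remove_punctuation() and may not behave the same way in every case.

```r
# Hypothetical stand-in for the proposed behaviour (NOT litsearchr code):
# lowercase the title and drop punctuation, but leave hyphens in place.
normalise_keep_hyphens <- function(x) {
  gsub("[^[:alnum:][:space:]-]", "", tolower(x))
}

target <- "Black-backed woodpecker research and the hyphen controversy: A review."
title  <- "Black backed woodpecker research and the hyphen controversy: A review."

a <- normalise_keep_hyphens(target)
b <- normalise_keep_hyphens(title)

# Both strings keep the same length, so later characters stay aligned.
same_length <- nchar(a) == nchar(b)

# Edit distance between the two normalised titles (base R, utils package).
edit_distance <- utils::adist(a, b)[1, 1]

print(a)
print(b)
print(c(same_length = same_length, edit_distance = edit_distance))
```

With hyphens preserved, the two titles differ only at the position of the hyphen itself, so the rest of the string still lines up with the true match.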