elizagrames / litsearchr

litsearchr is an R package to partially automate search term selection for systematic reviews using keyword co-occurrence networks. In addition to identifying search terms, it can write Boolean searches and translate them into over 50 languages.
https://elizagrames.github.io/litsearchr

Unwanted extra features in dfm? #40

Closed by luketudge 4 years ago

luketudge commented 4 years ago

I see that create_dfm() does some stemming and partial matching, which is very useful. But in some cases it is overzealous and adds new, unwanted features to the DFM. Here is an example:

library(litsearchr)

dfm <- create_dfm(
  elements = c(
    "Black-backed woodpecker occupancy in burned and beetle-killed forests",
    "Burnt and black-backed woodpeckerless forests: A sad prospect",
    "Can black-backed woodpeckers get sunburn?"
  ),
  features = c("black-backed woodpecker", "burn")
)

print(dfm$dimnames$Terms)
[1] "black-backed woodpecker" "burned" "black-backed woodpeckerless" "burnt" "sunburn"

In addition to the requested terms, we get every term that contains a requested term. This might be desirable in a lot of cases, but even then, derived terms matched this way should probably not appear as new features; they should instead be counted as instances of the root term.
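To make the suggested behaviour concrete, here is a rough base-R sketch. It is not litsearchr's implementation: `count_root_terms()` is a hypothetical helper, and it assumes plain case-insensitive substring matching rather than the stemming that create_dfm() does. Every derived form is credited to the requested feature that it contains, so the columns are exactly the requested features.

```r
# Hypothetical helper (not part of litsearchr): counts substring matches of each
# requested feature, so derived forms are credited to the root term.
count_root_terms <- function(elements, features) {
  elements <- tolower(elements)
  counts <- sapply(features, function(feature) {
    # gregexpr() returns match positions per element, or -1 when nothing matches
    hits <- gregexpr(feature, elements, fixed = TRUE)
    sapply(hits, function(h) sum(h > 0))
  })
  # Force a documents-by-features matrix even when there is only one element
  matrix(counts, nrow = length(elements), dimnames = list(NULL, features))
}

elements <- c(
  "Black-backed woodpecker occupancy in burned and beetle-killed forests",
  "Burnt and black-backed woodpeckerless forests: A sad prospect",
  "Can black-backed woodpeckers get sunburn?"
)
features <- c("black-backed woodpecker", "burn")

counts <- count_root_terms(elements, features)
colnames(counts)
```

With this approach "burned", "burnt" and "sunburn" each add one to the "burn" column instead of creating columns of their own. Whether "sunburn" should count towards "burn" at all is a separate question that substring matching cannot answer.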

elizagrames commented 4 years ago

Thanks for finding this problem, @luketudge! Sorry it took forever to get around to this, but I believe this is now fixed by 44d3252, which cleans up the code that was created when merging with synthesisr.

luketudge commented 4 years ago

Thanks! Nicely fixed. And neater too, since the feature columns of the DFM are now simply those requested with the features argument to create_dfm().

dfm <- create_dfm(
  elements = c(
    "Black-backed woodpecker occupancy in burned and beetle-killed forests",
    "Burnt and black-backed woodpeckerless forests: A sad prospect",
    "Can black-backed woodpeckers get sunburn?"
  ),
  features = c("black-backed woodpecker", "burn")
)
colnames(dfm)
[1] "black-backed woodpecker" "burn"