Closed by luketudge 4 years ago
Thanks for finding this problem, @luketudge! Sorry to take forever to get around to this, but I believe this is now fixed by 44d3252, which cleans up the code that was created when merging with synthesisr.
Thanks! Nicely fixed. And neater too, since the feature columns of the DFM are now simply those requested with the features argument to create_dfm().
dfm <- create_dfm(
  elements = c(
    "Black-backed woodpecker occupancy in burned and beetle-killed forests",
    "Burnt and black-backed woodpeckerless forests: A sad prospect",
    "Can black-backed woodpeckers get sunburn?"
  ),
  features = c("black-backed woodpecker", "burn")
)
colnames(dfm)
[1] "black-backed woodpecker" "burn"
I see that create_dfm() does some stemming and partial matching, which is very useful. But in some cases it is overzealous and adds in new unwanted features. Here is an example:

In addition to the requested terms, we get any term that contains a requested term. This might be desirable in a lot of cases, but even when it is, any derived terms that are matched this way should probably not appear as new terms, but instead be counted as instances of the root term.
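
To make the suggestion concrete, here is a rough base-R sketch of the behaviour I have in mind. It is not how create_dfm() is implemented: count_features() is a hypothetical helper invented for illustration, and it uses plain case-insensitive substring matching rather than real stemming. The point is only that derived forms such as "burnt" or "sunburn" add to the count of the requested term "burn" instead of becoming columns of their own.

```r
# Hypothetical helper for illustration only; not part of any package.
# Counts case-insensitive substring matches of each requested feature
# in each element, so derived terms count towards the root term.
count_features <- function(elements, features) {
  lowered <- tolower(elements)
  counts <- vapply(
    features,
    function(feature) {
      hits <- gregexpr(tolower(feature), lowered, fixed = TRUE)
      # gregexpr() reports -1 when nothing matches, so only
      # positive start positions are counted as matches.
      vapply(hits, function(h) sum(h > 0), integer(1))
    },
    integer(length(elements))
  )
  # One row per element, one column per requested feature.
  matrix(counts, nrow = length(elements), dimnames = list(NULL, features))
}

counts <- count_features(
  elements = c(
    "Black-backed woodpecker occupancy in burned and beetle-killed forests",
    "Burnt and black-backed woodpeckerless forests: A sad prospect",
    "Can black-backed woodpeckers get sunburn?"
  ),
  features = c("black-backed woodpecker", "burn")
)
colnames(counts)
```

With this approach the column names are exactly the requested features, and each of the three titles contributes one match to each of them.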