Closed lshep closed 2 years ago
Hi Lori, @lshep
It looks like the sentence splitter was only seeing alphanumeric characters and missed the "(GTF)." ending. I fixed it in 1.33.5.
643e8b58442bdf3b2f32618be573713eedb59a26
Best,
Marcel
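For anyone following along, here is a minimal sketch of how a miss like this can happen. Both patterns and the example text are hypothetical illustrations, not the actual code before or after the fix: the first pattern assumes a sentence end must directly follow an alphanumeric character, so a period after a closing parenthesis is not counted.

```r
# Hypothetical example text; not taken from any real DESCRIPTION file.
desc <- "Import annotations from a file (GTF). Build a model. Plot results."

# Assumed "strict" pattern: the period must directly follow an alphanumeric
# character. (It also consumes that character, which is fine for counting.)
strict <- "[[:alnum:]][.!?][[:space:]]+"

# Assumed "loose" pattern: any sentence-ending punctuation followed by whitespace.
loose <- "[.!?][[:space:]]+"

n_strict <- lengths(strsplit(desc, strict))  # misses the break after "(GTF)."
n_loose  <- lengths(strsplit(desc, loose))   # also splits after "(GTF)."

print(c(strict = n_strict, loose = n_loose))
```

With this text the strict pattern finds two sentences and the loose one finds three.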
Counting natural language sentences is a harder problem than it sounds: https://stackoverflow.com/questions/12602652/how-to-count-the-number-of-sentences-in-a-text-in-r
Ah, thanks for that! It looks like that is the code that was originally used. Hopefully, maintainers won't use "Dr. Name" in their DESCRIPTION :)
Yeah, hopefully, but it's not uncommon for packages to use things like "e.g." or "i.e.", or to cite papers in their description. All these things tend to generate false positives. E.g. the BRAIN package:
views <- read.dcf(url("https://bioconductor.org/packages/3.16/bioc/VIEWS"))
rownames(views) <- views[ , "Package"]
descs <- views[ , "Description"]
sentence_counts <- lengths(strsplit(descs, split="[.!?][[:space:]]+"))
cat(descs[["BRAIN"]], "\n")
# Package for calculating aggregated isotopic distribution
# and exact center-masses for chemical substances (in this
# version composed of C, H, N, O and S). This is an
# implementation of the BRAIN algorithm described in the paper by
# J. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg.
sentence_counts[["BRAIN"]]
# [1] 6
Some interesting stats:
table(sentence_counts)
# sentence_counts
#   1   2   3   4   5   6   7   8   9  10  11  13  14  15  16
# 631 400 483 279 102  93  53  38  14   6   7   4   1   4   1
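One way to cut down on these false positives is to mask periods that are unlikely to end a sentence before splitting. The following is only a rough sketch: `count_sentences()`, its abbreviation list, and the `<DOT>` placeholder are assumptions made for illustration, not part of any existing package. It will also miss real sentence ends that follow a lone capital letter (e.g. "vitamin C. Then ...").

```r
# Heuristic sentence counter (hypothetical helper, not an existing API).
count_sentences <- function(text, abbrevs = c("e.g.", "i.e.", "et al.", "Dr.")) {
  # Collapse line breaks and runs of whitespace into single spaces.
  text <- gsub("[[:space:]]+", " ", trimws(text))
  # Mask the periods inside known abbreviations.
  for (abbr in abbrevs) {
    masked <- gsub(".", "<DOT>", abbr, fixed = TRUE)
    text <- gsub(abbr, masked, text, fixed = TRUE)
  }
  # Mask the period after a single capital-letter initial, e.g. "J. Claesen".
  text <- gsub("\\b([A-Z])\\.", "\\1<DOT>", text, perl = TRUE)
  # Split on the remaining sentence-ending punctuation and count the pieces.
  lengths(strsplit(text, "[.!?][[:space:]]+"))
}

# The BRAIN description quoted in this thread.
brain <- paste(
  "Package for calculating aggregated isotopic distribution",
  "and exact center-masses for chemical substances (in this",
  "version composed of C, H, N, O and S). This is an",
  "implementation of the BRAIN algorithm described in the paper by",
  "J. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg.",
  sep = "\n"
)

count_sentences(brain)
# [1] 2
```

This is still a heuristic; any fixed abbreviation list will be incomplete.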
Yes, it looks quite involved to parse these out. But I think citing a paper is better than the two sentences they provide, so I can live with this result.
Just for fun, I followed the link in @hpages' comment. It turns out that posts from 2012 may no longer be useful, and there is a link, in the comment on the top-voted answer, to answers updated in 2014. The answer using qdap produces
> sent_detect(descs[["BRAIN"]])
[1] "Package for calculating aggregated isotopic distribution and exact center-masses for chemical substances ."
[2] "This is an implementation of the BRAIN algorithm described in the paper by J."
[3] "Claesen, P."
[4] "Dittwald, T."
[5] "Burzykowski and D."
[6] "Valkenborg."
so not much help (but a lot of packages installed!). Using {openNLP} we get
> s = as.String(descs[["BRAIN"]])
> sent_token_annotator <- Maxent_Sent_Token_Annotator()
> a1 <- annotate(s, sent_token_annotator)
> a1
 id type     start end features
  1 sentence     1 152
  2 sentence   154 286
> ## Extract sentences.
> s[a1]
[1] "Package for calculating aggregated isotopic distribution\nand exact center-masses for chemical substances (in this\nversion composed of C, H, N, O and S)."
[2] "This is an\nimplementation of the BRAIN algorithm described in the paper by\nJ. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg."
Which is pretty neat (but requires rJava / Java, which was a little challenging on my macOS)!
I tried to follow the Python suggestion in the second answer to the original question, but couldn't clearly parse the code to just 'copy-and-paste' an answer.
It was a fair warning at http://blog.thegrandlocus.com/2012/06/The-elements-of-style :grin:
"Be warned, though that the script will most likely not run with other inputs in such an easy way."
I'm not sure how this is determined, but I'm getting a false-positive NOTE on the description length for factR, which I am reviewing. Truncated, but DESCRIPTION included:
It is technically three sentences, but I am getting the NOTE.