Bioconductor / BiocCheck

http://bioconductor.org/packages/BiocCheck

description length check / false positive #160

Closed (lshep closed 2 years ago)

lshep commented 2 years ago

I'm not sure how this is determined, but I'm getting a false-positive NOTE on the description length for factR, which I am reviewing. The DESCRIPTION (truncated) is included below:

Package: factR
Title: Functional Annotation of Custom Transcriptomes
Version: 0.99.4
Authors@R: person("Fursham", "Hamid", 
    email = "fursham.h@gmail.com", 
    role = c("aut", "cre"))
Description: factR contain tools to process and interact with custom-assembled 
    transcriptomes (GTF). At its core, factR constructs CDS information on 
    custom transcripts and subsequently predicts its functional output. 
    In addition,  factR has tools capable of plotting transcripts, 
    correcting chromosome and gene information and shortlisting new transcripts.
Depends: R (>= 4.2)
biocViews: AlternativeSplicing, FunctionalPrediction, GenePrediction

It is technically three sentences, but I am getting the NOTE:

* NOTE: The Description field in the DESCRIPTION is made up by less than 3
sentences. Please consider expanding this field, and structure it as a
full paragraph

LiNk-NY commented 2 years ago

Hi Lori (@lshep), it looks like the sentence splitter was only matching alphanumeric characters, so it missed the sentence ending in "(GTF).". I fixed it in 1.33.5 (commit 643e8b58442bdf3b2f32618be573713eedb59a26). Best, Marcel
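
For illustration, here is a minimal sketch of how this kind of miss can happen. Neither pattern below is taken from the BiocCheck source; `strict`, `relaxed`, and `count_sentences()` are hypothetical names used only for this example, and the text is the factR Description quoted above.

```r
# Illustrative only: neither pattern is taken from the BiocCheck source.
desc <- paste(
    "factR contain tools to process and interact with custom-assembled",
    "transcriptomes (GTF). At its core, factR constructs CDS information on",
    "custom transcripts and subsequently predicts its functional output.",
    "In addition,  factR has tools capable of plotting transcripts,",
    "correcting chromosome and gene information and shortlisting new transcripts."
)

# Hypothetical strict pattern: the terminator must follow a letter or digit.
strict <- "[[:alnum:]][.!?][[:space:]]+"
# Hypothetical relaxed pattern: any terminator followed by whitespace.
relaxed <- "[.!?][[:space:]]+"

# Count the pieces produced by splitting on a sentence-boundary pattern.
count_sentences <- function(x, pattern) lengths(strsplit(x, split = pattern))

count_sentences(desc, strict)   # 2: the boundary after "(GTF)." is missed
count_sentences(desc, relaxed)  # 3
```

With the strict pattern the closing parenthesis before the period hides the first boundary, so the count drops below three and the NOTE fires.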

hpages commented 2 years ago

Counting natural language sentences is a harder problem than it sounds: https://stackoverflow.com/questions/12602652/how-to-count-the-number-of-sentences-in-a-text-in-r

LiNk-NY commented 2 years ago

Ah, thanks for that! It looks like that is the code that was originally used. Hopefully, maintainers won't use "Dr. Name" in their DESCRIPTION :)

hpages commented 2 years ago

Yeah hopefully, but it's not uncommon for packages to use things like e.g. or i.e. or to cite papers in their description. All these things tend to generate false positives. E.g. the BRAIN package:

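# Download the VIEWS file (DCF format) for all Bioconductor 3.16 software packages
views <- read.dcf(url("https://bioconductor.org/packages/3.16/bioc/VIEWS"))
rownames(views) <- views[ , "Package"]
descs <- views[ , "Description"]
# Naive count: split on a sentence terminator followed by whitespace
sentence_counts <- lengths(strsplit(descs, split="[.!?][[:space:]]+"))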

cat(descs[["BRAIN"]], "\n")
# Package for calculating aggregated isotopic distribution
# and exact center-masses for chemical substances (in this
# version composed of C, H, N, O and S). This is an
# implementation of the BRAIN algorithm described in the paper by
# J. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg. 

sentence_counts[["BRAIN"]]
# [1] 6

Some interesting stats:

table(sentence_counts)
# sentence_counts
#   1   2   3   4   5   6   7   8   9  10  11  13  14  15  16 
# 631 400 483 279 102  93  53  38  14   6   7   4   1   4   1

LiNk-NY commented 2 years ago

Yes, it looks quite involved to parse these out. But I think citing a paper is better than the two sentences they provide so I can live with this result.
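
To give a sense of what "parsing these out" could involve, here is a rough heuristic sketch that strips the periods from initials and a few abbreviations before counting. It is an assumption-laden illustration, not BiocCheck code: `count_sentences_naive()`, `count_sentences_heuristic()`, and the abbreviation list are all hypothetical, and the text is the BRAIN Description shown above (without its trailing whitespace).

```r
# Heuristic sketch; the function names and the abbreviation list are
# hypothetical and not part of BiocCheck.
brain <- paste(
    "Package for calculating aggregated isotopic distribution",
    "and exact center-masses for chemical substances (in this",
    "version composed of C, H, N, O and S). This is an",
    "implementation of the BRAIN algorithm described in the paper by",
    "J. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg."
)

# Same naive split as in the comment above.
count_sentences_naive <- function(x) {
    lengths(strsplit(x, split = "[.!?][[:space:]]+"))
}

count_sentences_heuristic <- function(x) {
    # Drop the period after single-capital initials such as "J." or "P."
    x <- gsub("\\b([A-Z])\\.", "\\1", x, perl = TRUE)
    # Drop the final period of a few common abbreviations (assumed list)
    x <- gsub("\\b(e\\.g|i\\.e|et al|Dr|vs)\\.", "\\1", x, perl = TRUE)
    count_sentences_naive(x)
}

count_sentences_naive(brain)      # 6
count_sentences_heuristic(brain)  # 2
```

This trades one kind of error for another: a sentence that really ends in a single capital letter (for example "vitamin C.") would now be merged with the next one, and a capitalized "E.g." is not covered by the list.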

mtmorgan commented 2 years ago

Just for fun, I followed the link in @hpages' comment. It turns out that posts from 2012 are not always still useful; a comment on the top-voted answer links to answers updated in 2014. The answer using qdap produces

> library(qdap)
> sent_detect(descs[["BRAIN"]])
[1] "Package for calculating aggregated isotopic distribution and exact center-masses for chemical substances ."
[2] "This is an implementation of the BRAIN algorithm described in the paper by J."
[3] "Claesen, P."
[4] "Dittwald, T."
[5] "Burzykowski and D."
[6] "Valkenborg."

so not much help (but a lot of packages installed!). Using {openNLP} we get

> library(NLP)
> library(openNLP)
> s = as.String(descs[["BRAIN"]])
> sent_token_annotator <- Maxent_Sent_Token_Annotator()
> a1 <- annotate(s, sent_token_annotator)
> a1
 id type     start end features
  1 sentence     1 152
  2 sentence   154 286
> ## Extract sentences.
> s[a1]
[1] "Package for calculating aggregated isotopic distribution\nand exact center-masses for chemical substances (in this\nversion composed of C, H, N, O and S)."
[2] "This is an\nimplementation of the BRAIN algorithm described in the paper by\nJ. Claesen, P. Dittwald, T. Burzykowski and D. Valkenborg."

Which is pretty neat (but requires rJava / Java, which was a little challenging on my macOS)!

I tried to follow the Python suggestion in the second answer to the original question, but couldn't clearly parse the code to just 'copy-and-paste' an answer.

LiNk-NY commented 2 years ago

> I tried to follow the Python suggestion in the second answer to the original question, but couldn't clearly parse the code to just 'copy-and-paste' an answer.

To be fair, there was a warning at http://blog.thegrandlocus.com/2012/06/The-elements-of-style :grin:

> Be warned, though that the script will most likely not run with other inputs in such an easy way.