Bioconductor / ShortRead


countFastq can overflow #10

Closed pfh closed 1 year ago

pfh commented 1 year ago

countFastq can overflow and return a negative number of bases.

It looks like the count_records C function returns results as 32-bit integers. It may be better to return them as doubles, which can represent integers up to 2^53 exactly.

Example:

n <- 3000000  # number of reads
m <- 1000     # read length
dna <- strrep("A", m)    # one read of m bases
qual <- strrep("J", m)   # matching quality string
sink("example.fastq")    # redirect cat() output to the file
for (i in seq_len(n))
    cat(paste0("@read", i, "\n", dna, "\n+\n", qual, "\n"))
sink()                   # restore normal output

library(ShortRead)
countFastq("example.fastq")

#Result:
#              records nucleotides      scores
# example.fastq   3e+06 -1294967296 -1294967296
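
The negative value is consistent with a 3e9 total wrapping around a signed 32-bit integer. A minimal sketch of that arithmetic in plain R follows, assuming two's-complement wraparound in the C code. `wrap_int32` is a hypothetical helper written for illustration, not part of ShortRead.

```r
# Hypothetical helper (not part of ShortRead): reinterpret a whole-number
# double as a wrapped signed 32-bit integer, mimicking C int overflow
# under the assumption of two's-complement wraparound.
wrap_int32 <- function(x) ((x + 2^31) %% 2^32) - 2^31

total_bases <- 3000000 * 1000   # 3e9, computed in double precision

.Machine$integer.max            # 2147483647, largest 32-bit signed value
wrap_int32(total_bases)         # -1294967296, same as the reported output

# Doubles hold whole numbers exactly up to 2^53, far beyond 3e9.
total_bases == 3e9              # TRUE
2^53 + 1 == 2^53                # TRUE: exactness is lost past 2^53
```

Since 3e9 is well below 2^53, accumulating the counts as doubles would avoid the wraparound for any realistic FASTQ file.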

P.S. This is a very handy function, thank you! Much better than needing a command line tool to get fastq statistics.

mtmorgan commented 1 year ago

Thanks Paul; if you have a chance, can you test the following?

BiocManager::install("Bioconductor/ShortRead", ref = "issue-10")

pfh commented 1 year ago

Thanks, it now works as I expected.

              records nucleotides scores
example.fastq   3e+06       3e+09  3e+09