Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
54 stars 16 forks source link

trinucleotideFrequency Segmentation Fault if Input Data is Large #52

Closed DarioS closed 3 years ago

DarioS commented 3 years ago

My aim is to compute entropy of all of the unmapped reads for each patient sample of whole genome sequencing data and remove low-complexity sequences before doing a microbiome analysis. Filtering works for the smaller samples, but segmentation faults on the larger ones. I use R 4.1.0 and Biostrings 2.60.1. Ihave narrowed the crashing down to trinucleotideFrequency.

> testCase <- readFastq("OSCC_1-N_unmapped_all.fastq.gz")
testCase 
> testCase
class: ShortReadQ
length: 48324630 reads; width: 150 cycles
> trinucleotideFrequency(sread(testCase))

 *** caught segfault ***
address 0x7fa808276fb8, cause 'memory not mapped'

Traceback:
 1: .Call2("XStringSet_oligo_frequency", x, width, step, as.prob,     as.array, fast.moving.side, with.labels, simplify.as, base_codes,     PACKAGE = "Biostrings")
 2: .local(x, width, step, as.prob, as.array, fast.moving.side, with.labels,     ...)
 3: oligonucleotideFrequency(x, 3L, step = step, as.prob = as.prob,     as.array = as.array, fast.moving.side = fast.moving.side,     with.labels = with.labels, ...)
 4: oligonucleotideFrequency(x, 3L, step = step, as.prob = as.prob,     as.array = as.array, fast.moving.side = fast.moving.side,     with.labels = with.labels, ...)
 5: trinucleotideFrequency(sread(testCase))
vjcitn commented 3 years ago

Can you make the offending data available?

DarioS commented 3 years ago

I do one better and write a minimal, reproducible example. Copy and paste away!

library(stringi)
library(Biostrings)
reads <- DNAStringSet(stri_rand_strings(40000000, 150, pattern = "[GATC]"))
trinucleotideFrequency(reads) # *** caught segfault ** address 0x7f498a1c4040, cause 'memory not mapped'
hpages commented 3 years ago

Crash even faster with:

library(Biostrings)
reads <- rep(DNAStringSet("GGACGTCC"), 5e7) 
res <- trinucleotideFrequency(reads)

Should be fixed in Biostrings 2.61.2 (BioC 3.14) and Biostrings 2.60.2 (BioC 3.13). See commit 050d5722d82f950c7c5dcb138726604e1446810d

H.