ctSkennerton / crass

The CRISPR assembler
http://ctskennerton.github.io/crass
GNU General Public License v3.0

Crass hangs on the patternFinder step #89

Open philippmuench opened 6 years ago

philippmuench commented 6 years ago

For some samples, Crass seems to hang at the crass_patternFinder step (for days, until it gets killed). I tested both fastq and fasta input of the same file. When inspecting the output, Crass appears to run without error and ends with 1m 11s I Writing XML output to "SRS140663_output/crass.crispr". However, no crass.crispr gets written.

I am using the latest version. I think this only happens for large samples with > 10M reads.

```
>crass SRS140663.fasta -o SRS140663_output
[crass_patternFinder]: Processed 10869052 ...34 sec
[crass_clusterCore]: 546 variants mapped to 53 clusters
[crass_clusterCore]: creating non-redundant set
[crass_clusterCore]: 324 non-redundant patterns.
[crass_singletonFinder]: Processed 10869052 ...28 sec
[crass_patternFinder]: Found 7935100 reads
Killed
```
ctSkennerton commented 6 years ago

Which SRA run are you using? When I search for SRS140663 I get multiple results.

philippmuench commented 6 years ago

Thanks for your reply! I uploaded the file to https://drive.google.com/file/d/1wNTirS5H79hgs-l00-gf0prMzpA7lbvz/view?usp=sharing and the log file is available at https://drive.google.com/open?id=1VmkfNz2du7R9De-RHD_SUVWKUBZFXjfD. The behavior is also the same on a different machine.

Thanks again.

ctSkennerton commented 6 years ago

I think it's doing the same thing on my machine. Interestingly, it's in the final XML file writing step, which has never been a problem before. I'll see what I can find.

philippmuench commented 6 years ago

Thank you for looking into it! I have a bunch more files for which this happens; would it help you to see these files too?

ctSkennerton commented 6 years ago

I'm not quite sure what the problem is, but the number of identified reads in the sample you sent is huge (~7,000,000 of ~10,000,000). I've never seen so many reads get found before, and I think the data structures I use are not efficient enough to handle this size. I'm trying to figure out why so many reads are identified in the first place, as I can't believe that 70% of this dataset is CRISPR.

philippmuench commented 6 years ago

Thank you for looking into it! Maybe this is caused by human contamination or adapter/primer sequences in the data. Would it be possible for Crass to throw an error in such a case and terminate?

ctSkennerton commented 6 years ago

It does appear to be eukaryotic contamination. I don't think there is a good way to detect this systematically, but I'll put in a check that stops Crass if too many reads are identified.
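For illustration, such a guard might look like the sketch below. The function name, the 50% cutoff, and the idea of calling it between the pattern-finder and clustering steps are all assumptions for this example, not the actual Crass internals:

```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>

// Hypothetical guard, not the real Crass code: abort before the expensive
// downstream steps if an implausibly large fraction of reads matched the
// pattern finder, which more likely signals contamination than real CRISPR.
void checkFoundReadFraction(size_t foundReads, size_t totalReads,
                            double maxFraction = 0.5) {
    double fraction = static_cast<double>(foundReads) / totalReads;
    if (fraction > maxFraction) {
        std::cerr << "[crass_patternFinder]: " << foundReads << " of "
                  << totalReads << " reads matched the CRISPR patterns ("
                  << fraction * 100 << "%); this usually indicates "
                  << "contamination, aborting.\n";
        std::exit(1);
    }
}
```

In the sample above this would trip at ~73% (7,935,100 of 10,869,052), long before the downstream data structures fill up.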

philippmuench commented 6 years ago

Many thanks for looking into it! Could you describe how you came to the conclusion that this is caused by eukaryotic contamination? I mapped this sample against hg19 without much success, so maybe it's a different contaminant or something else entirely, e.g. adapter sequences.

ctSkennerton commented 6 years ago

When I looked at the file I saw many reads that appeared to have a poly-A tail, and they also appeared to be very similar to each other (i.e. many copies of the same thing). I blasted some of these and they came back as microRNAs from carp and other eukaryotes.
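For anyone wanting to reproduce that observation quickly, a throwaway scan like the one below counts FASTA reads whose 3' end is an A-rich run. The 15 bp window and the one-mismatch allowance are arbitrary choices for this sketch:

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>

// Crude poly-A heuristic: a read counts as poly-A-tailed if at most one
// base in its last 15 bp is not an 'A'.
int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " reads.fasta\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::string line, seq;
    size_t total = 0, polyA = 0;
    auto check = [&](const std::string& s) {
        if (s.empty()) return;
        ++total;
        size_t tail = std::min<size_t>(15, s.size());
        size_t as = 0;
        for (size_t i = s.size() - tail; i < s.size(); ++i)
            if (s[i] == 'A' || s[i] == 'a') ++as;
        if (as + 1 >= tail) ++polyA;  // allow one non-A base in the tail
    };
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            check(seq);  // finish the previous record
            seq.clear();
        } else {
            seq += line;
        }
    }
    check(seq);  // flush the last record
    std::cout << polyA << " of " << total << " reads have a poly-A tail\n";
    return 0;
}
```

A high count here, combined with many near-identical sequences, would be consistent with the microRNA contamination described above.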