Open philippmuench opened 6 years ago
Which SRA run are you using? When I search for SRS140663, I get multiple results.
Thanks for your reply! I uploaded the file to https://drive.google.com/file/d/1wNTirS5H79hgs-l00-gf0prMzpA7lbvz/view?usp=sharing and the log file is available here: https://drive.google.com/open?id=1VmkfNz2du7R9De-RHD_SUVWKUBZFXjfD. The behavior is the same on a different machine.
Thanks again.
I think it's doing the same thing on my machine. Interestingly, it hangs in the final XML file writing step, which has never been a problem before. I'll see what I can find.
Thank you for looking into it! I have a bunch more files for which this happens. Would it help you to see these files too?
I'm not quite sure what the problem is, but the number of identified reads in the sample you sent is huge (~7,000,000 of ~10,000,000). I've never seen so many reads get found before, and I think the data structures I use are not efficient enough to handle this size. I'm trying to figure out why so many reads are identified in the first place, as I can't believe that 70% of this dataset is CRISPR.
Thank you for looking into it! Yes, maybe this is caused by human contamination or adapter/primer sequences in the data. Would it be possible for Crass to throw an error in such a case and terminate?
It does appear to be eukaryotic contamination. I don't think there is a good way to detect this systematically, but I'll put in a check that stops the run if too many reads are identified.
Many thanks for looking into it! Could you describe how you came to the conclusion that this is caused by eukaryotic contamination? I mapped these samples against hg19 without much success, so maybe it's a different contamination or something else, e.g. adapter sequences.
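The kind of check described here could look roughly like the following. This is only a sketch of the idea, not Crass's actual code: the function name `too_many_hits` and the 50% cutoff are assumptions chosen for illustration.

```python
def too_many_hits(identified: int, total: int, max_fraction: float = 0.5) -> bool:
    """Return True if the share of reads flagged as CRISPR-containing is implausibly high.

    max_fraction is a hypothetical cutoff, not a value taken from Crass.
    """
    if total <= 0:
        raise ValueError("total must be a positive read count")
    return identified / total > max_fraction


# Numbers from this thread: ~7,000,000 of ~10,000,000 reads were identified.
if too_many_hits(7_000_000, 10_000_000):
    print("Too many reads identified; input may be contaminated. Stopping.")
```

A run that trips this check could exit with an error before reaching the XML writing step instead of hanging there.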
When I looked at the file, I saw many reads that appeared to have a poly-A tail. They also appeared to be very similar to each other (i.e. many copies of the same thing). I blasted some of these and they came back as microRNAs from carp and other eukaryotes.
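For anyone who wants to screen other samples for the same two symptoms (poly-A tails and many identical reads) before running Crass, here is a rough sketch. It is not part of Crass; the helper names and the 10-base tail length are assumptions, and it assumes plain 4-line FASTQ records. It keeps every distinct sequence in memory, so it is better run on a subsample than on all 10M reads.

```python
from typing import Dict, Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def has_poly_a_tail(seq: str, min_run: int = 10) -> bool:
    """True if the read ends in at least min_run consecutive A bases."""
    return len(seq) - len(seq.rstrip("A")) >= min_run


def screen_reads(seqs: Iterable[str], min_run: int = 10) -> Dict[str, float]:
    """Report the fraction of poly-A-tailed reads and of exact duplicate reads."""
    total = 0
    poly_a = 0
    seen = set()  # distinct sequences seen so far
    for seq in seqs:
        total += 1
        seen.add(seq)
        if has_poly_a_tail(seq, min_run):
            poly_a += 1
    if total == 0:
        return {"total": 0, "poly_a_fraction": 0.0, "duplicate_fraction": 0.0}
    return {
        "total": total,
        "poly_a_fraction": poly_a / total,
        "duplicate_fraction": (total - len(seen)) / total,
    }
```

Usage would be along the lines of `screen_reads(fastq_sequences(open("subsample.fastq")))`, where `subsample.fastq` is a placeholder file name. High values for either fraction would suggest looking at the reads by hand, as was done above, before trusting the CRISPR calls.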
For some samples, Crass seems to hang (for days, until it gets killed) at the crass_patternFinder step. I tested it for fastq and fasta input of the same file. When inspecting the output, it seems that Crass runs without error and it ends with: 1m 11s I Writing XML output to "SRS140663_output/crass.crispr". However, no crass.crispr gets written. I use the latest version. I think this only happens for big samples with > 10M reads.