drivenbyentropy / aptasuite

A full-featured bioinformatics software collection for the comprehensive analysis of aptamers in HT-SELEX experiments.
https://drivenbyentropy.github.io/
GNU General Public License v3.0

AptaSuite parsing crash - invalid alphabet - help wanted #85

Closed User-89 closed 2 years ago

User-89 commented 4 years ago

Hi, I have been using aptasuite-0.9.4 for over a year now, importing multiplexed or demultiplexed NGS sequencing data as fastq files. However, I have now encountered a problem during parsing, which manifests itself as follows:

  1. While importing, all statistics for the uploaded fastq file (total processed reads, accepted reads, invalid alphabet, (...), invalid cycle) are displayed continuously and look OK.
  2. Then, all of a sudden, the count for "invalid alphabet" jumps from somewhere between 100 and 1,000 to over 200,000, and up to 2 million.
  3. Right after that, the program crashes and displays one of the following messages:

Reading configuration from file.
Instantiating MapDBAptamerPool
Processing selection cycle R5s2
Loading took 18743 milliseconds
Exception in thread "pool-2-thread-1" java.lang.NullPointerException
    at lib.parser.aptaplex.FastqReader.getNextRead(FastqReader.java:128)
    at lib.parser.aptaplex.AptaPlexProducer.run(AptaPlexProducer.java:180)
    at java.lang.Thread.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

or

Reading configuration from file.
Instantiating MapDBAptamerPool
Processing selection cycle R4s
Loading took 19226 milliseconds
Exception in thread "pool-2-thread-1" java.lang.NullPointerException
    at lib.parser.aptaplex.FastqReader.getNextRead(FastqReader.java:126)
    at lib.parser.aptaplex.AptaPlexProducer.run(AptaPlexProducer.java:180)
    at java.lang.Thread.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

So far this has happened in 10 out of 13 demultiplexed files and in 2 out of 2 multiplexed files. I checked these files (especially around the lines where AptaSuite encounters the problem) and they seem correct. To me it looks as if the AptaSuite parser "jumps over" a sequence line at some point and starts importing other lines (sequence identifier or quality scores?) as sequence lines, hence the sudden surge in invalid alphabet counts (characters outside A, C, G, T).
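To illustrate the suspected frame shift, here is a minimal sketch. It assumes a naive reader that treats the second line of every group of four as the sequence; this is an assumption for illustration, not AptaSuite's actual parser, and `count_invalid_alphabet` is a hypothetical helper. Once a single line goes missing, every later "sequence" slot lands on a separator or quality line and is counted as invalid alphabet.

```python
VALID = set("ACGT")


def count_invalid_alphabet(lines):
    """Count 'sequence' slots (line 2 of each group of four) with non-ACGT characters."""
    invalid = 0
    for i in range(1, len(lines), 4):
        if not set(lines[i]) <= VALID:
            invalid += 1
    return invalid


good = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "CCAG", "+", "IIII",
]

# Simulate damage: drop the sequence line of read2, shifting all later lines up by one.
shifted = good[:5] + good[6:]

print(count_invalid_alphabet(good))     # intact framing
print(count_invalid_alphabet(shifted))  # framing is off from read2 onwards
```

In the shifted list, both read2 and read3 are reported as invalid even though only one line is missing, which would match a counter that stays low for a long time and then climbs rapidly.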

Please let me know if you have any idea what could have happened and how to resolve this problem. The raw data uploaded to AptaSuite are always pre-processed on the UseGalaxy platform with the same workflow as before. Some tools have been updated over time, although I do not think this would affect the correctness of the fastq data imported into AptaSuite. On the Galaxy platform all fastq files look correct; they have been checked with the FastQC tool and also line by line around the positions where AptaSuite crashes.

I have attached the logs for the two samples mentioned above, along with screenshots of what it looks like when the parser crashes: log_2020-04-06_18-18-53.txt, log_2020-04-06_16-35-21.txt, R4s-error_png, Rs2-error_png

drivenbyentropy commented 4 years ago

Thank you for the detailed report. I do agree that this could be related to line skipping in the files. However, this appears to be more related to the input data than to AptaSuite.

Could you please verify the validity of the files with a FastQ validator (a tool that checks line by line for correctness)? FastQValidator comes to mind here.
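If a dedicated validator is not at hand, the kind of line-by-line check meant here can be sketched roughly as follows. This is a minimal sketch, not FastQValidator and not AptaSuite's parser: it assumes strict four-line records (no wrapped sequences) and an ACGTN alphabet, and `first_fastq_error` is a hypothetical helper name.

```python
import io

VALID_BASES = set("ACGTN")


def first_fastq_error(handle):
    """Return (line_number, message) for the first structural problem, or None."""
    line_no = 0  # number of lines consumed by complete, valid records so far
    while True:
        header = handle.readline()
        if header == "":
            return None  # clean end of file
        record = [header] + [handle.readline() for _ in range(3)]
        if any(line == "" for line in record):
            return (line_no + 1, "truncated record at end of file")
        header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
        if not header.startswith("@"):
            return (line_no + 1, "header does not start with '@'")
        if not set(seq.upper()) <= VALID_BASES:
            return (line_no + 2, "sequence contains characters outside ACGTN")
        if not plus.startswith("+"):
            return (line_no + 3, "separator does not start with '+'")
        if len(seq) != len(qual):
            return (line_no + 4, "sequence and quality lengths differ")
        line_no += 4


# Example on in-memory data: the second record is missing its sequence line.
damaged = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\n+\nIIII\n@r3\nCCAG\n+\nIIII\n")
print(first_fastq_error(damaged))
```

For a real file, the same function can be given a handle from `open(path)` or, for compressed data, `gzip.open(path, "rt")`, where `path` is a placeholder for your own file. Note that a trailing blank line at the end of the file is reported as a truncated record by this sketch.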

Thank you and please let me know what you find.