drivenbyentropy / aptasuite

A full-featured bioinformatics software collection for the comprehensive analysis of aptamers in HT-SELEX experiments.
https://drivenbyentropy.github.io/
GNU General Public License v3.0

[New Feature Suggestion] Undetermined Sequence, Reason for Dumping #79

Open Eggsorer opened 4 years ago

Eggsorer commented 4 years ago

TL;DR: I would like a feature that prints the reason why aptasuite rejected each rejected sequence, plus a summary of the rejected sequences.

I have been using the feature that exports undetermined sequences to a fastq.gz file:

AptaplexParser.UndeterminedToFile = False

My main reason for using this feature is to monitor the quality of the library. For that I need to know why certain sequences are rejected, and what percentage of the rejected sequences each type of rejection accounts for. That is hard to do now, because aptasuite only outputs the rejected sequences without giving a reason.

Take my current work as an example. I have a round where aptasuite rejected 10 million reads. A gel I ran before sequencing suggests that about 40% of the RNA might be shorter than my aptamer. I want to know what those sequences are and whether they contain the aptamer primers. I also want to know, for the sequences rejected because of too many mutations in the primer, how many mutations they carry.

With the current aptasuite features I do get the rejected sequences exported, but only as a fastq file. It is too large to process with by-eye-informatics, so I would still need to write a script to process it, and that is a daunting task for me. I think that in aptasuite, within the IF branch where you dump a rejected sequence, it would be easy to also print the reason for rejection, either in a separate file or inserted into the fastq file as a 5th line. The latter would be easier to process with a script, and anyone who wants a real fastq file can simply use sed/awk to finish the job.
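To make the "5th line" idea concrete, here is a minimal Python sketch of how such a file could be consumed. Everything about the format is an assumption for illustration: aptasuite does not currently write a reason line, and the record layout, the function names (`read_annotated_fastq`, `summarize_reasons`, `to_plain_fastq`), and the reason labels (`too_short`, `primer_mismatch`) are all hypothetical.

```python
import io
from collections import Counter
from typing import Dict, Iterable, Iterator, NamedTuple, TextIO, Tuple


class RejectedRead(NamedTuple):
    header: str
    sequence: str
    separator: str
    quality: str
    reason: str  # hypothetical 5th line; not part of standard FASTQ


def read_annotated_fastq(handle: TextIO) -> Iterator[RejectedRead]:
    """Yield records from a hypothetical 5-lines-per-read file."""
    while True:
        chunk = [handle.readline() for _ in range(5)]
        if not chunk[0]:  # clean end of file
            return
        if not chunk[4]:  # file ended in the middle of a record
            raise ValueError("truncated record near " + chunk[0].strip())
        yield RejectedRead(*(line.rstrip("\n") for line in chunk))


def summarize_reasons(
    records: Iterable[RejectedRead],
) -> Dict[str, Tuple[int, float]]:
    """Map each reason to (count, percentage of all rejected reads)."""
    counts = Counter(record.reason for record in records)
    total = sum(counts.values())
    return {reason: (n, 100.0 * n / total) for reason, n in counts.items()}


def to_plain_fastq(records: Iterable[RejectedRead], out: TextIO) -> None:
    """Write standard 4-line FASTQ by dropping the reason line."""
    for r in records:
        out.write(f"{r.header}\n{r.sequence}\n{r.separator}\n{r.quality}\n")


if __name__ == "__main__":
    # Made-up example input; the reason labels are placeholders.
    sample = (
        "@read1\nACGT\n+\nIIII\ntoo_short\n"
        "@read2\nACGTACGT\n+\nIIIIIIII\nprimer_mismatch\n"
        "@read3\nAC\n+\nII\ntoo_short\n"
    )
    reads = list(read_annotated_fastq(io.StringIO(sample)))
    for reason, (count, percent) in summarize_reasons(reads).items():
        print(f"{reason}\t{count}\t{percent:.1f}%")
```

For a gzipped export, the same functions should work on a handle opened with `gzip.open(path, "rt")` from the standard library. Keeping the reason on its own line means the summary needs no sequence analysis at all, and `to_plain_fastq` plays the role of the sed/awk step that recovers a regular fastq file.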

Let me know what you think of this suggestion. And thank you for this project.