drivenbyentropy / aptasuite

A full-featured bioinformatics software collection for the comprehensive analysis of aptamers in HT-SELEX experiments.
https://drivenbyentropy.github.io/
GNU General Public License v3.0

[New Feature Needed] T7 Promoter Exclusion in AptaCluster #50

Open Eggsorer opened 6 years ago

Eggsorer commented 6 years ago

It seems that AptaCluster performs clustering on the entire DNA sequence. For RNA aptamers, the DNA sequence contains the T7 promoter, which does not appear in the RNA aptamer molecules. So I think it is necessary to exclude the T7 promoter sequence from clustering (at least for RNA aptamers). Right now, the T7 promoter sequence is treated as the 5' primer. I think users should be given the option to enter the T7 promoter sequence and the 5' fixed region of the aptamer separately. The sequences we cluster on would then be the aptamer sequences (the combination of 5' fixed region, random region, and 3' fixed region).

In this regard, I think users should also be given the option of whether or not to combine the reads of sequences that have the same aptamer sequence but different T7 promoter sequences. Since the sequences generated by HTS come from cDNA derived from the RNA aptamer pool, a mutated T7 promoter is most likely due to an error in sequencing or in primer manufacturing. The RNA aptamer sequences from which the cDNA is derived are the same regardless. So perhaps we can combine the reads of cDNA sequences with different T7 promoter sequences but the same RNA aptamer sequence.

Personally, the first request is what I strongly need. But I feel that we need to discuss this thoroughly. Let me know what you think.
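
To make the request concrete, here is a minimal Python sketch of the idea, not AptaSuite code. It assumes the promoter can be dropped by locating the first exact occurrence of the 5' fixed region in each read; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def strip_promoter(read: str, five_prime_fixed: str) -> Optional[str]:
    """Return the read from the start of the 5' fixed region onward.

    Assumption: the 5' fixed region occurs exactly (no mismatches) and its
    first occurrence marks the start of the aptamer. Returns None otherwise.
    """
    start = read.find(five_prime_fixed)
    if start < 0:
        return None
    return read[start:]


def merge_by_aptamer(reads: Iterable[str], five_prime_fixed: str) -> Counter:
    """Count reads per aptamer sequence, ignoring whatever precedes the 5' fixed region."""
    counts: Counter = Counter()
    for read in reads:
        aptamer = strip_promoter(read, five_prime_fixed)
        if aptamer is not None:  # reads without the fixed region are skipped
            counts[aptamer] += 1
    return counts


# Toy data (hypothetical): promoter "TAATACG", 5' fixed "CTGAC", 3' fixed "TTGG".
reads = [
    "TAATACGCTGACAAAATTGG",  # intact promoter
    "TAATTCGCTGACAAAATTGG",  # mutated promoter, same aptamer as above
    "TAATACGCTGACGGGGTTGG",  # different random region
    "TAATACGAAAAAAAA",       # 5' fixed region missing
]
counts = merge_by_aptamer(reads, "CTGAC")
```

With this toy input, the first two reads collapse into one aptamer sequence with a count of 2, which is the merging behaviour described above. Clustering would then run on the keys of `counts` rather than on the full DNA reads.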

drivenbyentropy commented 6 years ago

Hey, you are absolutely correct: AptaCluster should only take the randomized region into account. This was indeed the case with the reference implementation of the original paper, and I was convinced I had implemented it this way in AptaSuite, but looking at the code, it seems I missed this somehow.

I will fix this and release a new version.

Good catch, thank you!

Eggsorer commented 6 years ago

Just to make sure that we have a mutual understanding: I think that AptaCluster should not take just the random region into account. Instead, it should cluster the 5' fixed/primer region, random region, and 3' fixed/primer region all together. However, the T7 promoter region, which precedes the 5' fixed region, should be excluded.

Theoretically, the 5' primer and 3' primer regions should be exactly the same across all sequences. Any mutation in those regions that shows up in the sequencing results can only come from either sequencing errors or errors in primer synthesis. While we can ignore the former situation, we must still pay attention to the latter, because in that case the aptamers with mutated primer regions do physically exist. That being said, I think there is no way of telling whether it is a primer synthesis problem or a sequencing problem. So perhaps you can leave it as an option, like "DiscardErraticPrimer=True/False".
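
As a rough illustration of how such an option could behave, here is a small Python sketch. It is not an existing AptaSuite setting: the parameter `discard_erratic_primer` simply mirrors the name suggested above, and the sketch assumes the T7 promoter has already been removed so that each sequence starts with the 5' primer and ends with the 3' primer.

```python
from typing import List


def filter_by_primers(
    aptamers: List[str],
    five_prime: str,
    three_prime: str,
    discard_erratic_primer: bool = True,
) -> List[str]:
    """Keep sequences for clustering depending on whether their primers are exact.

    Hypothetical option: when discard_erratic_primer is True, sequences whose
    5' or 3' primer region deviates from the expected primer are dropped;
    when False, they are kept, since they may physically exist in the pool.
    """
    kept = []
    for seq in aptamers:
        exact = seq.startswith(five_prime) and seq.endswith(three_prime)
        if exact or not discard_erratic_primer:
            kept.append(seq)
    return kept


# Toy data (hypothetical): 5' primer "CTGAC", 3' primer "TTGG".
sequences = [
    "CTGACAAAATTGG",  # both primers exact
    "CTGTCAAAATTGG",  # mismatch in the 5' primer
    "CTGACGGGGTTGA",  # mismatch in the 3' primer
]
strict = filter_by_primers(sequences, "CTGAC", "TTGG", discard_erratic_primer=True)
lenient = filter_by_primers(sequences, "CTGAC", "TTGG", discard_erratic_primer=False)
```

Here `strict` keeps only the first sequence, while `lenient` keeps all three. Since sequencing errors and synthesis errors cannot be told apart from the reads alone, the sketch leaves that decision entirely to the user.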