drivenbyentropy / aptasuite

A full-featured bioinformatics software collection for the comprehensive analysis of aptamers in HT-SELEX experiments.
https://drivenbyentropy.github.io/
GNU General Public License v3.0

Reverse-complemented sequences in raw data #54

Closed PJpb closed 6 years ago

PJpb commented 6 years ago

As promised - here we go... ;)

Hi, I found that in my dataset (paired-end FASTQ from Illumina), around half of the sequences are reverse-complemented with respect to the actual selection library direction (which is not unexpected), i.e. they are in the correct direction in the second "reverse" file. It seems that AptaSuite does not take that into account: during parsing it discards the reverse-complemented sequences as not matching the 3' primer. I've tried entering both the actual 3' primer and its reverse complement, but neither works. Is there anything I can do to make it work, apart from using other tools to reverse-complement the sequences that are not forward-oriented?

Best regards, PJ
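For reference, the external workaround mentioned in the question (reverse-complementing reads before import) could look roughly like the Python sketch below. It assumes standard four-line FASTQ records and an ACGTN alphabet; the function names are illustrative and not part of AptaSuite.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_record(header: str, seq: str, plus: str, qual: str):
    """Reverse-complement one four-line FASTQ record.

    The quality string is reversed (not complemented) so that each
    score stays aligned with the base it belongs to.
    """
    return header, reverse_complement(seq), plus, qual[::-1]
```

Applying `reverse_complement_record` only to reads that match the reverse-complemented primers would leave the forward-oriented reads untouched.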

drivenbyentropy commented 6 years ago

Hi :+1:

This feature was available in the original implementation of AptaPlex, but I had not yet gotten around to re-implementing it here. This ticket is the perfect motivator. I have created a preliminary version of AptaPlex that optionally takes reverse-complemented reads into account, but I have no data to test it on. Would you be able to provide a small sample that contains this type of data for me to test before I push the update?

Thanks and all the best!

PJpb commented 6 years ago

I'll do it this week, hopefully.

PJpb commented 6 years ago

Hi, Please find sample data below: SampleFwdRev.txt

Forward primer: GTCTCCATTCTAATGATC
Reverse primer: AGACATGCCTTATTAGCG

The structure of the sequenced amplicons is as follows:

(3-7Ns)-(6nt barcode)-(18nt primer Fwd/Rev)-(40nt random region)-(18nt primer)-(6nt barcode)-(3-7Ns)

The barcodes are the same on both ends (although reverse-complemented at the 3' end).

If there's anything you need to implement it please let me know.

drivenbyentropy commented 6 years ago

Hi @PJpb ,

I am currently working on implementing this feature, however I believe the reverse primer of your test set is not the one you provided but should be TATCGGCGGAATGCACTC (5' to 3' of the read direction). Could you please verify this? With the one you provided I am not getting any accepted reads.

Thanks!

drivenbyentropy commented 6 years ago

Looking further into this data, I believe I need some more assistance. I have printed out the reads which AptaPlex fails to parse. My expectation was that most of these reads would be reverse complements of the following form (ignoring barcodes and the 3-7nt regions): reverse complement of the above 3' primer ---- 40nt random region ---- reverse complement of the above 5' primer.

However, what I am observing is the following pattern (again, I removed everything but the primers and RR):

ACTCTTGCCTCGGCAGCT AGGCAGGCCGTGGACTTTAACGTCGACACGTGCGGGGGGC CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CAAGCGACTGCCCGATGCTTGACGGACCCGTCGGGGCCGC GCTAATAAGGCATGTCTC
ACTCTTGCCTCGGCAGCT CGCCAGATCCCGCGGGTTGAAGAGTAGCAGGAGGGTCAGA CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT GAACAGCACATAAGGACGTGGCTTAGCCCCCGAACAGAGC CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CGGTGGCTGTCCCCATGGGGCGTGACCAGATCTCTTTCGA CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CGGTGCACGACACCCCTTCCGGATCCAGATAGGAGGGCGG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT GGACTTAGTTCAGCACGGAGCAGCATGCTAGGGGCGTGCG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT GAGCCAAGCAGTGTGCGAACCACGTGGAGGCCGGAGGCAG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CGGCGAACAAGCGTCTGGGGCACGAGGGCCCCGGGGCCAG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT GGACAGACGACCTGCAATTCAGGTGACGGTAGTAGGTCAC CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT GCGAATGTTTCGGGCCTGCGGAGTAGAGGCCCCCAGCCCG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT AAACGCGAGTTGTTGCGCGTTTCCGGCACCCTCGGCGCAG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT AAACGACACGGCGCACGGGGCCTAACCACTGGCGTGGGCG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CGCGACTTGGACGACCCACACCGGTTGGCATGGCGTTAGA CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT AACCCCAGGACGGCGATACACGGATCAAAAGTGATTGGAC CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CAAAACGCGGCAATCGGCCATGAACGCCCACGGCCAGGCA CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT CCTCAGCGAGCGAACCCCTTCCTCGCATCATAGGACCCCC CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT ACATTTTCGTTATACCTCTCCGTTCGGCCGCGAGGTAACG CGCTAATAAGGCATGTCT
ACTCTTGCCTCGGCAGCT ACAGGTTCCACCCGAGCACGTCGAGGCAGCAGATTAGAAA CGCTAATAAGGCATGTCT

Here, ACTCTTGCCTCGGCAGCT does not correspond to the reverse complement of the 3' primer, and CGCTAATAAGGCATGTCT does not correspond to the reverse complement of the 5' primer. I was hoping you could help me figure out what I am missing.

Thanks!

PJpb commented 6 years ago

The data is wrong and this is my fault, sorry. Will provide new file tomorrow. PJ

PJpb commented 6 years ago

Please disregard the previous file completely, the mistaken data was due to my error in "encoding" it, not due to your reasoning or understanding. I'm really sorry to have wasted your time there.

Please find a different, hopefully correct file: SampleSet5.txt

And the primers, in the orientation of the physical primer:

Fwd: TTAATGGATGGCTCCAGG
Rev: GATAGTACATCGGGACGG

Example reads, with primers as provided in bold, and reverse-complement of the primers in italics.

Correct orientation (the "selected" orientation or the "positive" strand):

TTTTGAGCATTAATGGATGGCTCCAGGAGCTCTGTTTGAATACGACTTTGCATTTCGAGCGAATATTCCGTCCCGATGTACTATCTGCTCAAGTAGATCGG

Incorrect orientation (the "non-selected" strand or the "negative" strand):

CGAGCTACGTGATAGTACATCGGGACGGCCTCGTACACGGCCACACGTTGGTTGATTGGGAGAACGCTCCTGGAGCCATCCATTAAACGTAGTACTAGATC

I hope it's all fine now, once again I'm sorry for the confusion!

edit: also, according to my quick search, about 3.5% of the reads do not contain a full primer or contain some mutation in the primer region (I ran a plain exact-text search for both primers, and the counts fall short of the total number of reads in the file provided by about 350 out of 10,000).
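The orientation check and the exact-text search described in this comment can be sketched in Python as follows. This is only an illustration built from the two primers and example reads given above; the function names are hypothetical and this is not how AptaPlex itself is implemented.

```python
from collections import Counter

FWD = "TTAATGGATGGCTCCAGG"  # forward primer, as given in this comment
REV = "GATAGTACATCGGGACGG"  # reverse primer, as given in this comment

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify(read):
    """Label a read 'positive', 'negative', or 'unmatched' by exact primer search."""
    if FWD in read and reverse_complement(REV) in read:
        return "positive"
    if REV in read and reverse_complement(FWD) in read:
        return "negative"
    return "unmatched"


def random_region(read):
    """Return the region between the primers in selection orientation, or None."""
    kind = classify(read)
    if kind == "unmatched":
        return None
    if kind == "negative":
        read = reverse_complement(read)  # flip into selection orientation
    start = read.find(FWD) + len(FWD)
    end = read.find(reverse_complement(REV), start)
    return read[start:end] if end != -1 else None


def count_orientations(reads):
    """Count how many reads fall into each orientation class."""
    return Counter(classify(r) for r in reads)
```

Running `count_orientations` over all reads in a sample gives the share of reads that contain neither primer pair exactly, which corresponds to the kind of count quoted in the edit above.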

drivenbyentropy commented 6 years ago

No worries, I just wanted to make sure we were on the same page :)

I believe this feature is now working and will be included in release AptaSuite v0.9.1 and above. The Wiki has also been updated to reflect the fact that a new option for AptaPlex can now be used to activate this function (it is disabled by default as it adds approximately twice the runtime to the parsing process). You can add AptaplexParser.CheckReverseComplement = True to your configuration file to change this behavior, or select the option in the Advanced Options under the Parser Options Tab.

As for your test set (thanks again!), without this feature active I get the following recovery rate:

Starting AptaPlex:
Parsing...
Total Reads:            Accepted Reads:         Contig Assembly Fails:  Invalid Alphabet:       5' Primer Error:        3' Primer Error:        Invalid Cycle:          Total Primer Overlaps:
10000                   4865                    0                       0                       5016                    115                     0                       1
Parsing Completed in 2.003 seconds.

With AptaplexParser.CheckReverseComplement = True, this changes to

Starting AptaPlex:
Parsing...
Total Reads:            Accepted Reads:         Contig Assembly Fails:  Invalid Alphabet:       5' Primer Error:        3' Primer Error:        Invalid Cycle:          Total Primer Overlaps:
10000                   9713                    0                       0                       246                     37                      0                       1
Parsing Completed in 28.014 seconds.
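To make the two runs easier to compare, the short Python sketch below computes the acceptance rate and the runtime ratio from the figures printed above. The dictionary keys are illustrative labels, not AptaSuite output fields.

```python
TOTAL_READS = 10000  # "Total Reads" in both runs

# Accepted reads and parse time copied from the two AptaPlex outputs above.
runs = {
    "default": {"accepted": 4865, "seconds": 2.003},
    "check_reverse_complement": {"accepted": 9713, "seconds": 28.014},
}


def acceptance_rate(accepted, total=TOTAL_READS):
    """Fraction of reads that AptaPlex accepted."""
    return accepted / total


for name, run in runs.items():
    rate = acceptance_rate(run["accepted"])
    print(f"{name}: {rate:.2%} accepted in {run['seconds']} s")

# Ratio of the two reported parse times on this sample.
slowdown = runs["check_reverse_complement"]["seconds"] / runs["default"]["seconds"]
print(f"runtime ratio: {slowdown:.1f}x")
```

On this 10,000-read sample the acceptance rate rises from 48.65% to 97.13%, while the two reported parse times differ by a factor of roughly 14.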

Regarding the incomplete/mutated primer regions, I have not played with changing the primer tolerances (again under Advanced Options in the import wizard or changing the parameter AptaplexParser.PrimerTolerance) but feel free to give some other values a go and let me know if that improves things.
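As an illustration of what a primer tolerance can mean in practice, the Python sketch below searches a read for a primer while allowing a limited number of mismatches. It assumes that the tolerance is a maximum count of substituted bases; how AptaPlex actually interprets AptaplexParser.PrimerTolerance is not described in this thread, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_primer(read, primer, tolerance=0):
    """Return the first index where `primer` matches `read` with at most
    `tolerance` mismatches, or -1 if there is no such position.

    Only substitutions are considered; insertions and deletions are not.
    """
    k = len(primer)
    for i in range(len(read) - k + 1):
        if hamming(read[i:i + k], primer) <= tolerance:
            return i
    return -1
```

With `tolerance=0` this behaves like an exact-text search; raising it to 1 or 2 would also accept reads with a point mutation in the primer region, at the cost of a higher chance of spurious matches.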

drivenbyentropy commented 6 years ago

Closing 149872af9e244c0d6076b2dbe754d1c2f8c13a22.