drivenbyentropy / aptasuite

A full-featured bioinformatics software collection for the comprehensive analysis of aptamers in HT-SELEX experiments.
https://drivenbyentropy.github.io/
GNU General Public License v3.0

Exclude primers from sequence comparison #7

Closed PJpb closed 6 years ago

PJpb commented 6 years ago

Hi, I was pretty surprised to find out that the clustering algorithm differentiates between sequences with the same random region and different primer regions (misreads). Example:

Aptamer12004 GTATACCTGCAGCTGAGG_ GCAACACGTGGCAATAGGCTGTGCTGTGTTAGGTGCTGTGATAT GATGACACTACGTGACCA 478
Aptamer592589 GGTAACCTGCAGCTGAGG_ GCAACACGTGGCAATAGGCTGTGCTGTGTTAGGTGCTGTGATAT GATGACACTACGTGACCA 3

In these two sequences, which belong to the same cluster, the random region is exactly the same, while there are some misread bases in the 5' primer region, which leads to the two sequences being considered distinct aptamers.

At the same time, when clusters are exported, the combined cluster size (i.e. the sum of the RPM of all cluster members) is not calculated, and the export to a text file (filtered via Export.MinimalClusterSize) is based not on the cluster's total RPM but on the number of distinct members (which can differ in either the random region or the primer regions). In theory, this means that a cluster with 21 members of 2 RPM each is exported to the text file, while a cluster with 19 members of 1000 RPM each is not (Export.MinimalClusterSize = 20), even though the second cluster is probably much more important. Don't you think this could be misleading, causing confusion on the one hand and, on the other, missing important high-frequency clusters? I would love to see another export file (or an option) with the total cluster RPM calculated and only seed sequences exported.

I hope this motivates you to keep developing the software rather than discouraging you! Thanks a lot! :)

PJ

drivenbyentropy commented 6 years ago

Hi,

The clustering algorithm does not differentiate between aptamers with identical randomized region but different primers. In fact, AptaCLUSTER does not take any primer regions into account during its clustering procedure.

However, AptaSUITE itself does store these two sequences as distinct molecules. This is by design, for the following reason. Since primers are added prior to incubation with the target and are therefore present during the selection, there is a non-zero probability that the mutation in the primer region was introduced into the pool during the SELEX experiment itself (assuming that this mutation does not lead to loss of amplification) and that it is not a sequencing error. This mutant hence undergoes selection and competes with the remaining species in the pool. Since affinity and specificity are a function of both sequence and structure, we cannot treat the two aptamers as the same because that mutation might have led to a conformational change. Hence any algorithm utilizing structural information (e.g. AptaTRACE) must treat these sequences as distinct entities to reflect their structural differences.

If you wish to avoid this distinction, you would have to manually preprocess your data and replace any mismatches in the primer regions (sorry).
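As a rough illustration of such a preprocessing step, here is a minimal Python sketch. It is not part of AptaSUITE: the function names, the mismatch threshold, and the choice of the primers from the 478-count read above as the "canonical" ones are all assumptions made for this example. It handles substitutions only (no insertions or deletions), and it deliberately discards any primer mutation that might be real.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def normalize_primers(read: str, primer5: str, primer3: str,
                      max_mismatches: int = 3) -> str:
    """Overwrite near-matching primer regions with the canonical primers.

    Hypothetical helper: the read is returned unchanged if it is too short
    or if either flank differs from its primer by more than max_mismatches.
    """
    if len(read) < len(primer5) + len(primer3):
        return read
    split5 = len(primer5)
    split3 = len(read) - len(primer3)
    head, core, tail = read[:split5], read[split5:split3], read[split3:]
    if (hamming(head, primer5) <= max_mismatches
            and hamming(tail, primer3) <= max_mismatches):
        return primer5 + core + primer3
    return read


# Sequences taken from the example in this issue.
PRIMER5 = "GTATACCTGCAGCTGAGG"   # assumed canonical 5' primer
PRIMER3 = "GATGACACTACGTGACCA"   # assumed canonical 3' primer
RANDOM = "GCAACACGTGGCAATAGGCTGTGCTGTGTTAGGTGCTGTGATAT"

reference = PRIMER5 + RANDOM + PRIMER3
variant = "GGTAACCTGCAGCTGAGG" + RANDOM + PRIMER3  # 3 mismatches in the 5' primer

corrected = normalize_primers(variant, PRIMER5, PRIMER3)
print(corrected == reference)
```

After this step both reads are identical, so they would be counted as a single molecule when the data is parsed. Lowering `max_mismatches` makes the correction more conservative.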

As for the export functionality, you have raised a good point. I have extended the options according to your request and you can now define what 'cluster size' means. Currently the options are cluster diversity (corresponding to the old behavior) and cluster cardinality (the sum over all aptamer sizes of that cluster). Please have a look at the Wiki, which I have extended to reflect this new feature (parameter Export.ClusterFilterCriteria).
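To make the difference between the two definitions concrete, here is a small Python sketch using the numbers from the original report. It is only an illustration, not the AptaSUITE implementation: the function names and the criterion strings are hypothetical (they are not the actual values accepted by Export.ClusterFilterCriteria), and the `>=` comparison against the threshold is an assumption.

```python
def cluster_diversity(member_sizes: list[int]) -> int:
    """Number of distinct aptamers in the cluster (the old behavior)."""
    return len(member_sizes)


def cluster_cardinality(member_sizes: list[int]) -> int:
    """Sum over the sizes of all aptamers in the cluster."""
    return sum(member_sizes)


def passes_filter(member_sizes: list[int], minimal_cluster_size: int,
                  criterion: str) -> bool:
    """Hypothetical export filter; criterion names are illustrative only."""
    if criterion == "diversity":
        size = cluster_diversity(member_sizes)
    elif criterion == "cardinality":
        size = cluster_cardinality(member_sizes)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return size >= minimal_cluster_size  # assumed inclusive threshold


# The two clusters from the report, with a threshold of 20.
small_cluster = [2] * 21      # 21 members, size 2 each
large_cluster = [1000] * 19   # 19 members, size 1000 each

for name, cluster in (("small", small_cluster), ("large", large_cluster)):
    for criterion in ("diversity", "cardinality"):
        print(name, criterion, passes_filter(cluster, 20, criterion))
```

Filtering by diversity keeps only the 21-member cluster, whereas filtering by the summed sizes keeps both, including the 19-member, high-frequency cluster that was previously dropped.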

The new version containing the feature is v0.4.5.

Let me know if you have any additional comments.

Thanks!