dsurujon / DNA_FT

Fourier Transform of a DNA sequence

Using DNA_FT.py #1

Open anandksrao opened 3 years ago

anandksrao commented 3 years ago

Could you please provide the following information:

  1. example syntax for DNA_FT.py
  2. example input and output file pair
  3. list of dependencies for installation (preferably using downloads all from anaconda.org)

The terminal output from my run is shown below, which is why I am asking for the items listed above. Thanks in advance.

(base) MacBook-Pro:DNA_FT-master anand$ ./DNA_FT.py example.fasta 
./DNA_FT.py: line 1: import: command not found
./DNA_FT.py: line 2: import: command not found
./DNA_FT.py: line 3: import: command not found
from: can't read /var/mail/numpy.fft
from: can't read /var/mail/pylab
./DNA_FT.py: line 6: import: command not found
./DNA_FT.py: line 7: import: command not found
./DNA_FT.py: line 10: syntax error near unexpected token `('
./DNA_FT.py: line 10: `f=open(filename) '

dsurujon commented 3 years ago

I can answer 1 here: the script has to be run through the Python interpreter, so you need a preceding "python" before the command, e.g. python ./DNA_FT.py example.fasta
This is a pretty old piece of code, so I'll rewrite it and add some documentation so it's actually usable. I wrote this in Python 2.7 (which is pretty outdated now), and I'm thinking of updating it to Python 3. Do you happen to have Python 2 or 3?
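
For context on the errors above: when a file is launched as ./script.py without a "#!" first line, the shell tries to interpret it as a shell script, which is why "import" shows up as "command not found". Below is a minimal sketch of a Python 3 entry point that can be launched either way. It is not the actual DNA_FT.py; the read_fasta helper and the usage message are hypothetical and only illustrate the pattern.

```python
#!/usr/bin/env python3
"""Hypothetical entry-point skeleton; this is NOT the actual DNA_FT.py."""
import sys


def read_fasta(path):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A header line starts a new record.
                name = line[1:]
                records[name] = []
            elif name is not None:
                # Sequence lines are collected and joined at the end.
                records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def main(path):
    """Print the name and length of every record in the FASTA file."""
    for name, seq in read_fasta(path).items():
        print(f"{name}\t{len(seq)} bp")


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python3 this_script.py <sequences.fasta>", file=sys.stderr)
```

With the shebang line in place and the file marked executable (chmod +x), ./this_script.py example.fasta and python3 this_script.py example.fasta should behave the same.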

anandksrao commented 3 years ago

Thanks for your reply. I happen to have both python 2 and 3, please see below for details.

MacBook-Pro:DNA_FT-master anand$ python --version
Python 2.7.13 :: Anaconda custom (x86_64)
MacBook-Pro:DNA_FT-master anand$ python3 --version
Python 3.8.3

Thanks even more for offering to add documentation and re-writing it to make it more usable. I truly appreciate it.

Going beyond conversion of DNA sequences to FT plots, I have some questions for you please. I am asking you, because I think you understand 2 very different subject areas - digital signal processing and biological sequences, which is rare :)

Would you happen to know if tools exist for the following tasks, and if yes, from where I can download and use them?

  1. Convert DNA sequence to a digital signal (as your DNA_FT.py does)
  2. Align 2 such transformed digital signals, and assign a score
  3. Visualize aforementioned alignment that can be interpreted by the naked eye
  4. Use steps 1-3 to determine whether the alignment score is better than a user-defined threshold to consider the signal pair to be 'similar'
  5. If steps 1-4 are possible, then can't such analyses be routinely used to discover sequence matches in large genomic DNA against shorter DNA sequence queries?
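
As a rough illustration of step 1, here is a minimal sketch of one common way to turn a DNA string into numeric signals: four binary indicator sequences (one per base) whose Fourier power spectra are summed. Whether DNA_FT.py uses exactly this encoding is an assumption; the function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def indicator_signals(seq):
    """Return a 4 x N array; row k is 1.0 where seq has BASES[k], else 0.0."""
    seq = seq.upper()
    return np.array([[1.0 if ch == base else 0.0 for ch in seq] for base in BASES])


def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the four indicator signals."""
    spectra = np.fft.fft(indicator_signals(seq), axis=1)
    return (np.abs(spectra) ** 2).sum(axis=0)


# Toy input: a perfectly 3-periodic sequence of length 60.
spectrum = power_spectrum("ATG" * 20)
peak_index = int(np.argmax(spectrum[1:31])) + 1  # skip the DC term at index 0
```

For this toy input the peak lands at index 20 of 60, i.e. a period of 3 bases. Steps 2-4 would then operate on such spectra (or on the indicator signals themselves) rather than on the letters.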

Please note that steps 1-5 outlined above are performed routinely by current bioinformatics tools, which can be broadly classified into three (related but different) approaches:

So I wonder, for the purpose of biological sequence matching and discovery:

  1. whether signal processing tools can equal the performance of existing bioinformatics tools?
  2. whether signal processing approaches can surpass the performance of existing bioinformatics tools? If yes, for which applications?

Thanks!

dsurujon commented 3 years ago

Interesting, that's a very cool idea! A quick search got me to this paper on using Euclidean distances between the spectra of a pair of sequences. Let me first fix up the main script that plots the Fourier transform of a single sequence, and then I can also add a sequence-pair comparison :)
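
To make the idea concrete, here is a minimal sketch of a Euclidean distance between the power spectra of two sequences. It is not the method from the paper: resampling both spectra onto a common frequency grid with np.interp (so that sequences of different lengths can be compared) and normalising each profile to unit sum are assumptions made for illustration, and all function names are hypothetical. It expects sequences of at least a few bases.

```python
import numpy as np


def power_spectrum(seq):
    """Summed power spectrum of the four A/C/G/T indicator signals."""
    seq = seq.upper()
    total = np.zeros(len(seq))
    for base in "ACGT":
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        total += np.abs(np.fft.fft(indicator)) ** 2
    return total


def spectral_profile(seq, bins=64):
    """Positive-frequency spectrum resampled onto a fixed grid, scaled to sum to 1."""
    half = len(seq) // 2
    spec = power_spectrum(seq)[1 : half + 1]      # drop the DC term
    freqs = np.arange(1, half + 1) / len(seq)     # cycles per base
    grid = np.linspace(0.0, 0.5, bins)            # common grid (assumption)
    profile = np.interp(grid, freqs, spec)
    total = profile.sum()
    return profile / total if total > 0 else profile


def spectral_distance(seq_a, seq_b, bins=64):
    """Euclidean distance between the two resampled spectral profiles."""
    diff = spectral_profile(seq_a, bins) - spectral_profile(seq_b, bins)
    return float(np.linalg.norm(diff))
```

A distance of 0.0 means the two profiles are identical on the chosen grid; it does not by itself say where, or whether, the sequences align.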
This alignment-free approach to sequence comparison would probably be better in terms of time and memory, if implemented well, since alignment-based approaches like BLAST are pretty resource-intensive. Another alignment-free approach is k-mer set comparison (example - mash). I think what you consider "better" beyond a resource utilization perspective might depend on the context. What kinds of sequences are you comparing? Do you expect high levels of identity? Large differences in the presence/absence of genetic elements?
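
For comparison, a minimal sketch of k-mer set comparison is given below. It computes the exact Jaccard similarity between the k-mer sets of two sequences; Mash estimates this quantity from MinHash sketches instead of full sets, so this is the underlying idea rather than Mash's implementation. The function names and the default k are illustrative choices.

```python
def kmer_set(seq, k=4):
    """Set of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=4):
    """Shared k-mers divided by total distinct k-mers (1.0 = identical sets)."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        # Both sequences are shorter than k; treat them as trivially identical.
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity("ACGTAC", "ACGTTT"))  # 1 shared k-mer out of 5 -> 0.2
```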

anandksrao commented 3 years ago

Thanks for those links... k-mer based sequence alignment, genome assembly etc are quite prevalent today in bioinformatics, as you may be aware. I do not think they would drastically improve anything beyond what is already possible today. And therefore I am way more interested in the potential for using DNA sequences transformed to digital signals, and their use in turn, for sequence alignment, with the explicit goal of discovering sequence matches.

Please note that in all bioinformatics analyses, discovery of matching sequences is the 1st step, only then followed by clustering of the newly discovered matches as the 2nd step.

An example task may be: I have a set of 1000 query DNA sequences, each 2-20K long. My search space (aka subject) is a constant 200M long DNA sequence. For each subject<->query pair, can I set up a way to dichotomously determine whether they match, i.e. align well or not? If yes, take all the positively matching queries and cluster them into sub-groups. If not, discard those queries from further analyses. I hope this example makes sense. One may simplify this example or make it more complex, depending on what would be a good jump-off point for empirically testing signal processing transformations and tools on this sort of task.
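
One signal processing route to the yes/no decision in this example is FFT-based cross-correlation of base indicator signals, which counts identical bases at every ungapped offset of the query along the subject in one pass. The sketch below is only an illustration under stated assumptions: it ignores insertions, deletions and the reverse strand, the 0.9 identity threshold is arbitrary, and the function names are hypothetical. A 200M subject would in practice need to be processed in chunks.

```python
import numpy as np


def match_counts(subject, query):
    """Identical-base count for each ungapped placement of query on subject."""
    subject, query = subject.upper(), query.upper()
    n, m = len(subject), len(query)
    if m == 0 or m > n:
        raise ValueError("query must be non-empty and no longer than subject")
    s_arr = np.frombuffer(subject.encode("ascii"), dtype=np.uint8)
    q_arr = np.frombuffer(query.encode("ascii"), dtype=np.uint8)
    counts = np.zeros(n)
    for base in b"ACGT":  # iterating over bytes yields the integer codes
        s_ind = (s_arr == base).astype(float)
        q_ind = (q_arr == base).astype(float)
        # Cross-correlation via the FFT: S * conj(Q), query zero-padded to n.
        product = np.fft.rfft(s_ind, n) * np.conj(np.fft.rfft(q_ind, n))
        counts += np.fft.irfft(product, n)
    # Offsets beyond n - m would wrap around, so they are discarded.
    return np.rint(counts[: n - m + 1]).astype(int)


def is_match(subject, query, min_identity=0.9):
    """Return (matched, best_offset, identity) for the best ungapped placement."""
    counts = match_counts(subject, query)
    offset = int(counts.argmax())
    identity = float(counts[offset]) / len(query)
    return identity >= min_identity, offset, identity
```

The queries for which is_match returns True would be the ones passed on to the clustering step.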

I am certainly interested in exploring if and how digital signal processing (DSP) tools may be forgiving and still detect partial signal matches even when one or more of the sub-sequences that together define a sequence of interest are missing. So yes, your question is very important: can DSP tools detect/discover matches despite large presence-absence variations? And how large is too large, and what is the lowest level of sequence homology below which DSP tools will have unacceptably low sensitivity and/or specificity of match discovery?

Please allow me to pick your brains some more, by sharing my thoughts on 3 related but slightly different sub-topics:

1. Some thoughts on sequence ALIGNMENT

BLAST uses an easy-to-understand scoring model (a substitution matrix plus gap penalties) that determines the score rewarded when a position in an alignment matches, versus the penalties when there is a mismatch due to a substitution, or a gap due to an insertion or deletion. The BLAST alignment and score are BOTH needed for an end-user to make sense of the result. Low-scoring alignments are either not reported or ignored even when reported, and we examine the alignments of interest with the naked eye. Right?
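
The scoring model described above can be sketched with a plain Smith-Waterman local alignment score. This is not BLAST itself, which adds seeding heuristics and statistical significance on top of such a model, and the match, mismatch and gap values below are illustrative assumptions rather than BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-3, gap=-5):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            # Extend diagonally with a match reward or a mismatch penalty.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A gap in either sequence costs `gap`; scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # shared "ACGT": 4 * 2 = 8
```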

I wonder how the average bioinformatics end-user will or can interpret an alignment of digital signals, because we are trained to see how the sequence, which is a polymer, is aligned at the monomer level. So now, instead of an alignment between 2 strings of As, Cs, Gs and Ts, if I see 2 digital signals that are aligned, am I able to interpret it? I am not sure because I've never seen one :)

Or should there be an additional step where poorly aligned digital signals are discarded, and well aligned digital signals are reverse transformed in some manner to a more traditional sequence alignment representation? I know DNA sequence can be transformed to a digital signal, and also inverse transformed the other way. Likewise, is such a transformation and its inverse possible even for an alignment? And your approach for digital signal alignments, do you expect it to work for both pairwise and multiple alignment tasks?
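
For a single sequence, the forward and inverse transform mentioned above can be sketched as follows, assuming the four-indicator encoding and an input containing only A, C, G and T (both are assumptions; the function names are hypothetical). Note that the round trip needs the full complex spectrum: a power spectrum alone discards the phase and cannot be inverted this way. Whether an alignment can be inverted in a similar manner is not addressed by this sketch.

```python
import numpy as np

BASES = "ACGT"


def to_spectrum(seq):
    """Complex DFT of the four indicator signals, shape 4 x N."""
    seq = seq.upper()
    indicators = np.array(
        [[1.0 if ch == base else 0.0 for ch in seq] for base in BASES]
    )
    return np.fft.fft(indicators, axis=1)


def from_spectrum(spectrum):
    """Invert the DFT and read off, per position, the base with the largest value."""
    indicators = np.fft.ifft(spectrum, axis=1).real
    return "".join(BASES[k] for k in indicators.argmax(axis=0))


recovered = from_spectrum(to_spectrum("GATTACA"))
```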

2. Some thoughts on sequence CLUSTERING

For digital signal processing of DNA sequences used for clustering, based on alignment-free methods, I'd be curious to know how the results compare to clustering results obtained from text-based, i.e. regular sequence-based, clustering tools such as CD-HIT or UCLUST etc. See this 2018 paper (link) as an example of a paper that discusses limitations of these approaches and how to improve upon them...
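
One way to set up such a comparison is to keep the clustering scheme fixed and swap only the similarity measure. The sketch below uses a greedy incremental scheme in the spirit of CD-HIT (longest sequence first, each sequence joins the first representative it is similar enough to). It is a simplified illustration, not the CD-HIT or UCLUST algorithm; the k-mer Jaccard function is only a stand-in where a spectral similarity could be plugged in, and all names and thresholds are hypothetical.

```python
def kmer_jaccard(a, b, k=3):
    """Stand-in similarity: Jaccard index of the two k-mer sets."""
    set_a = {a[i : i + k] for i in range(len(a) - k + 1)}
    set_b = {b[i : i + k] for i in range(len(b) - k + 1)}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def greedy_cluster(seqs, similarity, threshold):
    """Greedy incremental clustering; the first member of each cluster is its representative."""
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):  # longest sequences first
        for cluster in clusters:
            if similarity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


groups = greedy_cluster(
    ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC", "TTTTGGGGC"], kmer_jaccard, 0.8
)
```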

3. Is there an overlap in methods for digital signal ALIGNMENT vs. CLUSTERING?

Digital signal clustering can be alignment-free, but does that mean all digital signal clustering algorithms are necessarily alignment-free? If yes, then there would be NO overlap between digital signal alignment and clustering methods, and by extension one would have to come up with new digital signal alignment tools / algorithms, right? But if no, then what / where is the overlap in methods for alignment vs. clustering of digital signals?

I have gathered a few links about DNA sequence digital signal processing and its application in clustering. I'd be happy to share those links if you are interested...

During this online research, I did not find any info on digital signal alignment, so I'd be very interested in exploring your DSP alignment tool when it is ready.

Finally, I wonder if we should take this discussion offline... In any case, I look forward to reading your responses and thoughts. Cheers!