Lattice-Automation / seqfold

nucleic acid folding
MIT License

Folding visualizations and UNAfold differences #5

Closed ryandikdan closed 3 years ago

ryandikdan commented 3 years ago

Hello and thanks so much for writing accessible code! It's well needed.

I am coming from using UNAfold online, which is luckily still around on IDT's website right now, but I worry that it won't necessarily be there forever. I'm also interested in coding with these programs so UNAfold is out, since the owner of it hasn't responded to anything and I can't find their code anywhere. But when I was using seqfold, I noticed a bunch of differences. Is there a good explanation for this? I trust you, but I'm just not familiar with exactly where all the ambiguity may be coming from. Do you believe that your numbers are more accurate?

Back onto using UNAfold, is there a feature where I can see the folding that occurs? When I run the fold command in seqfold I just see the base pairing that will occur. Is this for the most likely structure? When I run UNAfold, I get the dG and base pairing for multiple different folded structures.

I also think a brief manual explaining what i and j are would definitely help a bit. Again thanks for all your code and have a nice day!

jjti commented 3 years ago

Do you have some examples of the differences, like which results were surprising? If not, I pasted some in the examples section for reference too: https://github.com/Lattice-Automation/seqfold/blob/master/examples/dna.csv

At the highest level, the differences are likely the result of different energy functions. The fold is driven by experimentally determined energy values (the ddGs), which are fed in here: https://github.com/Lattice-Automation/seqfold/blob/master/seqfold/dna.py. Small variations in those add up over the course of the fold.

> When I run the fold command in seqfold I just see the base pairing that will occur. Is this for the most likely structure?

Yes, that's for what seqfold thinks is the most likely/frequent/lowest-energy structure.

Re: showing the folds/dGs of multiple structures: I don't really have the bandwidth to add that feature, but it would involve traversing the energy matrix with some kind of heuristic to find paths other than the "optimal" one. IIRC, the UNAFold authors published a paper where they described how they were doing it.
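To make "traversing the energy matrix" concrete, here is a minimal sketch using a toy Nussinov-style model that simply counts Watson-Crick pairs. It is not seqfold's algorithm or energy model; the names `fill`, `traceback`, `PAIRS`, and `MIN_LOOP` are hypothetical and exist only for this illustration.

```python
# Toy model (an assumption for illustration): score = number of
# Watson-Crick pairs. Real folders minimize free energy instead.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
MIN_LOOP = 3  # minimum number of unpaired bases inside a hairpin


def fill(seq):
    """Return matrix m where m[i][j] is the max pair count within seq[i..j]."""
    n = len(seq)
    m = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = m[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    right = m[k + 1][j] if k < j else 0
                    best = max(best, 1 + m[i + 1][k - 1] + right)
            m[i][j] = best
    return m


def traceback(seq, m):
    """Follow one optimal path through m and return its (i, j) pairs."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain a pair
        if m[i][j] == m[i + 1][j]:
            stack.append((i + 1, j))  # optimum reachable with i unpaired
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            if (seq[i], seq[k]) not in PAIRS:
                continue
            right = m[k + 1][j] if k < j else 0
            if m[i][j] == 1 + m[i + 1][k - 1] + right:
                pairs.append((i, k))  # take the first branch that is optimal
                stack.append((i + 1, k - 1))
                if k < j:
                    stack.append((k + 1, j))
                break
    return sorted(pairs)


seq = "GGGAAAACCC"
matrix = fill(seq)
print(traceback(seq, matrix))  # [(0, 9), (1, 8), (2, 7)]
```

The traceback above always takes the first branch that reproduces the optimal score. Reporting alternative structures would presumably mean also following branches whose score falls within some tolerance of the optimum, which is where a heuristic comes in.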

> I also think a brief manual explaining what i and j are would definitely help a bit

i and j are just two common variable names used when looping. My guess is that i was the most popular var name, short for index, and then j was used because it's alphabetically after. k is then used for triply nested loops. It also shows up in one of the seminal papers in this space:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC350273/
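In folding output, i and j usually double as positions in the sequence: a row with a given i and j says that the base at position i pairs with the base at position j. As a rough sketch of how such pairs can be turned into something visual, the snippet below renders them in dot-bracket notation. It assumes 0-based indices with i < j, which is not confirmed in this thread, and `render_dot_bracket` is a hypothetical helper, not part of seqfold's API.

```python
def render_dot_bracket(length, pairs):
    """Render (i, j) base pairs as a dot-bracket string of the given length.

    Assumes 0-based positions with i < j (an assumption for this sketch).
    """
    out = ["."] * length  # start with every base unpaired
    for i, j in pairs:
        if not 0 <= i < j < length:
            raise ValueError(f"invalid pair ({i}, {j}) for length {length}")
        out[i] = "("  # opening side of the pair
        out[j] = ")"  # closing side of the pair
    return "".join(out)


seq = "GGGAAAACCC"
pairs = [(0, 9), (1, 8), (2, 7)]  # made-up pairs for a small hairpin
print(seq)
print(render_dot_bracket(len(seq), pairs))  # (((....)))
```

Reading the two printed lines together shows the stem (the three G-C pairs) and the unpaired loop between them.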

At a high level though, re: whether to use UNAFold or seqfold, I'd say that it comes down to what you're using it for. If you're doing research on single RNA/DNA molecules and you have access to UNAFold, I'd just go with that because everyone knows what it is. But both it and seqfold are making high-level, super imperfect guesses at secondary structure. Better predictions would need to be in 3D, not 2D, but that's computationally intractable, so all this DNA/RNA folding stuff is about finding a happy medium between speed and reasonable accuracy. I don't think that seqfold's results are different enough from UNAFold's to worry about but, again, it depends on the use case.

jjti commented 3 years ago

Lmk if you want to talk more about whether seqfold fits your use case, happy to talk/write more about it.