LinearFold / LinearPartition

Linear-Time Approximation of RNA Folding Partition Function and Base Pairing Probabilities

How to predict secondary structures for many RNA sequences from a fasta file? #2

Closed Zjianglin closed 3 years ago

Zjianglin commented 3 years ago

Hi, it seems LinearPartition can only read from a pipe. I have many sequences (some <1000bp and some ranging from 5000bp to 12000bp). How can I process them quickly, for example by reading directly from a FASTA file?
There does not seem to be such an option. (LinearFold has the same problem.)

What's more, for sequences of very different lengths, what parameters should I use (for example, beam size, MEA gamma, threshold, etc.)?

Thank you. I'm looking forward to your reply.

LinearFold commented 3 years ago

Hi @Zjianglin, thanks for your interest in LinearPartition. LinearPartition does have an option to read multiple sequences from a file, and it is listed in the README:

cat SEQ_OR_FASTA_FILE | ./linearpartition [OPTIONS]

Both FASTA format and pure-sequence format are supported for input.

Recommended parameters are: beam size 100, MEA gamma 1.5 and ThreshKnot threshold 0.3 for LinearPartition-V; and beam size 100, MEA gamma 3 and ThreshKnot threshold 0.2 for LinearPartition-C. These parameters are not sensitive to sequence length, so you can use them for all.
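
The recommendations above can be kept in a small lookup table so that batch scripts pick the same values for every sequence. This is only a sketch: the names RecommendedParams, RECOMMENDED and recommended_params are hypothetical, and the mapping from these values to command-line options is deliberately left out (check the README for the exact option names).

```python
from typing import Dict, NamedTuple


class RecommendedParams(NamedTuple):
    """Values quoted in this thread; the field names are illustrative."""
    beam_size: int
    mea_gamma: float
    threshknot_threshold: float


# Keys are the model suffixes used in the thread:
# "V" for LinearPartition-V, "C" for LinearPartition-C.
RECOMMENDED: Dict[str, RecommendedParams] = {
    "V": RecommendedParams(beam_size=100, mea_gamma=1.5, threshknot_threshold=0.3),
    "C": RecommendedParams(beam_size=100, mea_gamma=3.0, threshknot_threshold=0.2),
}


def recommended_params(model: str) -> RecommendedParams:
    """Return the recommended settings for model 'V' or 'C' (case-insensitive)."""
    key = model.strip().upper()
    if key not in RECOMMENDED:
        raise ValueError(f"unknown model {model!r}; expected 'V' or 'C'")
    return RECOMMENDED[key]
```

Because the values are stated to be insensitive to sequence length, the lookup takes only the model as input.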

Thanks!

Zjianglin commented 3 years ago

Hi @LinearFold , thanks for your reply.

I tried cat SEQ_OR_FASTA_FILE | ./linearpartition [OPTIONS] on some demo sequences. However, linearpartition (and linearfold) seems to process the input line by line, as shown below:

$ cat demo.fa 
>MT354616_UTR5
AGAUUUUCUUGCACGUGCGUGCGAUUGCUUCAGACAGCAGUAGCAGCGGCAGAGUUGGCA
GAGAGACUUACUCACGUCGACCAGUCGUGAACGUGUUGAGGAAAAGACAGCUUAGGAGAA
CAAGAGCUGGGA
>MT354615_UTR5
AGAUUUUCUUGCACGUGCGUGCGCUUGCUUCAGACAGCAAUAGCAGCGGCAGGUUUGGUG
GAGGGAAUUGCCCGCAUCAGCCAGUCGUGAACGUGUUGAGAAAAAGACAGCUUAGGAGAA
CAAGAGCUGGGG
###############################
$ cat demo.fa | linearfold -V 
>MT354616_UTR5
AGAUUUUCUUGCACGUGCGUGCGAUUGCUUCAGACAGCAGUAGCAGCGGCAGAGUUGGCA
.....(((.(((.(((...(((.((((((......)))))).))))))))))))...... (-15.70)
GAGAGACUUACUCACGUCGACCAGUCGUGAACGUGUUGAGGAAAAGACAGCUUAGGAGAA
......((((..(((((..((.....))..))))).)))).................... (-9.50)
CAAGAGCUGGGA
............ (-0.00)
>MT354615_UTR5
AGAUUUUCUUGCACGUGCGUGCGCUUGCUUCAGACAGCAAUAGCAGCGGCAGGUUUGGUG
.......(((((.(((...(((..(((((......)))))..)))))))))))....... (-16.40)
GAGGGAAUUGCCCGCAUCAGCCAGUCGUGAACGUGUUGAGAAAAAGACAGCUUAGGAGAA
(.(((.....))).).........((.(((.(.((((........))))).))).))... (-8.50)
CAAGAGCUGGGG
............ (-0.00)
#############################################
$ cat demo.fa | linearpartition -V -M
>MT354616_UTR5
Free Energy of Ensemble: -17.32 kcal/mol
AGAUUUUCUUGCACGUGCGUGCGAUUGCUUCAGACAGCAGUAGCAGCGGCAGAGUUGGCA
.....(((.(((.(((...(((.((((((......)))))).))))))))))))......

Free Energy of Ensemble: -10.50 kcal/mol
GAGAGACUUACUCACGUCGACCAGUCGUGAACGUGUUGAGGAAAAGACAGCUUAGGAGAA
......((((..(((((..((.....))..))))).))))....................

Free Energy of Ensemble: -0.05 kcal/mol
CAAGAGCUGGGA
............

>MT354615_UTR5
Free Energy of Ensemble: -17.28 kcal/mol
AGAUUUUCUUGCACGUGCGUGCGCUUGCUUCAGACAGCAAUAGCAGCGGCAGGUUUGGUG
.......(((((.(((...(((..(((((......)))))..))))))))))).......

Free Energy of Ensemble: -9.99 kcal/mol
GAGGGAAUUGCCCGCAUCAGCCAGUCGUGAACGUGUUGAGAAAAAGACAGCUUAGGAGAA
..(((.....)))((....))...((.(((.(.((((........))))).))).))...

Free Energy of Ensemble: -0.05 kcal/mol
CAAGAGCUGGGG
............

It predicted RNA structures line by line, treating each wrapped line as a separate sequence. But I want to know the secondary structure of each complete sequence, especially for the long ones. Did I use the wrong options? Or how can I get the structure of a whole sequence? Thank you.

LinearFold commented 3 years ago

Thanks for your suggestions @Zjianglin , we'll update the code base to allow such input shortly.
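
For versions that still read one sequence per line, one possible stopgap is to join each record's wrapped lines into a single line before piping. The sketch below is a minimal illustration and not part of LinearPartition or LinearFold; the function name unwrap_fasta and the script name unwrap_fasta.py are hypothetical, and it assumes every record starts with a ">" header line.

```python
import sys
from typing import Iterable, Iterator, List, Optional, Tuple


def unwrap_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Sequence lines that appear before the first header are discarded.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the final record after the input ends.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Read wrapped FASTA on stdin, write one header line and one
    # sequence line per record on stdout.
    for record_header, sequence in unwrap_fasta(sys.stdin):
        print(record_header)
        print(sequence)
```

Saved as unwrap_fasta.py, it could be used as: python unwrap_fasta.py < demo.fa | ./linearpartition -V -M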

LinearFold commented 2 years ago

FASTA input is now supported.