Rfam / rfam-docs

Rfam documentation and help hosted on ReadTheDocs
https://docs.rfam.org
Apache License 2.0
6 stars 9 forks source link

How to get SS_cons when building my own cm file? #58

Open wonwill opened 2 weeks ago

wonwill commented 2 weeks ago

I am trying to analyze tRNAs of non-model organisms (parasites) because I found out that Rfam doesn't include the tRNA sequences of my interest. I aligned all sequences of my interest and then tried to convert it to Stockholm format having WUSS notations. However, I can't find how to convert WUSS notations and then add SS_cons.

This case has been quite challenging for me over the past week. Could you help me out with this?

nancyontiveros commented 2 weeks ago

Hi Wonwill,

I understand that you have an alignment and a separate secondary structure (SS) in WUSS format, and you want to combine them.

Here are some options depending on your specific goal:

We recommend using Infernal https://github.com/brendanf/inferrnal for this task. You'll likely need a FASTA file containing your sequences and the Rfam family CM file.

Use R-scape https://github.com/EddyRivasLab/R-scape to retrieve both the alignment and the SS_cons with the most covaring base pairs represented.

If either of these scenarios describes your situation, please let me know.

For further assistance:

Clarify your goal: Briefly describe your SS source. Provide an example: A snippet of your alignment and the SS_cons, and CM file (if applicable) would be very helpful.

wonwill commented 2 weeks ago

Thanks for your kind and quick reply.

a separate secondary structure (SS) in WUSS format

Actually, not. I have only an alignment file (ClustalW). I don't know how to get individual or consensus SS in WUSS format.

I am trying to analyze the tRNA of Echinostoma sp. (Parasitic fluke) as the attached example file. E_trnaA_v2.aln.txt

  1. My first goal is to build my own Echinostoma tRNA CM file.
  2. The second goal is to add my own CM file to a typical RFam CM file.
  3. I would like to compare two results because it is currently difficult to annotate my mitogenome.
nancyontiveros commented 2 weeks ago

It seems like you're aiming to predict secondary structures (SS_cons) for your specific alignment, rather than simply annotating the existing Rfam families SS_cons.

Here are two potential approaches:

R-scape: This tool is excellent for predicting secondary structures from alignments. You can input your alignment and R-scape will generate the corresponding SS_cons.

RNAalifold: https://www.tbi.univie.ac.at/RNA/RNAalifold.1.html This tool from the ViennaRNA package can also be used for secondary structure prediction, especially for small alignments.

Given that tRNA secondary structures are highly conserved, I'd recommend relying on the Rfam SS_cons for accurate predictions. You can use Infernal to create a Stockholm file containing both the alignment and the Rfam SS_cons.

cykim1029 commented 2 weeks ago

Thanks for your comment.

We tested two approaches using the Rfam alignment having only 4 sequences. https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/full_alignments/RF04010.sto The SS_cons was removed from the original file. RF04010_without_SS_cons.sto.txt This file was used for our test. However, the problem still happens.

1) R-scape output: postion X, WUSS O The test data output showed different SS_cons from the original Rfam file (under E-value of 0.05).

#RF04010.sto
:::<<<<<<<-<<<<<<<<<---<<<<<<<<<<<<<____>>>>>>>>>>>>>--->>>>>>>>>->>>>>>>:::
:::<<<<<<<-<<<<<<<<<<<<<<<<<<<<<<<<______>>>>>>>>>>>>>>>>>>>>>>>>->>>>>>>:::
#test data

2) RNAalifold output: postion O, WUSS X SS_cons was generated in dot-bracket only, Not WUSS notation. The SS positions seem to be identical between two alignments. Unfortunately, we have no idea for conversion from dot-bracket to WUSS format.

#RF04010.sto
:::<<<<<<<-<<<<<<<<<---<<<<<<<<<<<<<____>>>>>>>>>>>>>--->>>>>>>>>->>>>>>>:::
...(((((((.((((((((((...((((((((((((....))))))))))))...)))))))))).)))))))...
#test data

How can we get the identical SS_cons outputs between two alignments?

nancyontiveros commented 2 weeks ago

Hi both,

I'm currently on leave today, I'll review the alignment and tests you performed tomorrow and come back

From what I can see, each of your secondary structures (SS) represent a stem-loop, which is consistent with the SS_cons. While this isn't necessarily incorrect, I understand your look for an identical SS.

If a video call would be helpful, I'm happy to hop on a video call if that's easier for explaining what you are looking for. Just let me know.

Nancy

nancyontiveros commented 1 week ago

Hi,

Regarding

1) R-scape output.

R-scape tends to predict a large number of covarying base pairs, including non-canonical pairs and coaxial interactions. This can sometimes exceed the level of complexity found in Rfam families. To align with Rfam's criteria, you might use the --Rfam option in R-scape.

I've tested R-scape on your 4-sequence Hydra vulgaris alignment, using the Rfam SEED and the --Rfam option. Please find the results attached.

# STOCKHOLM 1.0
# Rfam RF04010 SEED

URS000075BBC7_6087/2-77 AGCTATTTATATCATTACCAAATTGCACACTTAACATAACTGTTAAATGTGCAAACTGGTAATGAAATAAATACTG
URS0000D558DF_6087/1-76 TGCTATTTATATCATTACCAGTGTGCACATTTAACATTATTGTTAAATGTGCGAACTGGTAATGAAATAAATACTG
#=GC SS_cons            :::<<<<<<<-<<<<<<<<<<<<<<<<<<<<<<<<______>>>>>>>>>>>>>>>>>>>>>>>>->>>>>>>:::
#=GC RF                 uGCuauuuauAuCauuaCCaauuuGCaCauuuaaCaUuAuuGuuaaauGuGCaAACuGGuaauGaAauaaauaCUG
//

# STOCKHOLM 1.0
# Hydra vulgaris

#=GS NW_004171009.1/56862-56787 DE Hydra vulgaris strain 105 unplaced genomic scaffold, Hydra_RP_1.0 HYDRAscaffold_35197, whole genome shotgun sequence
#=GS NW_004170022.1/71698-71773 DE Hydra vulgaris strain 105 unplaced genomic scaffold, Hydra_RP_1.0 HYDRAscaffold_36186, whole genome shotgun sequence
#=GS NW_004170022.1/71773-71698 DE Hydra vulgaris strain 105 unplaced genomic scaffold, Hydra_RP_1.0 HYDRAscaffold_36186, whole genome shotgun sequence
#=GS NW_004171009.1/56787-56862 DE Hydra vulgaris strain 105 unplaced genomic scaffold, Hydra_RP_1.0 HYDRAscaffold_35197, whole genome shotgun sequence

NW_004171009.1/56862-56787 AGCUAUUUAUAUCAUUACCAAAUUGCACACUUAACAUAACUGUUAAAUGUGCAAACUGGUAAUGAAAUAAAUACUG
NW_004170022.1/71698-71773 UGCUAUUUAUAUCAUUACCAGUGUGCACAUUUAACAUUAUUGUUAAAUGUGCCAACUGGUAAUGAAAUAAAUACUG
NW_004170022.1/71773-71698 CAGUAUUUAUUUCAUUACCAGUUGGCACAUUUAACAAUAAUGUUAAAUGUGCACACUGGUAAUGAUAUAAAUAGCA
NW_004171009.1/56787-56862 CAGUAUUUAUUUCAUUACCAGUUUGCACAUUUAACAGUUAUGUUAAGUGUGCAAUUUGGUAAUGAUAUAAAUAGCU
#=GC SS_cons               :::<<<<<<<-<<<<<<<<<<<<<<<<<<<<<<<<______>>>>>>>>>>>>>>>>>>>>>>>>->>>>>>>:::
//

RF04010_test_Hydra_vulgaris.tar.gz

nawrockie commented 1 week ago

@cykim1029 : infernal's cmbuild program does not require the SS_cons annotation be in WUSS format. The dot parentheses notation will work just fine.

To build a CM starting with your clustal alignment, do the following steps:

  1. install infernal 1.1.5 (http://eddylab.org/infernal/)

  2. convert your clustal alignment to 'pfam' format (a special kind of stockholm format in which each sequence is on a single line - this will make it easier to add your SS_cons notation in step 3), by running the following command: infernal-1.1.5/easel/miniapps/esl-reformat pfam in.clustal > out.pfam

  3. Manually modify the out.pfam file in a text editor by adding your dot parantheses structure prediction from RNAalifold after the final sequence line (but before the // line) with the prefix #=GC SS_cons, like this: #=GC SS_cons ...((((...))))... And then save the file. You could also add the R-scape predicted structure here. It is not a big problem that this is different from the Rfam structure in the RF04010 test you performed. As @nancyontiveros wrote above, R-Scape may predict additional or different basepairs from what the Rfam structure has, this is because predicting the correct structure is a difficult problem and different approaches will yield slightly different results. Rfam uses structures from different sources and from a variety of prediction methods.

  4. Build a CM with the command: infernal-1.1.5/src/cmbuild -F my.cm out.pfam The output should indicate that there are N>0 basepairs in your model in the bps field.