Adding templates for visualization of ITS2 secondary structures for evolutionary comparison

janstrauss1 commented 3 years ago

Dear @AntonPetrov et al,

I am trying to use R2DT to visualize secondary structures of non-coding internal transcribed spacer 2 (ITS2) sequences in a standard layout for evolutionary/phylogenetic analyses (e.g. see Refs. https://doi.org/10.1016/S0168-9525(03)00118-5, https://doi.org/10.3390/ijms21176395).

Yet, it appears that the R2DT template library doesn't contain any manually curated templates representing ITS2 secondary structures, because I keep receiving the error "The sequence did not match any of the templates." when using the R2DT web application.

For instance, this happens when I use the GenBank accession AF457015 or ITS2 sequences extracted from the ITS2 Database at http://its2.bioapps.biozentrum.uni-wuerzburg.de/ (Ref. https://doi.org/10.1093/molbev/msv174).

Therefore, I wonder if it would be possible to add templates for ITS2 secondary structures including the common core ITS2 secondary structure (e.g. see Refs. https://doi.org/10.1093/nar/27.23.4533, https://doi.org/10.1261/rna.7204505) to allow for consistent visualization of short (<500 bp) ITS secondary structures using R2DT?

Some manually curated (and partially experimentally supported) ITS2 ring-pin model structures could be extracted from the following resources: https://doi.org/10.1016/j.tig.2015.01.002 https://doi.org/10.1038/s41467-017-00761-8 https://doi.org/10.3390/ijms21176395 https://doi.org/10.1038/s41396-021-00989-9 (may also be helpful: https://doi.org/10.1038/nature26156)

Would be happy to hear how you feel about this.

Many thanks in advance for any feedback!

AntonPetrov commented 3 years ago

@janstrauss1 Thank you for your interest in R2DT Jan!

The R2DT library indeed does not have any ITS2 templates. I will look into the papers you mentioned to see if they contain any structural alignments that could be used to create one or more Rfam family for ITS2. Do you happen to know of any resources or experts who could help with that?

Alternatively, we could just create a few templates for some specific ITS2 instances by hand and hope that this is enough. Creating an Rfam family would provide a more comprehensive coverage though.

janstrauss1 commented 3 years ago

Hi @AntonPetrov, many thanks for your quick reply!

I guess a good resource would be the ITS2 database that is available at http://its2.bioapps.biozentrum.uni-wuerzburg.de/.

Also, I think that people from the Department of Bioinformatics, University of Wuerzburg (https://github.com/BioInf-Wuerzburg), including @greatfireball, @iimog, and @chiras, as well as the lab of Matthias Wolf, who develop and maintain the database and work on alignments and RNA secondary structures (especially of the ITS2), are probably the best experts to help with building an Rfam family for ITS2.

Hope this helps.

iimog commented 3 years ago

Hi there :wave: I used to work on an ITS2 database update a couple of years ago, but I'm still not an ITS2 expert. Matthias Wolf most certainly is. You can use the ITS2 database web interface to collect arbitrary sequences in the sequence pool (and even add your own) and create sequence-structure alignments for the pool. Maybe this helps to generate profiles for certain taxonomic groups.

ITS2 workbench

AntonPetrov commented 3 years ago

Thanks a lot @iimog and @janstrauss1! I will get in touch with Matthias Wolf and discuss the creation of an Rfam family for ITS2. Once it's done, generating an R2DT template should be straightforward.

I will post any updates here. Hopefully we can make fast progress on this 🤞

AntonPetrov commented 3 years ago

I had a chat with Matthias Wolf, who pointed out that ITS2 evolves very quickly and one would need a large number of different models to capture the diversity.

I still think that one could build a relatively small number of Rfam ITS2 models but their consensus 2D structures would be significantly underfolded compared to the individual sequences, and it sounds like the ITS2 website would do a better job generating 2Ds for individual ITS2 sequences than R2DT.

@janstrauss1 Jan, if you are interested in visualising some specific sequences, I am happy to look into this further and make some custom R2DT templates for you. Otherwise I am leaning towards closing this issue in favour of using the ITS2 website for this specific class of RNAs.

janstrauss1 commented 3 years ago

@AntonPetrov, many thanks for following up!

I fully understand that it's probably too difficult to implement the number of models sufficient to capture the full diversity of ITS2 secondary structures.

Yet, I also think that it could still be very useful to implement a number of Rfam ITS2 models in R2DT to at least be able to consistently visualize the common core ITS2 secondary structure and some major derivations for major taxonomic groups (including model organisms). Table 1 of Ref https://doi.org/10.1093/nar/gkm233 gives a good overview of taxonomic groups for which it could be interesting to build Rfam ITS2 models that capture a wide range of 2D structures.

I guess what should definitely be included are models for the partially experimentally supported ITS2 ring-pin model structure of yeast (see Refs. https://doi.org/10.1016/j.tig.2015.01.002; https://www.nature.com/articles/s41467-017-00761-8/figures/6).

For my own research I'd be most interested in visualising ITS2 secondary structures from marine protists, particularly prasinophyte green algae (e.g. see Refs https://doi.org/10.1016/j.protis.2017.09.002; https://www.nature.com/articles/s41396-021-00989-9/figures/5; https://doi.org/10.1093/nar/gkm233).

Please let me know how many different Rfam ITS2 models/R2DT templates you think would be possible for you to implement and how to best proceed?

AntonPetrov commented 3 years ago

@janstrauss1 It looks like for comprehensive coverage Rfam would need at least as many models as there are rows in Table 1. Building that many ITS2 structural alignments would be a big project requiring some expert help.

Why don't we start with creating a ring-pin yeast model and maybe a green algae model for your specific case? If you have an alignment of your favourite ITS2 sequences with a consensus secondary structure in Stockholm format, that would really speed things up.

I will be away next week but can look into it again soon after I am back. Thanks again for your interest in Rfam and R2DT!

janstrauss1 commented 3 years ago

@AntonPetrov, many thanks for your feedback! That sounds very sensible to me.

However, I'm not that familiar with Stockholm format alignments yet. Could you maybe give some directions on how best to generate an alignment of ITS2 sequences with secondary structure in Stockholm format? Which tools would you recommend or use in your own workflow?

I'm currently thinking of making multiple sequence alignments using MAFFT and then using RNAalifold to predict the consensus secondary structure...

janstrauss1 commented 3 years ago

@AntonPetrov, managed to do some multiple sequence alignments of my favourite green algae ITS2 sequences using MAFFT and generate a consensus secondary structure with RNAalifold.

Not really sure though if it's already in valid Stockholm format as you need it. Anyways, how can I best send the files to you to have a look?

I will be away for the next week but will be able to look into it again in two weeks.

Thanks again for your help and support!

AntonPetrov commented 3 years ago

Thank you Jan! Your approach sounds sensible to me. Another way would be to download the aligned sequences and structures from the ITS2 database and manually construct a consensus structure.

Could you please paste a link to the alignment here in case other folks want to take a look as well? I will check out the files when I am back on Monday. Thanks again!

janstrauss1 commented 3 years ago

Many thanks for your feedback @AntonPetrov!

I'm back online and have just created a new repository to share some files for this issue.

You can find my current MAFFT alignment file at ITS2_aln.fas and my RNAalifold output file at ITS2_alifold.out.

Yet, as I mentioned my experience with this is limited and I'm not fully convinced by the RNA secondary structure predictions from RNAalifold but hope it'll help to move things forward?!

Your suggestion to download aligned sequences and structures from the ITS2 database and manually construct a consensus structure sounds very intriguing. However, I'm not fully clear how to best do this. Do you maybe have some useful information/tutorial with some examples/ best practices to manually curate consensus structures that you could share?

AntonPetrov commented 3 years ago

Thanks Jan! I had a look at the .fas and .out files and did not immediately see how to convert them to Stockholm format that is needed for Rfam. Here is an example file showing what I am after.
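For readers unfamiliar with the format, here is a minimal Python sketch of what a Stockholm alignment with a consensus structure line looks like. The sequences, names, and structure are toy data invented for illustration, not taken from the files in this thread, and `to_stockholm` is a hypothetical helper, not part of Rfam, Infernal, or Easel.

```python
def to_stockholm(aligned, ss_cons):
    """Format aligned sequences plus one consensus structure as Stockholm text.

    aligned: dict mapping sequence name -> gapped sequence (all the same length)
    ss_cons: consensus structure string, one character per alignment column
    """
    length = len(ss_cons)
    for name, seq in aligned.items():
        if len(seq) != length:
            raise ValueError(f"{name} has length {len(seq)}, expected {length}")
    tag = "#=GC SS_cons"
    width = max(len(label) for label in list(aligned) + [tag])
    lines = ["# STOCKHOLM 1.0", ""]
    for name, seq in aligned.items():
        lines.append(f"{name.ljust(width)} {seq}")
    lines.append(f"{tag.ljust(width)} {ss_cons}")
    lines.append("//")  # end-of-alignment marker
    return "\n".join(lines) + "\n"


# Toy data (hypothetical): three short aligned sequences and one hairpin.
toy_alignment = {
    "seq1/1-12": "GGCAUUCGUGCC",
    "seq2/1-11": "GGCA-UCGUGCC",
    "seq3/1-12": "GACAUUAGUGUC",
}
toy_ss_cons = "<<<<....>>>>"

stockholm_text = to_stockholm(toy_alignment, toy_ss_cons)
print(stockholm_text)
```

The key points are the `# STOCKHOLM 1.0` header, one line per gapped sequence, a `#=GC SS_cons` line with exactly one structure character per alignment column, and the closing `//`.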

Regarding the ITS2 database, I found the Jove video very informative.

The basic idea is to select some ITS2 sequences using the taxonomy tree browser, then drag them to the Pool area on bottom left, then click Analyze dataset and select Sequence & structure. The tool will show an alignment of all the selected sequences and their individual secondary structures in the dot bracket format:

The idea is that the sequences are now aligned and can be used for the Stockholm file directly, but the multiple individual structures need to be converted to a single consensus structure that reflects the structural features common to all family members. This would require some manual tinkering + the most important part is how to select the sequences for the alignment. If they are too diverse, the alignment won't be very good, and if they are too similar, then there is not enough useful variability. This is where your ITS2 expert judgement would come into play!
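One simple way to turn several individual structures into a single consensus is a majority vote over base-paired alignment columns. The sketch below assumes the dot-bracket strings are already gapped to the alignment length (so column indices are comparable) and contain only `(`, `)` and unpaired or gap characters. It is an illustration of the idea only, not the algorithm used by any particular tool, and the structures are toy data.

```python
from collections import Counter


def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) column pairs encoded by '(' and ')'."""
    stack, pairs = [], set()
    for col, char in enumerate(structure):
        if char == "(":
            stack.append(col)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.add((stack.pop(), col))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def majority_consensus(aligned_structures):
    """Keep every base pair found in strictly more than half of the structures.

    Any two pairs that each pass a strict majority co-occur in at least one
    input structure, so the result is nested as long as the inputs are.
    """
    length = len(aligned_structures[0])
    if any(len(s) != length for s in aligned_structures):
        raise ValueError("structures must all have the alignment length")
    counts = Counter()
    for structure in aligned_structures:
        counts.update(pairs_from_dot_bracket(structure))
    consensus = ["."] * length
    for (i, j), count in counts.items():
        if count * 2 > len(aligned_structures):
            consensus[i], consensus[j] = "(", ")"
    return "".join(consensus)


# Toy structures (hypothetical), one per aligned sequence.
toy_structures = [
    "((((....))))",
    "(((......)))",
    "((........))",
]
print(majority_consensus(toy_structures))
```

A vote like this tends to produce the "underfolded" consensus mentioned earlier in the thread when the selected sequences are structurally diverse, which is one reason the choice of sequences matters so much.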

nawrockie commented 3 years ago

@AntonPetrov and @janstrauss1 : the esl-construct program installed with HMMER or Infernal (http://eddylab.org/infernal/) using the -x option can make a consensus structure from individual secondary structures but you'd need the alignment in Stockholm format with individual sequence SS annotation first.
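As a rough illustration of what "individual sequence SS annotation" means in a Stockholm file, the Python sketch below builds a `#=GR <name> SS` line for one sequence. It assumes the structure was given for the ungapped sequence and must be spread over the gapped alignment row; the helper names and data are hypothetical. Whether parentheses can be used as-is or should be converted to angle brackets is worth checking against the Infernal user guide.

```python
def project_structure(aligned_seq, structure, gap_chars="-."):
    """Spread an ungapped dot-bracket string over a gapped aligned sequence.

    Gap columns receive '.', all other columns take the next structure symbol.
    """
    residues = sum(1 for char in aligned_seq if char not in gap_chars)
    if residues != len(structure):
        raise ValueError(
            f"structure has {len(structure)} symbols but sequence has "
            f"{residues} residues"
        )
    symbols = iter(structure)
    return "".join(
        "." if char in gap_chars else next(symbols) for char in aligned_seq
    )


def gr_ss_line(name, aligned_seq, structure):
    """Build one per-sequence structure annotation line for a Stockholm file."""
    return f"#=GR {name} SS {project_structure(aligned_seq, structure)}"


# Toy data (hypothetical): an 11-nt sequence aligned with one gap column.
line = gr_ss_line("seq2/1-11", "GGCA-UCGUGCC", "((((...))))")
print(line)
```

In a complete file each `#=GR` line would sit next to the sequence row it annotates, between the `# STOCKHOLM 1.0` header and the closing `//`.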

I also have some scripts that I used to make consensus structure annotated rRNA alignments from CRW data, but they are pretty specific to the CRW formats (e.g. bpseq files) https://github.com/nawrockie/crw-conversion-tools

All that said, I may be able to help convert the data shown on the screenshot above to a consensus secondary structure annotated Stockholm alignment, but would require some type of text file output, is it possible to export the data from the ITS2 database into aligned fasta or dot-bracket format of some kind?

janstrauss1 commented 3 years ago

@AntonPetrov and @nawrockie, many thanks for your super helpful feedback!

After some google searching, hoping to find a good approach on how to build structurally annotated RNA multiple sequence alignments in stockholm format, I was actually already looking at the related esl-reformat from HMMer yesterday and considered using it to convert fasta alignments to stockholm format. Yet, I wasn't really sure how to best include the structural information apart from adding it manually, and as I understand from your comments, there doesn't seem to be a convenient tool or generally recommended way on how to accomplish this.

Anyways, I was just tinkering with the ITS2 database and able to export some example text file output from data as shown in the screenshot above that contains ITS2 sequences and dot-bracket annotations. Unfortunately, the database seems to be currently down though, so I wasn't able to further explore and select more variable sequences.

Yet, @nawrockie, if you would be able to help with converting such example text file output from the ITS2 database to a consensus secondary structure annotated Stockholm alignment that would be great!

nawrockie commented 3 years ago

@janstrauss1 are you able to export the sequences with gaps as an alignment? It looks to me like the example text file you provided is not an alignment (unless they are all the same length and there are zero gaps).
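A quick check along these lines can be sketched in Python. The function name and the toy sequences are hypothetical. Note that equal lengths with no gaps are consistent both with a gap-free alignment and with unaligned sequences that merely happen to share a length, so the check can rule an alignment out but not confirm one.

```python
def describe_alignment(sequences, gap_chars="-."):
    """Report whether the rows share one length and whether any row has gaps.

    sequences: dict mapping sequence name -> sequence string
    """
    lengths = {len(seq) for seq in sequences.values()}
    has_gaps = any(
        char in gap_chars for seq in sequences.values() for char in seq
    )
    return {"equal_length": len(lengths) == 1, "has_gaps": has_gaps}


# Toy data (hypothetical).
gapped = {"seqA": "ACG-UACG", "seqB": "ACGGUACG"}
unequal = {"seqA": "ACGUACG", "seqB": "ACGGUACG"}

print(describe_alignment(gapped))
print(describe_alignment(unequal))
```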

janstrauss1 commented 3 years ago

@nawrockie, yes, sorry, I already realized that the initial example text file that I uploaded wasn't a good example! The sequences were indeed highly similar (if not identical) and of the same length, so the alignment contained no gaps.

Since I have been able to resolve my intermittent problems accessing the ITS2 Database, I have uploaded a new example structure-sequence alignment file that I exported from the database. It should make a better example.

Hope you'll be able to convert such text file output to a consensus secondary structure annotated Stockholm alignment.

nawrockie commented 3 years ago

@janstrauss1 - thanks! I should be able to look into this next week.

nawrockie commented 2 years ago

Hi @janstrauss1

I wrote a simple perl script that converts your ITS2db_prasinophytes_example_alignment.txt file to Stockholm format with individual sequence SS annotation and attached it here. its2-to-stk.pl.txt

(Note that I had to add .txt suffix to get github to allow me to attach it so you may want to remove the .txt before using.)

The formatting of the output is not pretty, but if you pipe it into esl-reformat it will format it nicely. The esl-reformat is a 'miniapp' from the Easel sequence analysis library that is installed with HMMER (http://hmmer.org/) or infernal (http://eddylab.org/infernal/).

For example:

perl its2-to-stk.pl ITS2db_prasinophytes_example_alignment.txt | esl-reformat -informat pfam pfam - > ITS2db_prasinophytes_example_alignment.stk

You can then use that stockholm file as input to esl-construct (another easel miniapp) to calculate consensus secondary structures using different options. For example:

esl-construct -x -o ITS2db_prasinophytes_example_alignment.consensus.stk ITS2db_prasinophytes_example_alignment.stk

You may want to play around with esl-construct and look at the available options with esl-construct -h and read the esl-construct man page (installed with infernal here: infernal-1.1.4/easel/miniapps/esl-construct.man).

Let me know if you have questions. I hope this helps!

janstrauss1 commented 2 years ago

Hi @nawrockie,

many thanks for your great help! That works very nicely!

Yet, for future reference, to make the pipe of the its2-to-stk.pl script into esl-reformat work, the esl-reformat option -informat needs to be changed to --informat to avoid the error message Failed to parse command line: No such option "-i".

Not sure if it makes a difference when installing esl-reformat with infernal but the following code worked for me (installed esl-reformat with HMMER):

perl its2-to-stk.pl ITS2db_prasinophytes_example_alignment.txt | esl-reformat --informat pfam pfam - > ITS2db_prasinophytes_example_alignment.stk

janstrauss1 commented 2 years ago

@AntonPetrov, I've uploaded an example Stockholm alignment file with consensus structure as created by the approach provided by @nawrockie using the its2-to-stk.pl perl script together with esl-reformat and esl-construct from the easel library.

Could you please confirm if this is in the Stockholm format that is needed and easily utilized by Rfam?

I will then work on making a better selection of sequences for the structure-sequence alignment for my specific case to create an R2DT template.

AntonPetrov commented 2 years ago

Thanks Eric and Jan! The Stockholm file works and when I ran it through the Rfam pipeline, it generates the following results: http://preview.rfam.org/searches/ITS2.html

If you set the threshold to around 90 bits, you will see a nice Micromonas-specific alignment.

This would not be a good Rfam family as it has a very narrow taxonomic distribution, but as this is just a proof of principle, it is a success 🎉

Ideally the sequence IDs need to refer to an INSDC database like ENA or GenBank, so instead of 65427922_Micromonas_pusilla we would need MT117943.1/2056-2200 which I found using NCBI Blast.
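The accession.version/start-end naming shown above can be checked with a small Python sketch. The regular expression is an assumption about what such names look like in general, based only on the single example in this comment; it is not an official Rfam validation rule, and `parse_rfam_style_name` is a hypothetical helper.

```python
import re

# Assumed shape: <accession>.<version>/<start>-<end>
NAME_PATTERN = re.compile(
    r"^(?P<accession>[A-Za-z0-9_]+)\.(?P<version>\d+)/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_rfam_style_name(name):
    """Split 'ACCESSION.VERSION/START-END' into parts, or return None."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "accession": match["accession"],
        "version": int(match["version"]),
        "start": int(match["start"]),
        "end": int(match["end"]),
    }


print(parse_rfam_style_name("MT117943.1/2056-2200"))
print(parse_rfam_style_name("65427922_Micromonas_pusilla"))
```

Names that come straight from the ITS2 database export do not match this pattern, so a check like this can flag rows that still need renaming before submission.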

janstrauss1 commented 2 years ago

Great 👍 , thanks for the feedback @AntonPetrov!

So I will start working on selecting a better range of sequences to construct an Rfam family. It may take me a little bit of time though.

I will also use versioned GenBank Accessions with locations like MT117943.1/2056-2200 as you suggest. Actually, just for general information, the current sequence IDs like 65427922_Micromonas_pusilla as obtained from the ITS2 Database already include GenBank gene identifiers (GI accessions) and should be findable by searching GenBank for the GI number.
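For illustration, the GI number can be split off such an ID with a few lines of Python. This assumes every ID has the form of the one example above (digits, an underscore, then the species name with underscores for spaces); `split_its2db_id` is a hypothetical helper.

```python
def split_its2db_id(seq_id):
    """Split '<GI>_<Genus>_<species>' into (GI number, species name)."""
    gi, _, rest = seq_id.partition("_")
    if not gi.isdigit():
        raise ValueError(f"no leading GI number in {seq_id!r}")
    return int(gi), rest.replace("_", " ")


gi_number, species = split_its2db_id("65427922_Micromonas_pusilla")
print(gi_number, species)
```

The extracted GI number could then be used to look up the corresponding versioned accession in GenBank.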

Many thanks again for the great support and I will get back as soon as I have data to create a new Rfam family!

AntonPetrov commented 2 years ago

Great, thanks @janstrauss1! You are right, I did not realise that those numbers were GIs - that's convenient.

Looking forward to your alignment (or alignments). It would be great to have some ITS2 coverage in Rfam. Let me know if you need anything else and many thanks for working on this!

RNAcentral / R2DT

Adding templates for visualization of ITS2 secondary structures for evolutionary comparison #52