Visualising Rfam alignments

AntonPetrov commented 10 months ago

Hi Fabrice!

I tried visualising an Rfam alignment with the following command: java -jar /rna/rnartistcore-0.4.4-SNAPSHOT-jar-with-dependencies.jar /rna/root_folder/rfam/rna.kts and while I get the diagram, it generates lots of warnings if I don't provide a name in my rna.kts:

rnartist {
  png {
    path = "media/"
  }

  svg {
    path = "media/"
  }

ss {
  rfam {
    id = "RF01725"
    name = "URS0000D65C24_12908/1-104"
  }
}

  theme {
    details {
      value = 2
    }
  }
}

Is there a way to generate a layout for the consensus Rfam sequence, or should I be generating a mini Stockholm file with just the consensus sequence and the SS_Cons line?

Thank you for looking into this!

fjossinet commented 10 months ago

Hi Anton,

you can if you set the name to "consensus" in the rfam element. This is not enough highlighted in the documentation, I will fix it. You can see it in one of the script samples concerning the Stockholm format.

While testing your script, I realized that I did not checked everything related to Rfam since a while. The value of the name property is not used to produce the output files. So you get null.png and null.svg. I will fix it.

Here is the output for your script with the name set to "URS0000D65C24_12908/1-104":

And here is the output with the name set to "consensus":

Hope this help.

AntonPetrov commented 10 months ago

Yes, the consensus bit work great, thank you!

Just curious what do you mean by consensus for Rfam? I would've imagined it to be the RF line but then the image would look closer to:

I do really like how RNArtist adds spacing before the second helix to avoid an overlap, this is beautiful! 👏

fjossinet commented 10 months ago

An Rfam consensus for RNArtist is the 2D defined in the GC SS_cons line. This allows me to easily color the orthologuous structural domains (junctions and helices) and highlight their evolution between species.

For example to color the first apical loop in red for RF01725, I'm using its location in the consensus 2D (96 to 131). In my script, I'm adding the instruction "use alignment numbering" in the Rfam element to let RNArtist understand that.

Consequently, this script:

rnartist {
  png {
    path = "media/"
  }

  svg {
    path = "media/"
  }

ss {
  rfam {
    id = "RF01725"
    use alignment numbering
  }
}

  theme {
    details {
      value = 3
    }

    color {
        value = "red"
        type = "N"
        location {
           96 to 131
        }
    }
  }
}

will produce a 2D for each RNA in the alignment and for the consensus. Here is the result for the consensus:

RF01725_consensus

And that's the result for two RNA species (URS0000AB1A0E_12908 and URS0000D68715_12908):

RF01725_URS0000AB1A0E_12908_1-93

RF01725_URS0000D68715_12908_1-142

And this is why you get some errors when the script is computed. For some species, some domains could disappear according to this consensus and the drawing algorithm is lost. I have fixed that and it worked well before. But I had to inactivate this fix due to some recent modifications in my code. I will reactivate it.

AntonPetrov commented 10 months ago

Okay, I understand. Thank you Fabrice!

I can get what I am after by taking the reference RF and SS_CONS lines to make a mini Stockholm file and then for RF01725 I get this diagram:

rfline

This diagram is a perfect representation of an Rfam family because it captures the consensus structural elements, does not have any overlaps, and it shows the location of consensus nucleotides. I will explore using these drawings as R2DT templates.

For reference, here are my `rfline.sto` and `rfline.kts` files

`rfline.sto`: ``` # STOCKHOLM 1.0 #=GF DE test #=GF TP test rf_line aaaaagCAUcAAGAGagGgucuaAagacCcCgcCAACCuggcuaaccagccAaGGUGgcuaaAaagAUGcGGcccaaaagggCAAaaaAaggaguuaaa #=GC SS_cons :::::(((((,,,,<-<<<<<____>>>>>><<<--<<<<<<_AAAAA>>>->>>->>>,,,,,,)))))-<<<<____>>>>:::::::aaaaa:::: // ``` and `rfline.kts`: ``` rnartist { png { path = "media/" } svg { path = "media/" } ss { stockholm { file = "/rna/root_folder/rfam/rfline.sto" } } theme { details { value = 3 } } } ```

As a side note, I was a bit confused by the consensus flag in RNArtist because the Rfam consensus technically is the RF line. See Infernal manual for example:

The #=GC RF line is Stockholm-format reference coordinate annotation, with a residue marking each column that the CM considered to be consensus.

The drawings produced with the consensus flag are perhaps better described along the lines of full-width or all-columns. Such drawings are very useful because they include every single column of the alignment but I would argue that they are not fully representative of the alignment because they may contain large loops that are present only in a handful of individual sequences and are most likely underfolded.

I am closing this issue and thank you again for getting back to me so quickly! 🚀

fjossinet commented 10 months ago

I agree with you concerning the RNArtistCore flag consensus which is confusing. all-columns is a good choice. I will change the code to make consensus in agreement with the Infernal manual. The drawings produced with this all-columns mode allow me to quickly see the location of those "additional domains" present only in a handful of individual sequences. Btw, is it not possible for Rfam to search for new helices in such domains in order to extend the consensus?

AntonPetrov commented 10 months ago

all-columns is a good choice. I will change the code to make consensus in agreement with the Infernal manual.

Awesome, thank you!

The drawings produced with this all-columns mode allow me to quickly see the location of those "additional domains" present only in a handful of individual sequences.

Cool, it does make sense!

is it not possible for Rfam to search for new helices in such domains in order to extend the consensus?

I can't speak for Rfam anymore but in general Rfam is all about families and defining family-level characteristics, and some structural elements are just not conserved at family level.

To deal with these sequence-specific structures, in R2DT we introduced a constrained folding mode that uses Rfam structure as a constraint and predicts the rest of the structure. It works well but it requires running ViennaRNA in the background. At least it is easy to install with pip now, which works well for R2DT, as it is written in Python.

fjossinet / RNArtistCore

Visualising Rfam alignments #5