[Question] Which structure is used for the plots in the paper

LinearFold / LinearTurboFold

An end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs

[Question] Which structure is used for the plots in the paper #6

Closed waltergallegog closed 2 years ago

waltergallegog commented 2 years ago

Hello, I have run LTF with the 25 SARS-CoV-2 and SARS-related genomes, which produces:

file output.aln
files .ct and .db for each sequence

Then, with combine_results.py, I obtain an ltf.out file with the alignment and the structure for each sequence.

I would like to reproduce the secondary structure plots from the paper, for example the one in Figure 3.

Does this figure correspond to the structure of one of the 25 sequences, or is it a consensus structure?
Could you share how you obtained the consensus structure (if applicable)?
Could you share which software you used for the plots?

Thanks for the help.

sizhen commented 2 years ago

Hello,

Thanks for your interest in our work. The secondary structure shown in Figure 3A is the LinearTurboFold prediction for the reference sequence (NC_004718.3). We used StructureEditor to draw the plots. As for the consensus structure: LinearTurboFold encourages sequences to fold similarly but still allows structural elements to vary between sequences, so it does not predict a consensus structure. However, we did find some structures conserved among all 25 sequences, using compensatory mutations as signals (Tables S2 and S3).

Please let me know if you have further questions.

Sizhen

waltergallegog commented 2 years ago

Hello, thanks for the feedback.

So, if I understood correctly:

The structure in Figure 3A is obtained from the reference sequence NC_004718.3, which in the samples25.fasta file corresponds to AY274119.3_Severe_acute_respiratory_syndrome-related_coronavirus_isolate_Tor2__complete_genome. So to get it, I can use StructureEditor with the .ct or .db file for AY274119 that I already have as input.

Then the additional information in the figure (like the compensatory mutations) is obtained from the structures of the other sequences.

Walter

sizhen commented 2 years ago

Hello Walter,

Oh, sorry, the reference sequence of SARS-CoV-2 should be >NC_045512.2_Wuhan_seafood_market_pneumonia_virus_isolate_Wuhan-Hu-1__complete_genome. StructureEditor probably cannot handle such a long sequence (~30,000 nt), so I only drew the UTR regions and the interactions between them. Yes, the compensatory mutation information is extracted from the multiple sequence alignment predicted by LinearTurboFold.

Sizhen

waltergallegog commented 2 years ago

OK, got it. Thanks again. Best regards, Walter

waltergallegog commented 2 years ago

Hello Sizhen, sorry to reopen this issue, just one more question. I was able to draw the basic structure of NC_045512.2 using StructureEditor and a .db file containing only the 5' and 3' UTRs, as in the paper. The only difference is the numbering of the nucleotides in the 3' UTR. In the paper you were able to keep the correct indices (29464 onwards), but as I'm using a .db file, this indexing is lost.

I tried using the .ct file instead, since it contains the indices explicitly, and keeping only the beginning and the end of the file, but StructureEditor complains about the indices after 29464 being out of range.

How did you manage to keep the correct indices for the 3' UTR?

Thanks Walter

sizhen commented 2 years ago

Hey Walter,

What do you mean by the indexing being lost? Could you give me more information (like a screenshot or the .db file) so I can try to fix it? You can also send me the .ct file, and I can help convert it to dot-bracket format first. If you would like to convert it yourself, you can try the ct2dot program from RNAstructure, but you will need to hack the code and remove the length limit first.

Yeah, StructureEditor cannot handle such a long sequence, so I only drew the structure of the UTR regions and the interactions between them, which keeps the input length small.
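
As an alternative to patching ct2dot, the conversion itself is simple enough to sketch in a few lines of Python. This is a minimal sketch, not the RNAstructure implementation: the function name `ct_to_dot_bracket` is hypothetical, and it assumes a single, pseudoknot-free structure in the usual CT column layout (index, base, previous index, next index, pairing partner, natural numbering).

```python
def ct_to_dot_bracket(ct_text):
    """Convert the first structure in CT-format text to (sequence, dot-bracket).

    Assumption: no pseudoknots. Crossing pairs would need extra bracket types.
    """
    lines = [ln for ln in ct_text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])      # header line: length, then a free-text title
    seq = []
    dots = ["."] * n
    for ln in lines[1:n + 1]:
        fields = ln.split()
        i, base, j = int(fields[0]), fields[1], int(fields[4])
        seq.append(base)
        if j > i:                     # record each pair once, from its 5' side
            dots[i - 1] = "("
            dots[j - 1] = ")"
    return "".join(seq), "".join(dots)
```

Because nothing here depends on the sequence length, a full-genome .ct file can be converted the same way as a short one.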

Thanks, Sizhen

waltergallegog commented 2 years ago

Hello Sizhen, sure, let me provide more info. This is the structure I was able to draw. As in Figure 3A, the 5' UTR is on top and the 3' UTR is on the bottom. [Image: 17_NC_045512 2_Wuhan_seafood_market_pneumonia_virus_isolate_Wuhan-Hu-1__complete_genome_3and5]

But as you can see, the index labels in the 3' UTR are not correct. They do not correspond to the real ones, which should start at 29464. Instead they start at 401, because I drew the structure from this db file: 17_NC_045512.2_Wuhan_seafood_market_pneumonia_virus_isolate_Wuhan-Hu-1__complete_genome_3and5.txt. This db file is just the first 400 bases of the full sequence followed by the last 440.
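
The trimming described here (keep the first and last bases, drop the middle) can be sketched as follows. This is only an illustration under stated assumptions: the function names `pair_table` and `keep_ends` are hypothetical, the structure is assumed to be pseudoknot-free dot-bracket, and any pair whose partner lies in the removed middle is turned into an unpaired position so the brackets stay balanced. The third return value records the original 1-based position of every kept base, which is the information that gets lost when the file is redrawn.

```python
def pair_table(dots):
    """Return partner[i] = 0-based partner of position i, or -1 if unpaired."""
    partner = [-1] * len(dots)
    stack = []
    for i, ch in enumerate(dots):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def keep_ends(seq, dots, head, tail):
    """Keep the first `head` and the last `tail` positions of a structure.

    Returns (sub_sequence, sub_dot_bracket, original_1_based_indices).
    """
    n = len(seq)
    if head + tail > n:
        raise ValueError("head and tail regions overlap")
    kept = list(range(head)) + list(range(n - tail, n))
    kept_set = set(kept)
    partner = pair_table(dots)
    sub_dots = []
    for i in kept:
        j = partner[i]
        if j == -1 or j not in kept_set:
            sub_dots.append(".")      # unpaired, or partner was cut away
        else:
            sub_dots.append("(" if i < j else ")")
    sub_seq = "".join(seq[i] for i in kept)
    return sub_seq, "".join(sub_dots), [i + 1 for i in kept]
```

With `head=400` and `tail=440`, the returned index list would run from 1 to 400 and then continue with the original positions of the last 440 bases, which can be used to look up the correct label for each drawn nucleotide.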

In order to have the correct indices, I tried using a .ct file like this: 17_NC_045512.2_Wuhan_seafood_market_pneumonia_virus_isolate_Wuhan-Hu-1__complete_genome_3_5.txt

But I get this error: [Image: StructureEditor_error]

Even though the total number of bases is 840, it seems StructureEditor does not like that some of those bases have an index over 10000.

How did you manage to have the correct index labels for the 3' UTR?
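
One way to avoid out-of-range indices is to renumber the trimmed structure from 1 to N and carry the original positions in the sixth CT column, which the CT format reserves for "natural numbering". The sketch below is an assumption-laden illustration: the function name `to_ct` is hypothetical, the input is assumed to be pseudoknot-free dot-bracket, and nothing in this thread confirms that StructureEditor uses the sixth column for its labels, so the drawing may still show 1 to N.

```python
def to_ct(seq, dots, natural, title="structure"):
    """Write CT text with indices 1..N and `natural` in the sixth column.

    `natural` is a list of original positions, one per base (an assumption of
    this sketch: it is copied verbatim into the natural-numbering column).
    """
    n = len(seq)
    partner = [0] * n                 # CT uses 0 for "unpaired"
    stack = []
    for i, ch in enumerate(dots):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i] = j + 1        # store 1-based partner indices
            partner[j] = i + 1
    lines = [f"{n} {title}"]
    for i in range(n):
        nxt = i + 2 if i + 1 < n else 0   # last base has no successor
        lines.append(f"{i + 1} {seq[i]} {i} {nxt} {partner[i]} {natural[i]}")
    return "\n".join(lines) + "\n"
```

All pairing partners are expressed in the new 1..N numbering, so no index in the first five columns exceeds the number of bases in the file.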

Thanks Walter

sizhen commented 2 years ago

Oh, I see what you mean.

Actually, I modified the indices in the 3' UTR manually using Adobe Reader, lol.

Sizhen

waltergallegog commented 2 years ago

Oh OK, yeah, it is easier that way. I guess I'll do the same then. Thanks, Walter