alevar / chessRevisions

chess revision scripts and notebooks
GNU General Public License v3.0
3 stars 1 forks source link

How can I get ORF form the output of stringtie? #2

Open ZHIDIHUAYUAN opened 3 years ago

ZHIDIHUAYUAN commented 3 years ago

Hi, From your paper, it said that you get the longest ORF using the gffread, but the output of stringtie does not have the CDS information. Could you please tell me how you get it?

alevar commented 3 years ago

Hi!

StringTie indeed does not perform any ORF analysis. In the paper we used gffread with the '-y' option to obtain aminoacid sequence for each ORF in other databases (GENCODE, RefSeq, etc) and selected the longest one for each gene. The identification of transcripts which can accommodate an ORF from other databases was performed via a combination of gffcompare and custom scripts.

If you are interested in identifying transcripts in your datasets which can "fit" an ORF from other sources (GENCODE,RefSeq, CHESS, etc) you could follow a similar approach as above. Alternatively, as part of our current work we have developed a small tool: https://github.com/alevar/genomic_scripts/tree/main/orfanage which specifically searches for the best known ORF for each query transcript. This can handle cases where novel exons may introduce additional amino-acids into the protein sequence; inframe and out-of-frame changes, etc. The method is currently still in development, but the basic functionality should work well. To run this tool, here is a sample command "orfanage -i -o -c <comma-separated list of known sources with CDS records in GTF/GFF format> -r . The method will produce several outputs, but the main one has the suffix ".perfect.gtf" and will contain all transcripts with added CDS records for which an known ORF fits without any modifications (ie. changes to the aminoacid composition).

Hope this helps and please let me know if I may assist you with anything else!

Ales

ZHIDIHUAYUAN commented 3 years ago

Hi, Thank you very much for your apply. I still have some questions. (1) Why do you use gffcompare instend of stringtie --merge when merging the gtfs from multiple samples? (2) How do you calculate the TPM value? Does it come from the outputs of stringtie? (3) I used the bam files from three GM12878 samples to follow your pipeline. But I found that the outlier TPM of known lncRNA transcripts was ~1.68, but yours was ~13.8. Does it mean that I have something wrong? Or is it because your data contains many different organizations?

Best wishes

Mengyu Zhou

alevar commented 3 years ago
  1. Gffcompare is a dedicated software which provides finer control over merging of GTF/GFF files and computes several additional outputs. We've been using it in place of StringTie --merge ever since the software was first conceived and implemented. That being said, for large datasets, running gffcompare might be challenging due to computational resources required to compute all key outputs (some of which might not always be necessary). Trmap has been implemented to perform many of the same tasks as gffcompare, but more efficiently.

  2. The TPM values are indeed computed by StringTie for each sample which you assemble. You can find it in the "TPM" tag which StringTie will put in the attributes field for each "transcript" feature which it assembles, along with the coverage and FPKM metrics. However, there is no implementation to "combine" TPM values from multiple samples when merging them using "gffcompare" or any of the other methods. Gffcompare will output a list of TPM for each "merged" transcript, but it's up to the end-user to decide how to combine/interpret these values for a merged transcript.

  3. Indeed, using a different dataset will produce different expression values for the assembled genes/transcripts, due to the biological variation in the expression of each gene. While TPM values are normalized to allow for better cross-sample comparisons, the normalization is by no means perfect either.

Hope this helps clear up your questions! Please let me know if there is anything else I may be able to assist you with.

Best,

Ales

ZHIDIHUAYUAN commented 3 years ago

Hi, Thank you very much. I still have a question. (1) From your paper, you used the Refseq annotation file which contains the pseudo=TRUE and partial=TRUE tag, but if I used the GENCODE annotation ,I can not find these tags. How can I decide the expression outlier of known transcript? (2)Could I just use all the lncRNA and mRNA transcript form the single loci genes to decide this value?