Open ZHIDIHUAYUAN opened 3 years ago
Hi!
StringTie indeed does not perform any ORF analysis. In the paper we used gffread with the '-y' option to obtain aminoacid sequence for each ORF in other databases (GENCODE, RefSeq, etc) and selected the longest one for each gene. The identification of transcripts which can accommodate an ORF from other databases was performed via a combination of gffcompare and custom scripts.
If you are interested in identifying transcripts in your datasets which can "fit" an ORF from other sources (GENCODE,RefSeq, CHESS, etc) you could follow a similar approach as above. Alternatively, as part of our current work we have developed a small tool: https://github.com/alevar/genomic_scripts/tree/main/orfanage which specifically searches for the best known ORF for each query transcript. This can handle cases where novel exons may introduce additional amino-acids into the protein sequence; inframe and out-of-frame changes, etc. The method is currently still in development, but the basic functionality should work well. To run this tool, here is a sample command "orfanage -i
Hope this helps and please let me know if I may assist you with anything else!
Ales
Hi, Thank you very much for your apply. I still have some questions. (1) Why do you use gffcompare instend of stringtie --merge when merging the gtfs from multiple samples? (2) How do you calculate the TPM value? Does it come from the outputs of stringtie? (3) I used the bam files from three GM12878 samples to follow your pipeline. But I found that the outlier TPM of known lncRNA transcripts was ~1.68, but yours was ~13.8. Does it mean that I have something wrong? Or is it because your data contains many different organizations?
Best wishes
Mengyu Zhou
Gffcompare is a dedicated software which provides finer control over merging of GTF/GFF files and computes several additional outputs. We've been using it in place of StringTie --merge ever since the software was first conceived and implemented. That being said, for large datasets, running gffcompare might be challenging due to computational resources required to compute all key outputs (some of which might not always be necessary). Trmap has been implemented to perform many of the same tasks as gffcompare, but more efficiently.
The TPM values are indeed computed by StringTie for each sample which you assemble. You can find it in the "TPM" tag which StringTie will put in the attributes field for each "transcript" feature which it assembles, along with the coverage and FPKM metrics. However, there is no implementation to "combine" TPM values from multiple samples when merging them using "gffcompare" or any of the other methods. Gffcompare will output a list of TPM for each "merged" transcript, but it's up to the end-user to decide how to combine/interpret these values for a merged transcript.
Indeed, using a different dataset will produce different expression values for the assembled genes/transcripts, due to the biological variation in the expression of each gene. While TPM values are normalized to allow for better cross-sample comparisons, the normalization is by no means perfect either.
Hope this helps clear up your questions! Please let me know if there is anything else I may be able to assist you with.
Best,
Ales
Hi, Thank you very much. I still have a question. (1) From your paper, you used the Refseq annotation file which contains the pseudo=TRUE and partial=TRUE tag, but if I used the GENCODE annotation ,I can not find these tags. How can I decide the expression outlier of known transcript? (2)Could I just use all the lncRNA and mRNA transcript form the single loci genes to decide this value?
Hi, From your paper, it said that you get the longest ORF using the gffread, but the output of stringtie does not have the CDS information. Could you please tell me how you get it?