Running TransDecoder at a higher -m parameter

mollylRivers commented 12 months ago

Hi Brian,

This isn't so much an issue as a question about how TransDecoder works. I have received a transcriptome assembled with rnaSPAdes from a combination of short-read and long-read sequencing. CD-HIT and TransDecoder have already been applied to it, but TransDecoder was run with -m 50. This has resulted in a very large transcriptome (~400,000 transcripts), which is very unlikely to be real in this case. I have used the default -m 100 on my other transcriptomes of closely related species and obtained much smaller transcriptomes (~30,000 to 90,000 transcripts). I tried applying TransDecoder again to the large processed transcriptome (the one with CD-HIT and TransDecoder at -m 50 already applied), and this drastically reduced its size to ~17,000 transcripts. Unfortunately I don't have access to the unprocessed transcriptome, so I can't test how the second run of TransDecoder affects the transcriptome size.

I am wondering what is happening with this second run of TransDecoder. Is it sound to run it a second time with a higher -m cutoff? I assumed that, as I am changing the minimum amino acid length of the ORFs, it would just remove those between 50 and 100 amino acids in length. But it is unlikely that there are over 350,000 reads of this length in my transcriptome. Could this have to do with the increased rate of false-positive ORF predictions at the reduced length parameter?

Many thanks for your help, Molly

brianjohnhaas commented 12 months ago

Hi Molly,

The number of ORF candidates will climb at an exponential rate with decreasing ORF size, which might explain the differences that you're seeing here. Ideally, though, TransDecoder would be run on the entire transcriptome and not just on the earlier predicted ORFs, as running it on the earlier predictions alone will introduce some additional bias for sure.
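As a rough illustration of that exponential trend, here is a minimal sketch under a deliberately simplified model. The model is an assumption for illustration only and is not how TransDecoder scores ORFs: it treats the sequence as uniformly random, so 3 of the 64 codons are stops and the chance that a reading frame stays open for at least n codons is (61/64)^n. The function name `prob_open_at_least` is hypothetical.

```python
# Simplified model (an assumption for illustration, not TransDecoder's algorithm):
# uniformly random sequence, where 3 of the 64 possible codons are stop codons.
STOP_FREE_PROB = 61 / 64


def prob_open_at_least(n_codons: int) -> float:
    """Probability that a random reading frame runs n_codons without a stop."""
    if n_codons < 0:
        raise ValueError("n_codons must be non-negative")
    return STOP_FREE_PROB ** n_codons


if __name__ == "__main__":
    for min_len in (50, 100):
        p = prob_open_at_least(min_len)
        print(f"-m {min_len}: P(frame open for >= {min_len} codons) = {p:.4f}")
    # How much more often a chance ORF clears the 50 aa cutoff than the 100 aa one.
    ratio = prob_open_at_least(50) / prob_open_at_least(100)
    print(f"chance ORFs passing -m 50 vs -m 100: about {ratio:.0f}x")
```

Under this toy model, halving the cutoff from 100 to 50 codons lets through roughly an order of magnitude more chance ORFs. Real transcripts are not random sequence, so this only shows the direction and shape of the effect, not the actual counts to expect.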

There are a number of ways of dealing with the large number of transcripts that get assembled, including other ways to filter based on expression stats: https://github.com/trinityrnaseq/trinityrnaseq/wiki/There-are-too-many-transcripts!-What-do-I-do%3F

Hope this helps,

B

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

mollylRivers commented 11 months ago

Hi Brian,

Thanks so much for this information. When I looked at the transcript length distribution of the transcriptome before and after applying TransDecoder again (at -m 100), I found that when TransDecoder had been applied at -m 50 there were ~400,000 transcripts less than 1,000 bp in length. But after reapplying TransDecoder at -m 100, there were 0 transcripts less than 1,000 bp in length. This would explain the large decrease in transcriptome size. I wonder why this would happen and whether there is a way I can prevent it, as there will likely be transcripts of interest that are less than 1,000 bp in length.
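For anyone wanting to repeat this length check, a minimal sketch in plain Python follows. The helper names `read_fasta` and `count_short_and_long` are hypothetical, and the 1,000 cutoff simply mirrors the threshold mentioned above; it counts sequence characters, so the unit is bp only if the file really holds nucleotide sequences.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_short_and_long(lines: Iterable[str], cutoff: int = 1000) -> Tuple[int, int]:
    """Return (records shorter than cutoff, records at or above cutoff)."""
    short = long_ = 0
    for _, seq in read_fasta(lines):
        if len(seq) < cutoff:
            short += 1
        else:
            long_ += 1
    return short, long_


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle instead.
    demo = io.StringIO(">t1 short\nACGT\nACGT\n>t2 long\n" + "A" * 1000 + "\n")
    print(count_short_and_long(demo))
```

Running the same count on the input and on the output of each TransDecoder run makes it easy to see at which step the short sequences disappear.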

Many thanks, Molly

brianjohnhaas commented 11 months ago

Hi Molly,

It sounds like something didn't run correctly when it was rerun with the -m 100 cutoff. When you rerun it, be sure to do it in a new working directory, otherwise it'll try to reuse earlier intermediates. There should be many transcripts less than 1kb that are capable of producing a peptide at the 100 aa cutoff.
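A minimal sketch of rerunning in a fresh working directory is below. It assumes `TransDecoder.LongOrfs` and `TransDecoder.Predict` are on the PATH and uses only the `-t` and `-m` options discussed in this thread; the function names and any file or directory names passed in are hypothetical. The wrapper refuses to reuse an existing directory so that no earlier intermediates can be picked up.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(transcripts: Path, min_len: int = 100) -> List[List[str]]:
    """Build the two TransDecoder command lines for a given input FASTA."""
    target = str(transcripts.resolve())  # absolute path, since we change directory
    return [
        ["TransDecoder.LongOrfs", "-t", target, "-m", str(min_len)],
        ["TransDecoder.Predict", "-t", target],
    ]


def run_in_fresh_dir(transcripts: Path, workdir: Path, min_len: int = 100) -> None:
    """Run both steps inside a brand-new directory.

    Raises FileExistsError if workdir already exists, so stale intermediates
    from a previous run are never reused by accident.
    """
    workdir.mkdir(parents=True, exist_ok=False)
    for cmd in build_commands(transcripts, min_len):
        subprocess.run(cmd, cwd=workdir, check=True)


if __name__ == "__main__":
    # Hypothetical file and directory names, shown only as a usage example.
    for cmd in build_commands(Path("transcripts.fasta"), min_len=100):
        print(" ".join(cmd))
```

Calling `run_in_fresh_dir(Path("transcripts.fasta"), Path("transdecoder_m100_run"))` would then execute both steps with the new directory as the working directory.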

best,

Brian


mollylRivers commented 11 months ago

Hi Brian,

Thank you for your response. I did retry running in a new directory, but there was no change in the number of transcripts produced. I have attached a txt file with the error message I got. I am not really sure what the problem might be; hopefully you can help me decipher it.

Many thanks, Molly

TransDecoder_m100_error_message.txt

brianjohnhaas commented 11 months ago

Hi Molly,

I might need to try to run this myself to see what's going on.

Can you share this file with me? Catra_transcripts_Lyons_Lab_cdhit95_stranded.fasta

That's your input to TransDecoder, right?

You can send privately to bhaas at broadinstitute dot org

best,

Brian


brianjohnhaas commented 11 months ago

Hi Molly,

Thanks for sharing the file, but what was shared looks to be protein sequences instead of the target transcriptome. If this is the file that was used with TransDecoder, then that would explain the problem. Otherwise, I'll need the transcriptome to explore further.
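A quick way to catch this kind of mix-up before running anything is to check what alphabet the FASTA file uses. The sketch below is a heuristic under stated assumptions, not part of TransDecoder: the function names, the 0.9 threshold, and the 100-record sample size are all arbitrary illustrative choices.

```python
from typing import Iterable

# Characters expected in nucleotide records (N covers ambiguous bases).
NUCLEOTIDE_CHARS = set("ACGTUN")


def nucleotide_fraction(seq: str) -> float:
    """Fraction of characters in seq that are plausible nucleotide codes."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return sum(ch in NUCLEOTIDE_CHARS for ch in seq) / len(seq)


def looks_like_protein(
    fasta_lines: Iterable[str], max_records: int = 100, threshold: float = 0.9
) -> bool:
    """Heuristic: True if the first records are mostly non-nucleotide letters."""
    residues = []
    records = 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            records += 1
            if records > max_records:
                break
        elif line:
            residues.append(line.rstrip("*"))  # drop a trailing stop symbol
    return nucleotide_fraction("".join(residues)) < threshold


if __name__ == "__main__":
    peptide = [">p1", "MKTAYIAKQRQISFVKSHFSRQ*"]
    transcript = [">t1", "ACGTACGTNNACGT"]
    print(looks_like_protein(peptide), looks_like_protein(transcript))
```

If the check flags the intended input as protein, the file is most likely a peptide output from an earlier run rather than the transcript FASTA.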

best, Brian


mollylRivers commented 11 months ago

Hi Brian, Thanks for this, it seems that that may have been the problem all along. Sorry for wasting your time. Thanks, Molly

TransDecoder / TransDecoder

Running TransDecoder at a higher -m parameter #191
