Develop a canonical geoduck transcriptome (or transcriptomes)

sr320 commented 5 years ago

We have a "mega" transcriptome, but at over 1M contigs it is not realistic. I think we need to create a more realistic transcriptome.

https://github.com/RobertsLab/resources/wiki/Genomic-Resources#panopea-generosa v5

This is from "all" of the libraries. https://robertslab.github.io/sams-notebook/2018/09/04/transcriptome-assembly-geoduck-rnaseq-data.html

[sr320@mox1 geoduck]$ grep -c ">" Pgenerosa_transcriptome_v5.fasta 
1363959
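For anyone wanting more than the raw count: `grep -c ">"` counts every line containing `>`, which equals the number of records only when `>` appears solely at the start of header lines. The sketch below counts header lines explicitly and adds two length summaries. The function name `fasta_stats` is made up for illustration, and N50 is included only as a commonly reported statistic, not as something used in this thread.

```python
from typing import Iterable


def fasta_stats(lines: Iterable[str]) -> dict:
    """Count FASTA records and summarize their sequence lengths."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)

    total = sum(lengths)
    # N50: the contig length at which the running sum over contigs,
    # sorted longest first, reaches at least half of all bases.
    n50 = 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_bases": total, "n50": n50}


if __name__ == "__main__":
    example = [">c1", "ACGTACGT", ">c2", "ACG", "T", ">c3", "AC"]
    print(fasta_stats(example))
```

To run it on the real file, pass an open handle, e.g. `fasta_stats(open("Pgenerosa_transcriptome_v5.fasta"))`; the file is read line by line, so memory use scales with the number of contigs rather than the file size.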

shellywanamaker commented 5 years ago

Looking through the FASTQs Sam used to make this mega transcriptome, there are multiple tissue types, juveniles, larvae, pooled samples, etc. @sr320 do we want to focus on only the juvenile data for this "more specific" transcriptome?

sr320 commented 5 years ago

Good question. Or maybe there is some way we can create separate transcriptomes for each tissue, then combine them to best represent the genes that are actually expressed?

kubu4 commented 5 years ago

This transcriptome has not been filtered in any fashion, so it includes all isoforms, partial transcripts, and contigs of all sizes. It can/should be filtered and probably "compressed". However, I'd argue that having this ridiculously large transcriptome is very useful for the genome annotation(s), as it provides a wealth of evidence used for generating the gene models (e.g. intron/exon boundaries, SNPs, UTRs) that would potentially be discarded during any filtering/compression.
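As a rough illustration of what "filtered" could mean here, the sketch below drops short contigs and keeps only the longest isoform per gene. It assumes Trinity-style contig IDs ending in `_i<number>` (for example `TRINITY_DN1_c0_g1_i2`); the helper names and the `min_len` threshold are hypothetical, and this is not necessarily the filtering anyone in this thread applied. The unfiltered file would still be the one to keep as annotation evidence.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: Trinity-style IDs of the form <gene>_i<isoform number>.
ISOFORM_RE = re.compile(r"^(?P<gene>.+)_i\d+$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs; the id is the first word of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]], min_len: int = 0
) -> Dict[str, Tuple[str, str]]:
    """Map each gene to its longest (id, sequence) among contigs >= min_len."""
    best: Dict[str, Tuple[str, str]] = {}
    for name, seq in records:
        if len(seq) < min_len:
            continue
        match = ISOFORM_RE.match(name)
        # IDs that do not follow the assumed pattern are treated as their own gene.
        gene = match.group("gene") if match else name
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best


if __name__ == "__main__":
    demo = [
        ">TRINITY_DN1_c0_g1_i1 len=4", "ACGT",
        ">TRINITY_DN1_c0_g1_i2 len=6", "ACGTAC",
        ">TRINITY_DN2_c0_g1_i1 len=2", "AC",
    ]
    for gene, (contig, seq) in longest_isoform_per_gene(read_fasta(demo), min_len=3).items():
        print(gene, contig, len(seq))
```

Keeping the longest isoform is only one possible collapsing rule; it is simple, but it can favor chimeric or misassembled contigs, so the result would need checking before being treated as a gene set.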

shellywanamaker commented 5 years ago

What is the goal of this "more realistic" transcriptome? Is it to complement the genome annotations?

sr320 commented 5 years ago

In my mind there are likely about 30-50k genes, and it would be nice to know what those are so we can compare across species.

And I guess I am presuming compressing the 1M current contigs will not get us there.

My concern is that we could be handicapping gene models by suggesting there are multiple genes when in fact it is an artifact of assembly.

shellywanamaker commented 5 years ago

@ksil91, @sr320 mentioned you found some software to simplify your oly transcriptome. What software/pipeline did you end up using?

ksil91 commented 5 years ago

It sounds like you have different tissues/life stages, but they are all from the same population? In that case, I recommend DRAP (http://www.sigenae.org/drap/index.html). It worked great for me (down to 50k contigs), and that's even with individuals pooled from really diverged scallop populations. You can do runDRAP on each library type separately (tissue/larval/treatment/etc), which will provide you with realistically sized individual assemblies, then metaDRAP to combine across libraries to make a single transcriptome. DRAP installs best as a Docker container or under another container manager (on my cluster they installed it using Singularity); otherwise it would be a huge pain to install, as it has many dependencies. If you somehow do have issues installing it, I can run it on my cluster for you anytime before they kick me off April 1, as I'm not running anything large-memory.

If you have sequenced multiple populations, let me know and I can discuss my assembly approach for that.

I also heard from Melissa DeBiasse (https://melissadebiasse.weebly.com/) that their lab has found you can have TOO MANY reads in Trinity, which leads to these mega assemblies. She said they are writing up a paper about subsampling reads. If you have samples across multiple lanes, it might be worth trying an assembly (DRAP or otherwise) with only one lane per sample, or with a random half of the reads.
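A minimal sketch of the "random half of the reads" idea for paired-end data is below. It assumes unwrapped 4-line FASTQ records and that R1 and R2 list reads in the same order; the function names, the default fraction, and the seed are all illustrative choices, not anything from the forthcoming paper mentioned above.

```python
import random
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes unwrapped, well-formed input)."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def subsample_pairs(
    r1: Iterable[FastqRecord],
    r2: Iterable[FastqRecord],
    fraction: float = 0.5,
    seed: int = 42,
) -> Iterator[Tuple[FastqRecord, FastqRecord]]:
    """Keep each read pair with probability `fraction`.

    One random draw is made per pair, so mates are always kept or dropped
    together. A fixed seed makes the selection reproducible.
    Note: zip() stops at the shorter input, so unequal files are truncated silently.
    """
    rng = random.Random(seed)
    for rec1, rec2 in zip(r1, r2):
        if rng.random() < fraction:
            yield rec1, rec2


if __name__ == "__main__":
    r1_lines = [x for i in range(6) for x in (f"@read{i}/1", "ACGT", "+", "IIII")]
    r2_lines = [x for i in range(6) for x in (f"@read{i}/2", "TGCA", "+", "IIII")]
    for left, right in subsample_pairs(read_fastq(r1_lines), read_fastq(r2_lines)):
        print(left[0], right[0])
```

Because each pair is kept independently, the kept fraction is only approximately `fraction`. For full-size libraries a dedicated subsampling tool would likely be faster than pure Python.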

Re: Sam's comment about the large transcriptome being useful for gene models, I think it would be better to have each library type assembled separately (and filtered/compressed a little) and use them that way, as opposed to a mega assembly. I think the C. gigas annotation used separate transcriptomes for development stage/stressor/tissue in their assembly annotation.

kubu4 commented 5 years ago

Cool! Thanks for all of this! We'll revamp things a bit!

shellywanamaker commented 5 years ago

Thanks @ksil91! I believe the RNAseq data comes from different experiments and likely from different geoduck populations. What was your assembly approach for different populations? Thanks again!

sr320 commented 5 years ago

Certainly juveniles/larvae and adults are from different populations.

Thanks, Steven

ksil91 commented 5 years ago

Sending email with some attachments for this other method using OrthoFinder. However, thinking about it more, I would not recommend using this OrthoFinder method for generating a transcriptome to use for annotating the genome, as it will throw away genes that are not found across all libraries (so it would not include genes that are only expressed in larvae, for example). This was useful for what I was doing (differential gene expression across multiple populations), but might not be useful for what you are trying to do. In that case, I would still recommend using DRAP then metaDRAP. Also, you can use either Oases or Trinity within DRAP, and if you read the paper it shows that Oases with multiple kmers performs better than Trinity, so I used Oases.
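The "throws away genes not found across all libraries" point can be shown with a toy example. The sketch below does not call OrthoFinder or use its output format; the orthogroup names, library labels, and the `split_by_coverage` function are invented purely to illustrate the intersection logic.

```python
from typing import Dict, Set, Tuple


def split_by_coverage(
    presence: Dict[str, Set[str]], libraries: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Split orthogroups into those seen in every library and the rest.

    `presence` maps an orthogroup ID to the set of libraries in which at
    least one of its transcripts was assembled.
    """
    shared = {group for group, libs in presence.items() if libraries <= libs}
    dropped = set(presence) - shared
    return shared, dropped


if __name__ == "__main__":
    libraries = {"larvae", "gonad", "heart"}
    presence = {
        "OG1": {"larvae", "gonad", "heart"},  # expressed everywhere: kept
        "OG2": {"larvae"},                    # larvae-only: lost
        "OG3": {"gonad", "heart"},            # adult-tissue-only: lost
    }
    shared, dropped = split_by_coverage(presence, libraries)
    print("kept:", sorted(shared))
    print("dropped:", sorted(dropped))
```

Requiring presence in every library suits cross-population expression comparisons, where only shared genes can be compared, but for annotation the dropped set is exactly the stage- and tissue-specific evidence one would want to keep.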

sr320 commented 5 years ago

@kubu4 - let's dig back into this if not already

kubu4 commented 5 years ago

> I think it would be better to have each library type assembled separately (and filtered/compressed a little)

I've done this (but am still waiting on heart tissue - assembly has been strangely problematic and lengthy).

kubu4 commented 5 years ago

Could not install DRAP on Mox (one dependency is no longer available, and Docker containers can't be used on Mox).

Have installed the Docker container on Emu and am currently running through the test data sets.

Will update once running with our data.

RobertsLab / resources

Develop a canonical geoduck transcriptome (or transcriptomes) #576