Examine differential gene expression in hemato-crab RNA-seq data

sr320 commented 4 years ago

Starting with raw data, comparing libraries of "infected" crabs under different conditions with the intent of finding all (hemato and crab etc) differentially expressed genes. Using Trinity pipeline.

aspencoyle commented 3 years ago

Hey, I got the Trinity pipeline up and running (I believe) for my first sample and...it's taking an extremely long time. Took about 4 hours to run through the Inchworm, Chrysalis, and Butterfly stages, which isn't too bad, but it entered Phase 2 about 8 hours ago and is 11% complete. I've got 4 total comparisons to make, which is at least 400 hours of runtime. Any suggestions for speeding things up (maybe remoting into a lab computer?) or should I just let it chug along?

Spec-wise, in order to still keep my computer usable, I assigned it 10/16 GB of ram and 2/4 CPU cores.

kubu4 commented 3 years ago

You shouldn't need to run Trinity, as we already have an assembly (Trinity is a transcriptome assembler, primarily). You should instead look using DEseq2 or EdgeR for the gene expression analysis. You may also need to use Kallisto (or Salmon) to generate gene counts.

aspencoyle commented 3 years ago

...well shoot, there goes a bunch of work. I guess it's good news though, as it moves me past that particular roadblock. Thanks, Sam!

kubu4 commented 3 years ago

Gah! Just saw @sr320's post RE: Trinity pipeline! Sorry! To that end, you'll really need to run Trinity on a higher end computer. It would be worth exploring other computing optioms we have available, like Emu/Roadrunner or using Hyak/Mox. Check the GitHub Resources wiki for info on all of these. Post any questions and we'll get you going.

aspencoyle commented 3 years ago

Fantastic, I'll start on that! Thanks again!

sr320 commented 3 years ago

Sorry - to clarify - no need to develop an actual transcriptome at this point. Use what we have. I was just suggesting that the downstream pipeline trinity recommends is a good place to start .. see https://d.pr/i/YTXFas

aspencoyle commented 3 years ago

Hey, I'm trying to build a transcript expression matrix using the following protocol: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification (under the header "Build Transcript and Gene Expression Matrices")

I'm a little unclear on what the gene-to-transcript mapping file is, and can't find much info on it. Any advice?

Also I went ahead and ran the script without a gene-to-transcript mapping file, and got the following output. I also got all the expected files and they all look okay - just wanted to check that none of the errors are an issue!

/mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl --matrix kallisto.isoform.TPM.not_cross_norm > kallisto.isoform.TMM.EXPR.matrixCMD: R --no-save --no-restore --no-site-file --no-init-file -q < kallisto.isoform.TPM.not_cross_norm.runTMM.R 1>&2
/mnt/c/Users/acoyl/Downloads/anaconda3/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
Error, cmd: R --no-save --no-restore --no-site-file --no-init-file -q < kallisto.isoform.TPM.not_cross_norm.runTMM.R 1>&2  died with ret (32512)  at /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl line 105.
Error, CMD: /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl --matrix kallisto.isoform.TPM.not_cross_norm > kallisto.isoform.TMM.EXPR.matrix died with ret 6400 at /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/abundance_estimates_to_matrix.pl line 385.

kubu4 commented 3 years ago

I'm a little unclear on what the gene-to-transcript mapping file is, and can't find much info on it. Any advice?

For your purposes, you don't need it.

However, this is a feature of Trinity that's pretty nice if you've run a de novo transcriptome assembly with Trinity. After a de novo assembly you frequently end up with a large number of isoforms for a given gene. Trinity can help reduce this data set for gene expression analysis by taking FastQ data and making sure everything gets processed at the gene level instead of at the transcript (i.e. isoform) level.

Also I went ahead and ran the script

Can you please post the command you ran?

aspencoyle commented 3 years ago

Ooh gotcha, thanks for letting me know!

Here's the command I ran:

%%bash /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map 'none' \ --out_prefix kallisto \ --name_sample_by_basedir \ library02/abundance.tsv \ library04/abundance.tsv \ library06/abundance.tsv \ library08/abundance.tsv \ library10/abundance.tsv

I also checked some of Steven's old code and it looks like he was getting similar (though non-identical) error messages at that point: https://github.com/sr320/nb-2020/blob/master/C_bairdi/20-kallisto.ipynb

As a follow-up question, when I tried to read the resulting matrix into R for DESeq2 analysis (following this: https://github.com/sr320/nb-2019/blob/master/C_bairdi/11-Deseq2.Rmd), I kept getting an error message when running DESeqDataSetFromMatrix (line 48), as my counts weren't integers. Maybe the problem is that Kallisto gave me estimated counts to several decimal places rather than raw or rounded counts? Or is something completely different going wrong?

kubu4 commented 3 years ago

As a follow-up question, when I tried to read the resulting matrix into R for DESeq2 analysis (following this: https://github.com/sr320/nb-2019/blob/master/C_bairdi/11-Deseq2.Rmd), I kept getting an error message when running DESeqDataSetFromMatrix (line 48), as my counts weren't integers. Maybe the problem is that Kallisto gave me estimated counts to several decimal places rather than raw or rounded counts? Or is something completely different going wrong?

Please open a different issue.

kubu4 commented 3 years ago

expected files and they all look okay - just wanted to check that none of the errors are an issue!

You should be fine. The errors you received are related to missing R packages. If you don't need/want Trinity to do any further expression analysis, then you don't need to worry about these R-related errors; the R commands are built-in to the abundance_estimates_to_matrix.pl script to allow for a full expression analysis pipeline.

aspencoyle commented 3 years ago

Alright, fairly big potential problem! All of my analyses have been examining 5 pooled libraries of infected crabs, as follows: Library 2: Day 2, lowered temperature Library 4: Day 2, elevated temperature Library 6: Day 0, ambient temperature Library 8: Day 2, all temperatures pooled Library 10: Day 17, all temperatures pooled

My goal has been to look at differential expression at various temps by comparing 2 vs 4, 2 vs 6, and 4 vs 6, and then look at differential expression over time by comparing 8 vs 10.

The problem is that...I have no replication, and so I can't create 4 different DESeq options (one for each comparison) and analyze them separately. When I try, I get the following error:

Error in checkForExperimentalReplicates(object, modelMatrix) :

The design matrix has the same number of samples and coefficients to fit, so estimation of dispersion is not possible. Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.

As a potential workaround, I was able to create 2 different DESeq objects - one treating day as the condition, the other treating temperature as the condition. That creates some replication (3 Day 2 libraries, 2 all-temperature libraries) and so I analyzed them separately. However, this produced some fairly iffy-looking graphs.

Link to code used to create and analyze the 2 different objects (within "OPTION 2" section): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/DESeq2_script.R

Link to graphs and tables produced: https://github.com/afcoyle/hemat_bairdii_transcriptome/tree/main/graphs/DESeq_Option_2

Since this doesn't look like a particularly enlightening analysis, here are several alternative plans I came up with. All involve rerunning the analysis, but using individual libraries (should I include the pooled libraries as well?)

The main difficulty is that we have no replicates of infected crab at lowered temperatures for each date. We have 1 at Day 0, 1 at Day 2, and 1 at Day 17. We also have no libraries of infected crab at elevated temperatures at Day 17.

All the below numbers are the number of individual libraries only

Option 1: Pool all infected samples purely by temperature, look at effect of temp.

6 libraries at elevated temperature
9 libraries at ambient temperature
3 libraries at lowered temperature

Option 2: Pool all infected samples from Day 0-2 by temp, look at effect of temp

6 libraries at elevated temperature
6 libraries at ambient temperature
2 libraries at lowered temperature

Option 3: Pool all infected samples by date, look at effect of date

7 libraries at Day 0
7 libraries at Day 2
4 libraries at Day 17

Option 4: Keep all samples separate by date/temp combination

Difficulty: No replicates for several combinations, not sure if that can work
Workaround: Can look at expression at ambient/high temps for Days 0-2 (3 libraries at each combination of temp and date)
3 libraries at Day 0, high temp
3 libraries at Day 2, high temp
0 libraries at Day 17, high temp
3 libraries at Day 0, ambient temp
3 samples at Day 2, ambient temp
1 sample at Day 17, ambient temp
1 sample at Day 0, lowered temp
1 sample at Day 2, lowered temp
1 sample at Day 17, lowered temp

Alright, that's a whole bunch of info, so I'll do my best to sum up the questions:

Is the analysis I did actually alright, despite looking not-alright and having practically no replication?
Should I start from scratch with my analysis with the individual libraries?
If so, should I include the pooled libraries?
And if so, how should I structure my new analysis?

kubu4 commented 3 years ago

Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.

Haven't had time to digest everything above, but you could try installing v1.20 and treat samples as reps...

aspencoyle commented 3 years ago

I'm thinking about giving it a shot in edgeR instead - they support analysis libraries without replication (although can't find significance) , and I feel like it'd be neat to learn a few different methods of analyzing differential expression!

But I guess the core question (which is probably more for @sr320) is about the end goal of this. If the goal is to find significantly differently-expressed genes, starting over and looking at the individual libraries really seems to me like the only option, assuming there's not something I'm missing. But if the goal is just a descriptive analysis with no significance analysis, maybe I'd be alright with either using v1.20 of DESeq2 or switching over to edgeR with my current libraries.

sr320 commented 3 years ago

I suggest the latter. Simply get the end with DESeq2 with a gene list etc - for example just compare 4 libraries Day 0 to 4 libraries Day 17. Point now is get through pipeline. Then we discuss best way forward.

aspencoyle commented 3 years ago

Alright, I believe I finished my first comparison (Day 0 ambient vs. Day 17 ambient)!

Here's a link to my notebook post, which has a link to all my results: https://afcoyle.github.io/2020-12-18-Comparing-Amb-Over-Time/

Link to the table of differentially-expressed genes (pval <= 0.005): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/graphs/day0_day17_ambient/0vs17_DEGlist.txt

Assuming that all looks good, starting on comparing different temperatures. In order to get enough libraries, I'm pooling Day 0 and 2 crabs by temperature treatment, as described in the notebook post from today and yesterday

Link to that notebook post from yesterday: https://afcoyle.github.io/2020-12-17-Indiv-Libraries/

sr320 commented 3 years ago

Take that gene list to a functional annotation - enrichment analysis.(ultimate ending). Then lets meet and talk about fuller design.

On Fri, Dec 18, 2020 at 9:53 AM afcoyle notifications@github.com wrote:

Alright, I believe I finished my first comparison (Day 0 ambient vs. Day 17 ambient)!

Here's a link to my notebook post, which has a link to all my results: https://afcoyle.github.io/2020-12-18-Comparing-Amb-Over-Time/

Link to the table of differentially-expressed genes (pval <= 0.005): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/graphs/day0_day17_ambient/0vs17_DEGlist.txt

Assuming that all looks good, starting on comparing different temperatures. In order to get enough libraries, I'm pooling Day 0 and 2 crabs by temperature treatment, as described in the notebook post from today and yesterday

Link to that notebook post from yesterday: https://afcoyle.github.io/2020-12-17-Indiv-Libraries/

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/1010#issuecomment-748229653, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABB4PNYVSK25BHPRTUZN4K3SVOJGHANCNFSM4ST2C2QA .

RobertsLab / resources

Examine differential gene expression in hemato-crab RNA-seq data #1010