Closed sr320 closed 3 years ago
Hey, I got the Trinity pipeline up and running (I believe) for my first sample and...it's taking an extremely long time. Took about 4 hours to run through the Inchworm, Chrysalis, and Butterfly stages, which isn't too bad, but it entered Phase 2 about 8 hours ago and is 11% complete. I've got 4 total comparisons to make, which is at least 400 hours of runtime. Any suggestions for speeding things up (maybe remoting into a lab computer?) or should I just let it chug along?
Spec-wise, in order to still keep my computer usable, I assigned it 10/16 GB of ram and 2/4 CPU cores.
You shouldn't need to run Trinity, as we already have an assembly (Trinity is a transcriptome assembler, primarily). You should instead look using DEseq2 or EdgeR for the gene expression analysis. You may also need to use Kallisto (or Salmon) to generate gene counts.
...well shoot, there goes a bunch of work. I guess it's good news though, as it moves me past that particular roadblock. Thanks, Sam!
Gah! Just saw @sr320's post RE: Trinity pipeline! Sorry! To that end, you'll really need to run Trinity on a higher end computer. It would be worth exploring other computing optioms we have available, like Emu/Roadrunner or using Hyak/Mox. Check the GitHub Resources wiki for info on all of these. Post any questions and we'll get you going.
Fantastic, I'll start on that! Thanks again!
Sorry - to clarify - no need to develop an actual transcriptome at this point. Use what we have. I was just suggesting that the downstream pipeline trinity recommends is a good place to start .. see https://d.pr/i/YTXFas
Hey, I'm trying to build a transcript expression matrix using the following protocol: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification (under the header "Build Transcript and Gene Expression Matrices")
I'm a little unclear on what the gene-to-transcript mapping file is, and can't find much info on it. Any advice?
Also I went ahead and ran the script without a gene-to-transcript mapping file, and got the following output. I also got all the expected files and they all look okay - just wanted to check that none of the errors are an issue!
/mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl --matrix kallisto.isoform.TPM.not_cross_norm > kallisto.isoform.TMM.EXPR.matrixCMD: R --no-save --no-restore --no-site-file --no-init-file -q < kallisto.isoform.TPM.not_cross_norm.runTMM.R 1>&2
/mnt/c/Users/acoyl/Downloads/anaconda3/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
Error, cmd: R --no-save --no-restore --no-site-file --no-init-file -q < kallisto.isoform.TPM.not_cross_norm.runTMM.R 1>&2 died with ret (32512) at /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl line 105.
Error, CMD: /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/support_scripts/run_TMM_scale_matrix.pl --matrix kallisto.isoform.TPM.not_cross_norm > kallisto.isoform.TMM.EXPR.matrix died with ret 6400 at /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/abundance_estimates_to_matrix.pl line 385.
I'm a little unclear on what the gene-to-transcript mapping file is, and can't find much info on it. Any advice?
For your purposes, you don't need it.
However, this is a feature of Trinity that's pretty nice if you've run a de novo transcriptome assembly with Trinity. After a de novo assembly you frequently end up with a large number of isoforms for a given gene. Trinity can help reduce this data set for gene expression analysis by taking FastQ data and making sure everything gets processed at the gene level instead of at the transcript (i.e. isoform) level.
Also I went ahead and ran the script
Can you please post the command you ran?
Ooh gotcha, thanks for letting me know!
Here's the command I ran:
%%bash /mnt/c/Users/acoyl/Documents/GradSchool/RobertsLab/Tools/Trinity/trinityrnaseq-v2.11.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map 'none' \ --out_prefix kallisto \ --name_sample_by_basedir \ library02/abundance.tsv \ library04/abundance.tsv \ library06/abundance.tsv \ library08/abundance.tsv \ library10/abundance.tsv
I also checked some of Steven's old code and it looks like he was getting similar (though non-identical) error messages at that point: https://github.com/sr320/nb-2020/blob/master/C_bairdi/20-kallisto.ipynb
As a follow-up question, when I tried to read the resulting matrix into R for DESeq2 analysis (following this: https://github.com/sr320/nb-2019/blob/master/C_bairdi/11-Deseq2.Rmd), I kept getting an error message when running DESeqDataSetFromMatrix (line 48), as my counts weren't integers. Maybe the problem is that Kallisto gave me estimated counts to several decimal places rather than raw or rounded counts? Or is something completely different going wrong?
As a follow-up question, when I tried to read the resulting matrix into R for DESeq2 analysis (following this: https://github.com/sr320/nb-2019/blob/master/C_bairdi/11-Deseq2.Rmd), I kept getting an error message when running DESeqDataSetFromMatrix (line 48), as my counts weren't integers. Maybe the problem is that Kallisto gave me estimated counts to several decimal places rather than raw or rounded counts? Or is something completely different going wrong?
Please open a different issue.
expected files and they all look okay - just wanted to check that none of the errors are an issue!
You should be fine. The errors you received are related to missing R
packages. If you don't need/want Trinity to do any further expression analysis, then you don't need to worry about these R
-related errors; the R
commands are built-in to the abundance_estimates_to_matrix.pl
script to allow for a full expression analysis pipeline.
Alright, fairly big potential problem! All of my analyses have been examining 5 pooled libraries of infected crabs, as follows: Library 2: Day 2, lowered temperature Library 4: Day 2, elevated temperature Library 6: Day 0, ambient temperature Library 8: Day 2, all temperatures pooled Library 10: Day 17, all temperatures pooled
My goal has been to look at differential expression at various temps by comparing 2 vs 4, 2 vs 6, and 4 vs 6, and then look at differential expression over time by comparing 8 vs 10.
The problem is that...I have no replication, and so I can't create 4 different DESeq options (one for each comparison) and analyze them separately. When I try, I get the following error:
Error in checkForExperimentalReplicates(object, modelMatrix) :
The design matrix has the same number of samples and coefficients to fit, so estimation of dispersion is not possible. Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.
As a potential workaround, I was able to create 2 different DESeq objects - one treating day as the condition, the other treating temperature as the condition. That creates some replication (3 Day 2 libraries, 2 all-temperature libraries) and so I analyzed them separately. However, this produced some fairly iffy-looking graphs.
Link to code used to create and analyze the 2 different objects (within "OPTION 2" section): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/DESeq2_script.R
Link to graphs and tables produced: https://github.com/afcoyle/hemat_bairdii_transcriptome/tree/main/graphs/DESeq_Option_2
Since this doesn't look like a particularly enlightening analysis, here are several alternative plans I came up with. All involve rerunning the analysis, but using individual libraries (should I include the pooled libraries as well?)
The main difficulty is that we have no replicates of infected crab at lowered temperatures for each date. We have 1 at Day 0, 1 at Day 2, and 1 at Day 17. We also have no libraries of infected crab at elevated temperatures at Day 17.
All the below numbers are the number of individual libraries only
Option 1: Pool all infected samples purely by temperature, look at effect of temp.
Option 2: Pool all infected samples from Day 0-2 by temp, look at effect of temp
Option 3: Pool all infected samples by date, look at effect of date
Option 4: Keep all samples separate by date/temp combination
Alright, that's a whole bunch of info, so I'll do my best to sum up the questions:
Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.
Haven't had time to digest everything above, but you could try installing v1.20 and treat samples as reps...
I'm thinking about giving it a shot in edgeR instead - they support analysis libraries without replication (although can't find significance) , and I feel like it'd be neat to learn a few different methods of analyzing differential expression!
But I guess the core question (which is probably more for @sr320) is about the end goal of this. If the goal is to find significantly differently-expressed genes, starting over and looking at the individual libraries really seems to me like the only option, assuming there's not something I'm missing. But if the goal is just a descriptive analysis with no significance analysis, maybe I'd be alright with either using v1.20 of DESeq2 or switching over to edgeR with my current libraries.
I suggest the latter. Simply get the end with DESeq2 with a gene list etc - for example just compare 4 libraries Day 0 to 4 libraries Day 17. Point now is get through pipeline. Then we discuss best way forward.
Alright, I believe I finished my first comparison (Day 0 ambient vs. Day 17 ambient)!
Here's a link to my notebook post, which has a link to all my results: https://afcoyle.github.io/2020-12-18-Comparing-Amb-Over-Time/
Link to the table of differentially-expressed genes (pval <= 0.005): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/graphs/day0_day17_ambient/0vs17_DEGlist.txt
Assuming that all looks good, starting on comparing different temperatures. In order to get enough libraries, I'm pooling Day 0 and 2 crabs by temperature treatment, as described in the notebook post from today and yesterday
Link to that notebook post from yesterday: https://afcoyle.github.io/2020-12-17-Indiv-Libraries/
Take that gene list to a functional annotation - enrichment analysis.(ultimate ending). Then lets meet and talk about fuller design.
On Fri, Dec 18, 2020 at 9:53 AM afcoyle notifications@github.com wrote:
Alright, I believe I finished my first comparison (Day 0 ambient vs. Day 17 ambient)!
Here's a link to my notebook post, which has a link to all my results: https://afcoyle.github.io/2020-12-18-Comparing-Amb-Over-Time/
Link to the table of differentially-expressed genes (pval <= 0.005): https://github.com/afcoyle/hemat_bairdii_transcriptome/blob/main/graphs/day0_day17_ambient/0vs17_DEGlist.txt
Assuming that all looks good, starting on comparing different temperatures. In order to get enough libraries, I'm pooling Day 0 and 2 crabs by temperature treatment, as described in the notebook post from today and yesterday
Link to that notebook post from yesterday: https://afcoyle.github.io/2020-12-17-Indiv-Libraries/
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/1010#issuecomment-748229653, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABB4PNYVSK25BHPRTUZN4K3SVOJGHANCNFSM4ST2C2QA .
Starting with raw data, comparing libraries of "infected" crabs under different conditions with the intent of finding all (hemato and crab etc) differentially expressed genes. Using Trinity pipeline.