Open taylorreiter opened 10 months ago
Merging all of the non-isoseq transcriptomes together and clustering them at 100% identity with
$ cd-hit-est -i merged.fa -o merged_cdhit.fa -c 1 -T 6 -M 12000
gives a 19% reduction in the number of transcripts (counts below). This suggests that some of these transcriptomes overlap in content. Some overlap is expected, since each assembly group is assembled twice (once with rnaSPAdes and once with Trinity), but I'm including this to record another data point.
$ grep ">" merged.fa | wc -l
5985783
$ grep ">" merged_cdhit.fa | wc -l
4839615
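For reference, the 19% figure follows directly from the two counts above. This is a minimal sketch of that arithmetic, not part of the pipeline; the function names are hypothetical. Note that it counts only lines that start with ">", which is slightly stricter than the grep above (that matches ">" anywhere in a line).

```python
def count_headers(lines):
    """Count FASTA records: one per line that starts with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_records(path):
    """Count records in a FASTA file on disk (hypothetical helper)."""
    with open(path) as handle:
        return count_headers(handle)


def percent_reduction(before, after):
    """Percentage of records removed by clustering."""
    return 100.0 * (before - after) / before


# Counts reported above for merged.fa and merged_cdhit.fa.
before, after = 5985783, 4839615
removed = before - after
print(f"{removed} transcripts removed "
      f"({percent_reduction(before, after):.1f}% reduction)")
```

To recount from the files themselves, something like `percent_reduction(count_fasta_records("merged.fa"), count_fasta_records("merged_cdhit.fa"))` should give the same number.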
Just realized it would be pretty fast to trial with just rnaSPAdes (since this actually runs quickly and doesn't bloat your whole hard drive, unlike Trinity), so I'm going to do some light testing with assembly groups using that and then move forward. Will update here with results as I get them.
Some transrate updates:
These facts lead me to:
I'm currently running BUSCO on the cd-hit-clustered (-c 1.0) merged assemblies to see which one I should move forward with:
Right now, assembly groups are a user-input parameter.
Background
This is similar to the co-assembly problem in metagenomics. Anecdotally, co-assembly is especially popular for time series samples.
Initial trial
designated assembly groups:
read type (pe or se) + tick origin state + time + tissue + sex + treatment
Only showing rnaSPAdes results so it's easier to see
The assemblies finish fine, but transrate (in the orthofuser step) and evidential gene both fail, I think because there are ~56 assemblies and that's just too many. This means I have to reduce the number of assemblies we're dealing with before deduplicating.
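To make the grouping concrete, here is a rough sketch of how the number of assembly groups depends on which metadata fields go into the group key. The sample sheet, field names, and values below are all made up for illustration; this is not the pipeline's actual code or metadata.

```python
from collections import defaultdict

# Hypothetical sample sheet: every name and value here is invented.
SAMPLES = [
    {"sample": "s1", "read_type": "pe", "state": "TX", "time": "24h",
     "tissue": "midgut", "sex": "female", "treatment": "fed"},
    {"sample": "s2", "read_type": "pe", "state": "TX", "time": "48h",
     "tissue": "midgut", "sex": "female", "treatment": "fed"},
    {"sample": "s3", "read_type": "pe", "state": "OK", "time": "24h",
     "tissue": "salivary", "sex": "female", "treatment": "unfed"},
    {"sample": "s4", "read_type": "se", "state": "OK", "time": "24h",
     "tissue": "salivary", "sex": "male", "treatment": "unfed"},
]

# Key used in the initial trial, and a reduced key with fewer fields.
FULL_KEY = ("read_type", "state", "time", "tissue", "sex", "treatment")
REDUCED_KEY = ("read_type", "tissue", "sex")


def assembly_groups(samples, fields):
    """Map each group name (the joined field values) to its sample IDs."""
    groups = defaultdict(list)
    for sample in samples:
        key = "_".join(sample[field] for field in fields)
        groups[key].append(sample["sample"])
    return dict(groups)


full = assembly_groups(SAMPLES, FULL_KEY)
reduced = assembly_groups(SAMPLES, REDUCED_KEY)
print(len(full), "groups with the full key")
print(len(reduced), "groups with the reduced key")
```

Dropping fields from the key can only merge groups, never split them, so a shorter key means fewer (and larger) assemblies to deduplicate. Since each group is assembled twice, the number of assemblies is twice the number of groups.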
Next steps
I'm thinking of three options:
read type (pe or se) + tissue + sex
This would give:
I think all three of these are reasonable solutions. I would like to run them all and compare, but Trinity takes FOREVER to run, so this might be a future investigation. For now, I think I'm going to go with option 1.