Figuring out assembly groups & transcriptome merging

taylorreiter commented 10 months ago

Right now, assembly groups are a user-input parameter.

Background

Deciding which samples should be assembled together is difficult. Co-assembly of multiple samples can improve recovery of assembled transcripts when the data set is low coverage. However, local variation in the sequencing data can also confuse the assembly graph, interrupting assembly and producing many fragmented contigs. Local variation arises from genome variation as well as splicing/isoform variation. Genome variation arises when different individuals are sequenced, either within a single pooled RNA-seq sample or across different samples. It comes from the SNPs, indels, and rearrangements that distinguish each individual genome. Individuals from heterogeneous populations will likely differ more than individuals from homogeneous ones, but the level of population variation isn’t always known. Isoform variation arises from alternative splicing. While multiple isoforms can be expressed in a single cell or tissue type, it is more likely that a given tissue will only express a single isoform at a given time (e.g. at time of sampling). All of this together means that the ideal assembly group will achieve sufficient coverage without having too much variation. Below we list some guidelines to consider when specifying assembly groups:
- Separate single end from paired end reads. There’s not a great way to do combined SE & PE assembly: https://github.com/trinityrnaseq/trinityrnaseq/wiki/How-do-I-combine-reads%3F#how-do-i-combine-paired-end-and-single-end-reads
- If there is sexual dimorphism in expression, consider separating males and females.
- Separate different tissue types, pathogen challenges, time points, or other “groupings” that arise in things like differential expression experiments.
- Separate different individuals or populations.
- Pay attention to sample coverage. When sample coverage gets too low, think about how to strategically collapse groups. While there isn’t a perfect rule of thumb for sufficient coverage (it will at minimum be a function of genome size, ploidy (haplo/diploid/…), and the number of transcripts expressed), dipping much below 30 million reads will probably start to give worse results (this is purely a feeling and hasn’t been tested at all, but I have been doing weird transcriptomics for ~7 years). You could try running something like Nonpareil on quality- and k-mer-trimmed reads if you’re unsure what is best for your system.

This is similar to the co-assembly problem in metagenomics. Anecdotally, co-assembly is especially popular for time series samples.
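
As a rough illustration of the coverage guideline above, here is a minimal sketch that flags assembly groups whose total read count falls below a threshold. The group names and read counts are hypothetical, and applying the ~30 million read figure to a group total (rather than per sample) is an assumption; the threshold itself is the untested rule of thumb from the list above.

```python
# Rough, untested threshold taken from the guideline above.
MIN_READS = 30_000_000


def flag_low_coverage(reads_per_group, min_reads=MIN_READS):
    """Return the names of groups with fewer than min_reads reads, sorted."""
    return sorted(
        group for group, n_reads in reads_per_group.items() if n_reads < min_reads
    )


# Hypothetical read totals per assembly group (illustrative values only).
example = {
    "group_a": 45_000_000,
    "group_b": 12_000_000,
    "group_c": 29_500_000,
}

# Groups listed here are candidates for collapsing into a larger group.
print(flag_low_coverage(example))
```

Groups that get flagged are only candidates for collapsing; whether to merge them still depends on how much variation the merged group would contain.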

Initial trial

Designated assembly groups: read type (pe or se) + tick origin state + time + tissue + sex + treatment

   assembly_group                   n
 1 peaae1cellline                  24
 2 peaae2cellline                  28
 3 peok0sgfemale                    1
 4 peok12sgfemale                   1
 5 peok168sgfemale                  1
 6 peok72sgfemale                   1
 7 petx0wholefemale                13
 8 petx0wholemale                   1
 9 petx120midgutfemale              1
10 petx120sgfemale                  1
11 petx12wholefemaleecoli           3
12 petx24wholefemale                1
13 petx24wholefemaleecoli           3
14 petx3wholefemaleecoli            3
15 petx48midgutfemale               1
16 petx48sgfemale                   1
17 petx6wholefemaleecoli            3
18 petx96midgutfemale               1
19 petx96sgfemale                   1
20 petx96wholefemale                1
21 petxwholefemale                  1
22 petxwholemale                    3
23 seok168wholefemaleinfected       1
24 seok168wholefemaleunexposed      1
25 seok168wholefemaleuninfected     1
26 seok168wholemaleinfected         1
27 seok168wholemaleunexposed        1
28 seok168wholemaleuninfected       1
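
The labels in this table look like simple concatenations of metadata fields. A minimal sketch of how such labels could be built and counted is below. The field names (`read_type`, `origin`, `time`, `tissue`, `sex`, `treatment`) and the sample records are hypothetical stand-ins, not the actual metadata table used here.

```python
from collections import Counter

# Hypothetical metadata field names, in the order they are concatenated.
FIELDS = ("read_type", "origin", "time", "tissue", "sex", "treatment")


def assembly_group(sample, fields=FIELDS):
    """Concatenate the chosen metadata fields, skipping missing or empty ones."""
    return "".join(str(sample.get(field) or "") for field in fields)


# Illustrative records only; real values would come from the sample metadata.
samples = [
    {"read_type": "pe", "origin": "tx", "time": "0", "tissue": "whole", "sex": "female"},
    {"read_type": "pe", "origin": "tx", "time": "0", "tissue": "whole", "sex": "female"},
    {"read_type": "pe", "origin": "tx", "time": "12", "tissue": "whole",
     "sex": "female", "treatment": "ecoli"},
]

# Number of samples per assembly group.
counts = Counter(assembly_group(s) for s in samples)
for group, n in sorted(counts.items()):
    print(group, n)

# Using fewer fields collapses samples into fewer, larger groups.
coarse = Counter(assembly_group(s, ("read_type", "tissue", "sex")) for s in samples)
print(dict(coarse))
```

Passing a shorter field list is one way to trade group specificity for coverage without editing labels by hand.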

Only showing rnaSPAdes results so it's easier to see.

The assemblies finish fine, but transrate (in the orthofuser step) and EvidentialGene both fail, I think because there are ~56 assemblies (28 assembly groups, each assembled with both rnaSPAdes and Trinity) and that's just too many. This means I have to reduce the number of assemblies we're dealing with in order to deduplicate.

Next steps

I'm thinking of three options:

1. Reduce the assembly groups to read type (pe or se) + tissue + sex. This would give:

  assembly_group          n
1 peaaecellline          52
2 pemidgutfemale          3
3 pesgfemale              7
4 pewholefemale          28
5 pewholemale             4
6 sewhole                 6

2. Co-assemble based on study. This would give:

  study_title                                                                          n
1 Amblyomma americanum RNA-seq following E. coli treatment                            24
2 Amblyomma americanum Raw sequence reads                                              6
3 Amblyomma americanum adult female salivary gland transcriptome                       4
4 Amblyomma americanum strain:Stillwater Oklahoma Transcriptome or Gene expression    14
5 Arthropod Cell Line RNA Seq                                                         52

3. Just co-assemble everything (the results will still need to be merged because we'll have both Trinity and rnaSPAdes assemblies). This is mostly the ORP and eelpond/elvers approach.

I think all three of these are reasonable solutions. I would like to run them all and compare, but Trinity takes FOREVER to run, so this might be a future investigation. For now, I think I'm going to go with option 1.

taylorreiter commented 10 months ago

Merging all of the non-isoseq transcriptomes together and clustering them with cd-hit-est -i merged.fa -o merged_cdhit.fa -c 1 -T 6 -M 12000, we see a 19% reduction in the number of transcripts. This suggests that there is overlap in the content of some of these transcriptomes. Some of this is to be expected, as each assembly group is assembled twice (once with rnaSPAdes and once with Trinity), but I'm including it here to record another data point.

$ grep ">" merged.fa | wc -l
5985783
$ grep ">" merged_cdhit.fa | wc -l
4839615
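
For reference, the 19% figure follows directly from the two counts above. A minimal sketch of the same calculation is below. It counts FASTA header lines that start with ">" (slightly stricter than grep ">", which matches ">" anywhere on a line); the function names are my own.

```python
def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def percent_reduction(before, after):
    """Percentage of records removed going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Counts reported by the grep commands above.
n_merged = 5_985_783
n_clustered = 4_839_615

print(round(percent_reduction(n_merged, n_clustered), 1))  # 19.1
```

With the files on disk, the counts can be obtained with count_fasta_records("merged.fa") and count_fasta_records("merged_cdhit.fa") instead of being typed in.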

taylorreiter commented 10 months ago

Just realized it would be pretty fast to trial with just rnaSPAdes (since this actually runs quickly and doesn't bloat your whole hard drive, unlike Trinity), so I'm going to do some light testing with assembly groups using that and then move forward. Will update here with results as I get them.

taylorreiter commented 10 months ago

Some transrate updates:

- I can't run transrate on the "merged" assemblies: the genome index is too big, and it fails.
  - This is true even if I first run cd-hit or mmseqs (with an identity threshold of 1 or 0.97).
- When I run transrate on individual assemblies, I can't run it with the merged diginormed reads. The BAM file gets too big, and it fails.
- I need to investigate whether transrate works on single-end reads. From the paper it seems like it doesn't, which makes me think that we shouldn't bother with single-end experiments at all (if we can't merge those assemblies and they're going to be more fragmented anyway).
- transrate DID successfully run on a transcriptome with ~900k transcripts, which I'm guessing is getting close to its limit.

These facts lead me to conclude:

- There's probably no point in doing one massive co-assembly when there are dozens of input transcriptomes that are quite diverse.

I'm currently running BUSCO on the merged assemblies, clustered with cd-hit at an identity of 1.0, to see which one I should move forward with:

- the larger assembly groups, assembled only with rnaSPAdes (Trinity will take too long at this point)
- the smaller assembly groups, assembled with both rnaSPAdes and Trinity

Arcadia-Science / 2023-amblyomma-americanum-txome-assembly