Arcadia-Science / 2023-amblyomma-americanum-txome-assembly

MIT License

Figuring out assembly groups & transcriptome merging #7

Open taylorreiter opened 10 months ago

taylorreiter commented 10 months ago

Right now, assembly groups are a user-input parameter.

Background

This is similar to the co-assembly problem in metagenomics. Anecdotally, co-assembly is especially popular for time series samples.

Initial trial

I designated assembly groups as: read type (pe or se) + tick origin state + time + tissue + sex + treatment

   assembly_group                   n
 1 peaae1cellline                  24
 2 peaae2cellline                  28
 3 peok0sgfemale                    1
 4 peok12sgfemale                   1
 5 peok168sgfemale                  1
 6 peok72sgfemale                   1
 7 petx0wholefemale                13
 8 petx0wholemale                   1
 9 petx120midgutfemale              1
10 petx120sgfemale                  1
11 petx12wholefemaleecoli           3
12 petx24wholefemale                1
13 petx24wholefemaleecoli           3
14 petx3wholefemaleecoli            3
15 petx48midgutfemale               1
16 petx48sgfemale                   1
17 petx6wholefemaleecoli            3
18 petx96midgutfemale               1
19 petx96sgfemale                   1
20 petx96wholefemale                1
21 petxwholefemale                  1
22 petxwholemale                    3
23 seok168wholefemaleinfected       1
24 seok168wholefemaleunexposed      1
25 seok168wholefemaleuninfected     1
26 seok168wholemaleinfected         1
27 seok168wholemaleunexposed        1
28 seok168wholemaleuninfected       1

Only showing rnaSPAdes results so it's easier to see: [image]

The assemblies finish fine, but TransRate (in the OrthoFuser step) and EvidentialGene both fail, I think because ~56 assemblies (28 assembly groups x 2 assemblers) is just too many. This means I have to reduce the number of assemblies we're dealing with before deduplicating.

Next steps

I'm thinking of three options:

  1. Reduce the assembly groups to read type (pe or se) + tissue + sex. This would give:
      assembly_group          n
    1 peaaecellline          52
    2 pemidgutfemale          3
    3 pesgfemale              7
    4 pewholefemale          28
    5 pewholemale             4
    6 sewhole                 6
  2. Co-assemble based on study. This would give:
    study_title                                                                          n
    1 Amblyomma americanum RNA-seq following E. coli treatment                            24
    2 Amblyomma americanum Raw sequence reads                                              6
    3 Amblyomma americanum adult female salivary gland transcriptome                       4
    4 Amblyomma americanum strain:Stillwater Oklahoma Transcriptome or Gene expression    14
    5 Arthropod Cell Line RNA Seq                                                         52
  3. Just co-assemble everything (the results will still need to be merged because we'll have both Trinity and rnaSPAdes assemblies). This is ~mostly the ORP and eelpond/elvers approach.

I think all three of these are reasonable solutions. I would like to run them all and compare, but Trinity takes FOREVER to run, so this might be a future investigation. For now, I think I'm going to go with option 1.
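The initial scheme and option 1 differ only in which metadata fields are concatenated into the group label. Here is a minimal sketch of that idea. The field names (`read_type`, `origin`, and so on), the sample rows, and the helper functions are hypothetical and are not taken from the repository's metadata or pipeline code.

```python
from collections import Counter

# Hypothetical sample metadata; field names and values are illustrative only.
samples = [
    {"read_type": "pe", "origin": "tx", "time": "12", "tissue": "whole", "sex": "female", "treatment": "ecoli"},
    {"read_type": "pe", "origin": "tx", "time": "24", "tissue": "whole", "sex": "female", "treatment": "ecoli"},
    {"read_type": "pe", "origin": "tx", "time": "96", "tissue": "sg", "sex": "female", "treatment": ""},
    {"read_type": "pe", "origin": "ok", "time": "72", "tissue": "sg", "sex": "female", "treatment": ""},
]

# Assumed field lists: the fine-grained initial scheme and the coarser option 1.
FINE = ["read_type", "origin", "time", "tissue", "sex", "treatment"]
COARSE = ["read_type", "tissue", "sex"]


def assembly_group(sample, fields):
    """Concatenate the chosen metadata fields into one group label."""
    return "".join(sample[field] for field in fields)


def count_groups(samples, fields):
    """Count how many samples fall into each assembly group."""
    return Counter(assembly_group(sample, fields) for sample in samples)


fine = count_groups(samples, FINE)
coarse = count_groups(samples, COARSE)
print(len(fine), "groups with the fine scheme:", dict(fine))
print(len(coarse), "groups with the coarse scheme:", dict(coarse))
```

Dropping fields from the label is what collapses many single-sample groups into a handful of larger co-assemblies, at the cost of mixing time points and treatments within a group.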

taylorreiter commented 10 months ago

Merging all of the non-isoseq transcriptomes together and clustering them at 100% identity with cd-hit-est -i merged.fa -o merged_cdhit.fa -c 1 -T 6 -M 12000, we see a 19% reduction in the number of transcripts. This suggests that there is overlap in the content of some of these transcriptomes. Some of this is to be expected, as each assembly group is assembled twice (once with rnaSPAdes and once with Trinity), but I'm including it here just to record another data point.

$ grep ">" merged.fa | wc -l
5985783
$ grep ">" merged_cdhit.fa | wc -l
4839615
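For the record, the 19% figure follows directly from the two counts above. A small sketch of the arithmetic is below; the function names are my own and are not part of the pipeline, and the header counter is only an approximate stand-in for the grep commands.

```python
def count_fasta_records(path):
    """Count sequences by counting header lines that start with '>'.

    Note: grep ">" matches '>' anywhere in a line, so the two counts only
    agree when '>' never appears outside the start of a header line.
    """
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def percent_reduction(before, after):
    """Percentage of records removed going from `before` to `after`."""
    return 100 * (before - after) / before


# Counts reported by the grep commands above.
print(round(percent_reduction(5985783, 4839615), 1))  # prints 19.1
```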
taylorreiter commented 10 months ago

Just realized it would be pretty fast to trial with just rnaSPAdes (since it actually runs quickly and doesn't bloat your whole hard drive, unlike Trinity), so I'm going to do some light testing with assembly groups using that and then move forward. Will update here with results as I get them.

taylorreiter commented 10 months ago

Some transrate updates:

These facts lead me to:

I'm currently running BUSCO on the cd-hit-clustered (1.0) merged assemblies to see which one I should move forward with: