brainstorm/asking for help on baking results section 3: k-mer-based metapangenomics combined with assembly graphs reveal strain dynamics

taylorreiter commented 2 years ago

State of the paper

I'm super happy with updates to the first 3/4s of the results based on feedback from @ctb. Main highlights:

introduction to results section/summary of what we did
figure summarizing what we did
inclusion of k = 31 dna for pangenome estimation (nice illustration of how it's worse, so that's good)
better justification for using protein k = 10
containment of core sequences between pangenome methods

But I'm struggling with updating the last section, "k-mer-based metapangenomics combined with assembly graphs reveal strain dynamics".

I like the first two paragraphs:

hey these methods work! Let's combine them with this one time series data set
we'll look at these 6 species that are present at ~all of the time points, using assembly graph queries to get the reads, orpheum to predict open reading frames, and amino acid pangenomes

But then I don't like the next two paragraphs:

We can estimate core/shell/cloud
We find some patterns that are interesting, around abx administration
Hey look, strain switching in two genomes!

Part of the reason I don't like it is that it doesn't come to a nice, concrete ending; it just sort of fades away. That leaves me unsure of what the section was trying to achieve, which isn't a great argument for the general utility of the method.

Brainstorming alternatives for results section 3

I've thought a lot about how this section could be improved, but I haven't come to any conclusions and think I'm at the point where outside perspective/feedback/brainstorming would be super helpful.

I have these cute strain plots:

Which show more clearly the "strain switches" that I was able to sniff out visually from this plot

But one thing I don't like about the cute strain plots is the "other" section of the plot -- it potentially raises a lot of questions I don't have answers for. It contains all of the other organisms found by gather, and they're all different species than the one that's represented in the plot. I don't know if that's spill-over from spacegraphcats queries that truly belongs to other species in the metagenome, or if it's real strain content for the plotted species that just doesn't totally match what's in the database.

I dabbled with functional identification, too. I took all the GTDB proteomes that tessa produces for P. merdae, sketched them with protein -p k=10,scaled=100,protein --singleton, indexed them, and ran gather on each of my nbhds. It worked pretty well...this is the fraction of sequences in the neighborhood that were covered by gather:

   sample   sum_f_unique_weighted
 1 HSM7CYYB                 0.940
 2 HSM67VF9                 0.930
 3 HSM7CYY7                 0.887
 4 HSM67VFD                 0.883
 5 HSM7CYY9                 0.880
 6 HSM67VFJ                 0.878
 7 HSM7CYYD                 0.861
 8 HSM6XRQI                 0.767
 9 HSM6XRQB                 0.751
10 HSM6XRQO                 0.746
11 HSM6XRQK                 0.732
12 HSM6XRQM                 0.717
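
A minimal sketch of how a per-sample number like the ones above could be tallied, assuming one gather CSV per neighborhood with an `f_unique_weighted` column. The two-row CSV text and the proteome names are made up for illustration and are not real results.

```python
import csv
import io


def sum_f_unique_weighted(gather_csv):
    """Sum the f_unique_weighted column of one gather CSV (a file-like object)."""
    reader = csv.DictReader(gather_csv)
    return sum(float(row["f_unique_weighted"]) for row in reader)


# Made-up gather result with two matches (hypothetical names and values).
example = io.StringIO(
    "name,f_unique_weighted\n"
    "proteome_A,0.50\n"
    "proteome_B,0.25\n"
)
total = sum_f_unique_weighted(example)
print(f"{total:.3f}")
```

In practice the file-like object would be an opened gather output file for one sample, and the same function would be run once per neighborhood.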

And this is a histogram of the number of annotations (gather database name) shared by number of samples

And lastly, I made roary pangenomes for each of these species using GTDB proteomes, but I haven't used them to do anything because it's not clear to me how this is value added above the containment stuff I already did in section 1.

Current goals for the section -- may need to be expanded or changed

demonstrate that we can combine all of these things and use them to do something nifty
illustrate community dynamics in response to antibiotic treatment
highlight detection of strain switching

Halp

So...with all of this, @ctb and @bluegenes, I would love any feedback/thoughts/hot takes/deep takes etc. that you may have.

taylorreiter commented 2 years ago

I decided to compare the amino acid metapangenomes against the roary pangenomes (using GTDB genomes) and against de novo assembled and binned MAGs from the same metagenomes. The results are not super intuitive, and even once I grok'd them, I'm not sure they're super promising?

Core

example core (meta)pangenome compare matrix:

Parabacteroides_distasonis:
kaa_core, roary_core, metabat2_core
1.0, 0.6074476338246703, 0.6959022286125089
0.10259433962264151, 1.0, 0.1520488856937455
0.25366876310272535, 0.3281613653995345, 1.0
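
Containment is asymmetric, which is why the matrix above is not symmetric. A minimal sketch on toy sets, assuming containment of A in B is |A ∩ B| / |A| and that each row is the query set and each column the set it is compared against (the real compare output may be oriented the other way). The set names and contents are hypothetical.

```python
def containment(a, b):
    """Fraction of the items in set `a` that are also present in set `b`."""
    return len(a & b) / len(a) if a else 0.0


def containment_matrix(named_sets):
    """Pairwise containment; row = query set, column = set it is compared against."""
    names = list(named_sets)
    matrix = [
        [containment(named_sets[row], named_sets[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy k-mer sets (hypothetical): a small set fully inside a larger one.
toy = {"small": {0, 1, 2}, "big": set(range(10))}
names, matrix = containment_matrix(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

Here `small` is fully contained in `big` (1.0), while only 3 of the 10 items in `big` are found in `small` (0.3). This is the kind of lopsided pair that shows up when one pangenome is much larger than another.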

processed and summarized core (meta)pangenome compare matrices for all six species -- I think this should read compset in set

  compset       set           mean_containment sd_contaiment
1 kaa_core      metabat2_core           0.723          0.166
2 kaa_core      roary_core              0.703          0.201
3 metabat2_core kaa_core                0.0838         0.119
4 metabat2_core roary_core              0.133          0.172
5 roary_core    kaa_core                0.142          0.108
6 roary_core    metabat2_core           0.223          0.170

total

example total (meta)pangenome compare matrix:

Parabacteroides_distasonis:
kaa_all, roary_all, metabat2_all
1.0, 0.9674166020170675, 0.8922991842942071
0.046425912137006704, 1.0, 0.08426655606249137
0.48056589724497395, 0.9456943366951125, 1.0

processed and summarized total (meta)pangenome compare matrices for all six species -- I think this should read compset in set

  compset      set          mean_containment sd_contaiment
1 kaa_all      metabat2_all           0.771        0.114
2 kaa_all      roary_all              0.971        0.00947
3 metabat2_all kaa_all                0.435        0.117
4 metabat2_all roary_all              0.875        0.220
5 roary_all    kaa_all                0.0760       0.0686
6 roary_all    metabat2_all           0.116        0.104

taylorreiter commented 2 years ago

upset plot

Ok, containment is more intuitive as an upset plot. This is for Parabacteroides distasonis.
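
For reference, the bars of an upset plot are counts of items by the exact combination of sets they belong to. A minimal sketch of that tally on toy sets; the set contents are hypothetical, and only the three set names are taken from the compare matrices above.

```python
from collections import Counter


def upset_counts(named_sets):
    """Count items by the exact combination of sets that contain them."""
    counts = Counter()
    for item in set().union(*named_sets.values()):
        members = tuple(sorted(name for name, s in named_sets.items() if item in s))
        counts[members] += 1
    return counts


# Toy sets (hypothetical contents).
toy = {
    "kaa": {1, 2, 3, 4, 5},
    "roary": {1, 2, 6},
    "metabat2": {1, 3},
}
counts = upset_counts(toy)
for combo, n in sorted(counts.items()):
    print(" & ".join(combo), n)
```

Each key is an exclusive intersection (for example, items found only in `kaa`), which is what an upset plot draws as one bar.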

results section 3 outline

this matches more with the overall story that's coalescing in my head for this section.

1) here's our data
2) here's what we did to it
3) kaa-mer metapangenomes contain the majority of reference-based or de novo-based sequences, but capture additional sequencing variation not recovered by these other methods
   maybe somehow highlight the samples where our method pulled out sequences when binning failed?
4) using our method, we see dynamics in the presence of species and strains in response to antibiotic administration
   species ex: disturbance succession by E. bolteae
   strain exes: p dist, b uni, p vulg
   maybe somehow show that the bins alone or the reference pangenome alone miss some of this stuff (I don't know that they do for sure, still need to test this)

One check I could do that wouldn't be a horrible idea is to map the kaa-mer reads against the reference pangenome and see what's left over. Then check how many of the leftovers map against the bins, and see how many reads are still left over after that. I guess this would show that the kaa-mer method does the best job of capturing de novo + reference-based pangenome variation. Not sure what I would do with the leftover reads though? I could do it all again in amino acid space and show that even more reads map, demonstrating that the kaa-mers carry SNPy reads. Again, what to do with the leftover pile though. Assemble and annotate? IDK.
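
A minimal sketch of the bookkeeping for that check, assuming the read IDs that map at each step have already been collected into sets (for example, from alignment output). The read IDs and mapped sets below are made up, and the mapping itself is not shown.

```python
def mapping_cascade(read_ids, mapped_by_step):
    """Remove reads that map at each step in order; report hits and leftovers."""
    leftover = set(read_ids)
    report = []
    for step_name, mapped in mapped_by_step:
        hits = leftover & mapped      # reads newly explained at this step
        leftover = leftover - mapped  # reads still unexplained afterwards
        report.append((step_name, len(hits), len(leftover)))
    return report, leftover


# Hypothetical read IDs and mapping results.
reads = {"r1", "r2", "r3", "r4", "r5", "r6"}
steps = [
    ("reference pangenome", {"r1", "r2", "r3"}),
    ("MAG bins", {"r3", "r4"}),
]
report, leftover = mapping_cascade(reads, steps)
for step_name, n_hits, n_left in report:
    print(f"{step_name}: {n_hits} mapped, {n_left} left over")
print(sorted(leftover))
```

Because reads are removed in order, a read that maps to both the reference pangenome and the bins is only counted at the first step, so the per-step numbers add up to the total explained.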

Discussion points

And some discussion points that have come up for me:

next steps: can we use kaa-mer abundance information to guess whether there are one or more strains present at a given time/in a given sample?
even using k = 51, in the strain plots, we see a substantial fraction of sequences falling in the "other" category. most of this stuff is of the same genus as the species that is plotted, but it's not clear what that stuff is. It could be HGT, or it could be that we capture things beyond the 95% similarity threshold. so a next step here is to determine the organization of assembly graphs and understand what is really being returned

taylorreiter commented 2 years ago

closed by https://github.com/taylorreiter/2021-paper-metapangenomes/commit/3cce01dc24c20e080410349b50862e81e2fba840 🎉

dib-lab / 2021-paper-metapangenomes