dib-lab / 2021-paper-metapangenomes

brainstorm/asking for help on baking results section 3: k-mer-based metapangenomics combined with assembly graphs reveal strain dynamics #5

Closed taylorreiter closed 2 years ago

taylorreiter commented 2 years ago

State of the paper

I'm super happy with updates to the first 3/4s of the results based on feedback from @ctb. Main highlights:

But I'm struggling with updating the last section, "k-mer-based metapangenomics combined with assembly graphs reveal strain dynamics".

I like the first two paragraphs:

But then I don't like the next two paragraphs:

Part of the reason I don't like them is that the section doesn't come to a nice, concrete ending. It just sort of fades away and leaves me unsure of what the section was trying to achieve, which isn't a great argument for the general utility of the method.

Brainstorming alternatives for results section 3

I've thought a lot about how this section could be improved, but I haven't come to any conclusions and think I'm at the point where outside perspective/feedback/brainstorming would be super helpful.

I have these cute strain plots: image image image

Which show more clearly the "strain switches" that I was able to sniff out visually from this plot image

But one thing I don't like about the cute strain plots is the "other" section of the plot: it potentially raises a lot of questions I don't have answers for. It contains all of the other organisms found by gather, and they're all different species than the one that's represented in the plot. I don't know if that's spillover from the spacegraphcats queries and the sequence truly belongs to other species in the metagenome, or if it's real strain content for the species that's present and it just doesn't totally match what's in the database.

I dabbled with functional identification, too. I took all the GTDB proteomes that tessa produces for P. merdae, sketched them with protein -p k=10,scaled=100,protein --singleton, indexed them, and ran gather on each of my nbhds. It worked pretty well. This is the fraction of sequences in each neighborhood that were covered by gather:

   sample   sum_f_unique_weighted
 1 HSM7CYYB                 0.940
 2 HSM67VF9                 0.930
 3 HSM7CYY7                 0.887
 4 HSM67VFD                 0.883
 5 HSM7CYY9                 0.880
 6 HSM67VFJ                 0.878
 7 HSM7CYYD                 0.861
 8 HSM6XRQI                 0.767
 9 HSM6XRQB                 0.751
10 HSM6XRQO                 0.746
11 HSM6XRQK                 0.732
12 HSM6XRQM                 0.717
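
A minimal sketch of how a per-sample summary like the one above could be computed, assuming each neighborhood's gather output is a CSV with an `f_unique_weighted` column (sourmash gather writes a column with that name). The sample names, match names, and values in the inline CSV text are made up for illustration and are not the results from this thread.

```python
import csv
import io

def sum_f_unique_weighted(gather_csv_text):
    """Sum the f_unique_weighted column over all matches in one gather CSV."""
    reader = csv.DictReader(io.StringIO(gather_csv_text))
    return sum(float(row["f_unique_weighted"]) for row in reader)

# Hypothetical gather results keyed by sample (toy values, not real data).
toy_results = {
    "sampleA": "name,f_unique_weighted\nprotA,0.50\nprotB,0.25\n",
    "sampleB": "name,f_unique_weighted\nprotA,0.40\n",
}

# One total per sample, printed from highest to lowest coverage.
summary = {s: sum_f_unique_weighted(t) for s, t in toy_results.items()}
for sample, total in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{sample}\t{total:.3f}")
```

With real data, the inline strings would be replaced by the contents of each sample's gather CSV.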

And this is a histogram of the number of annotations (gather database name) shared by number of samples

image

And lastly, I made roary pangenomes for each of these species using GTDB proteomes, but I haven't used them to do anything because it's not clear to me how this is value added above the containment stuff I already did in section 1.

Current goals for the section -- may need to be expanded or changed

Halp

So...with all of this, @ctb and @bluegenes, I would love any feedback/thoughts/hot takes/deep takes etc. that you may have.

taylorreiter commented 2 years ago

I decided to compare the amino acid metapangenomes against the roary pangenomes (using GTDB genomes) and against de novo assembled and binned MAGs from the same metagenomes. The results are not super intuitive, and even once I grokked them I'm not sure they're super promising?

Core

example core (meta)pangenome compare matrix:

Parabacteroides_distasonis:
kaa_core, roary_core, metabat2_core
1.0, 0.6074476338246703, 0.6959022286125089
0.10259433962264151, 1.0, 0.1520488856937455
0.25366876310272535, 0.3281613653995345, 1.0

processed and summarized core (meta)pangenome compare matrices for all six species -- I think this should read compset in set

  compset       set           mean_containment sd_contaiment
1 kaa_core      metabat2_core           0.723          0.166
2 kaa_core      roary_core              0.703          0.201
3 metabat2_core kaa_core                0.0838         0.119
4 metabat2_core roary_core              0.133          0.172
5 roary_core    kaa_core                0.142          0.108
6 roary_core    metabat2_core           0.223          0.170
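
Containment is asymmetric, which is why the matrices and tables here are hard to read at a glance. The sketch below fixes one convention (row contained in column) on toy sets so the asymmetry is easy to see. It is an illustration only: the set names and contents are invented, and it does not settle whether the matrices in this thread follow this orientation or its transpose.

```python
def containment(a, b):
    """Fraction of the elements of `a` that are also in `b`: |a & b| / |a|."""
    if not a:
        return 0.0
    return len(a & b) / len(a)

def containment_matrix(named_sets):
    """Return (names, matrix) where matrix[i][j] is names[i] contained in names[j]."""
    names = list(named_sets)
    matrix = [
        [containment(named_sets[row], named_sets[col]) for col in names]
        for row in names
    ]
    return names, matrix

# Toy sets standing in for sets of k-mer hashes (hypothetical values).
toy = {
    "big": set(range(10)),       # 10 elements
    "small": {0, 1, 2, 3, 20},   # 5 elements, 4 of them shared with "big"
}

names, matrix = containment_matrix(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

Here only 4 of the 10 elements of `big` are in `small`, while 4 of the 5 elements of `small` are in `big`, so a large set that mostly covers a small one shows a low value in one direction and a high value in the other.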

total

example total (meta)pangenome compare matrix:

Parabacteroides_distasonis:
kaa_all, roary_all, metabat2_all
1.0, 0.9674166020170675, 0.8922991842942071
0.046425912137006704, 1.0, 0.08426655606249137
0.48056589724497395, 0.9456943366951125, 1.0

processed and summarized total (meta)pangenome compare matrices for all six species -- I think this should read compset in set

  compset      set          mean_containment sd_contaiment
1 kaa_all      metabat2_all           0.771        0.114
2 kaa_all      roary_all              0.971        0.00947
3 metabat2_all kaa_all                0.435        0.117
4 metabat2_all roary_all              0.875        0.220
5 roary_all    kaa_all                0.0760       0.0686
6 roary_all    metabat2_all           0.116        0.104

taylorreiter commented 2 years ago

upset plot

OK, containment is more intuitive as an upset plot. This one is for Parabacteroides distasonis.

image
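
For reference, an upset plot is built from the sizes of the exclusive intersections between sets: each element is counted once, under the exact combination of sets it belongs to. A minimal sketch of that counting step follows, using invented toy sets whose names only echo the three pangenome types discussed here. It is not the code that produced the plot above.

```python
from collections import Counter

def exclusive_intersections(named_sets):
    """Count elements by the exact combination of sets that contain them."""
    names = sorted(named_sets)
    universe = set().union(*named_sets.values())
    counts = Counter()
    for item in universe:
        members = tuple(n for n in names if item in named_sets[n])
        counts[members] += 1
    return counts

# Toy sets standing in for k-mer hash sets (hypothetical values).
toy = {
    "kaa": {1, 2, 3, 4, 5},
    "roary": {1, 2, 6},
    "metabat2": {1, 3, 7},
}

counts = exclusive_intersections(toy)
for combo, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(" & ".join(combo), n)
```

Each bar in an upset plot corresponds to one key of `counts`, so elements found only in `kaa` are reported separately from elements shared between `kaa` and the other sets.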

results section 3 outline

this matches more with the overall story that's coalescing in my head for this section.

1) here's our data
2) here's what we did to it
3) kaa-mer metapangenomes contain the majority of reference-based or de novo-based sequences, but capture additional sequencing variation not recovered by these other methods

One check I could do that wouldn't be a horrible idea is to map the kaa-mer reads against the reference pangenome and see what's left over, then check how many of the leftovers map against the bins, and then see how many reads are still left over. I guess this would show that the kaa-mer method does the best job of capturing de novo + reference-based pangenome variation. Not sure what I would do with the leftover reads, though. I could do it all again in amino acid space and show that even more reads map, demonstrating that the kaa-mers carry SNPy reads. Again, though, what to do with the leftover pile? Assemble and annotate? IDK.
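
The sequential mapping check described above can be sketched as a cascade in which each stage only sees the reads that the previous stages failed to place. This is a toy illustration, not a pipeline: the `maps` callables are hypothetical stand-ins for running a real read mapper, and the read IDs and hit sets are invented.

```python
def cascade(reads, stages):
    """Pass reads through stages in order; each stage sees only earlier leftovers.

    `stages` is a list of (label, maps) pairs, where maps(read) -> bool is a
    hypothetical stand-in for "this read maps to that target".
    Returns ([(label, number_mapped), ...], leftover_reads).
    """
    remaining = list(reads)
    report = []
    for label, maps in stages:
        mapped, unmapped = [], []
        for read in remaining:
            (mapped if maps(read) else unmapped).append(read)
        report.append((label, len(mapped)))
        remaining = unmapped
    return report, remaining

# Toy data: read IDs and the reads each target would recruit (made up).
reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
reference_hits = {"r1", "r2", "r3"}
bin_hits = {"r3", "r4"}

report, leftover = cascade(
    reads,
    [
        ("reference pangenome", lambda r: r in reference_hits),
        ("MAG bins", lambda r: r in bin_hits),
    ],
)
print(report)
print(leftover)
```

The `leftover` list is the pile that neither target explains. Repeating the cascade with amino-acid-level matching would just mean swapping in different `maps` functions.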

Discussion points

And some discussion points that have come up for me:

taylorreiter commented 2 years ago

closed by https://github.com/taylorreiter/2021-paper-metapangenomes/commit/3cce01dc24c20e080410349b50862e81e2fba840 🎉