cmkobel / CompareM2

🦠📇 Microbial genomes-to-report pipeline
https://CompareM2.readthedocs.io
GNU General Public License v3.0

Reuse checkm2s output in kegg_pathway #65

Closed cmkobel closed 8 months ago

cmkobel commented 11 months ago

I am currently running all samples one by one through rule kegg_diamond to create the table that links genes to KOs. Wouldn't it be smarter to reuse the output from checkm2, given that I'm going to use the same (or better) filtering thresholds anyway? Of course, this means that if checkm2 fails, the pathway analysis fails too. That shouldn't be a big problem, though, since the rule can be made to run both ways while prioritizing one over the other. I think Aviary has a way of accomplishing this.

This should give a significant speedup for the pathway enrichment analysis and make the pipeline fit for larger datasets (thousands of genomes).

cmkobel commented 11 months ago

But the problem is that checkm2 runs everything through diamond/uniref, whereas I'm only interested in the genes called by prodigal/prokka. If I just reuse the checkm2 result I'll get spurious results, because some regions might not be expressed as genes. This is the problem that prodigal solves, because it only calls regions that are putatively expressed. Another problem is that if other annotators are added, like bakta or eggnog, the user should be able to select these for producing the pathway enrichments, so that the gene IDs are congruent between the different analyses.

So my conclusion for this issue so far is that I should implement the speedup instead: concatenate all assemblies together so that the database only has to be read once. On the other hand, a beefy computer with a lot of memory might solve the problem anyway, as it will keep the database in memory.
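A minimal sketch of how the concatenation idea could work, assuming the called genes are available as one FASTA file per sample and that the search tool writes tabular output with the query ID in the first column. The function names, the `|` separator and the sample names are hypothetical and not taken from the pipeline; the search step itself is left out.

```python
from collections import defaultdict

SEP = "|"  # assumed separator; it must not occur in any sample name


def concatenate_fastas(sample_to_fasta, out_path, sep=SEP):
    """Write all records into one FASTA, prefixing each header with its sample name."""
    with open(out_path, "w") as out:
        for sample, fasta in sample_to_fasta.items():
            if sep in sample:
                raise ValueError(f"sample name {sample!r} contains the separator {sep!r}")
            with open(fasta) as handle:
                for line in handle:
                    # Guard against files that lack a trailing newline.
                    if not line.endswith("\n"):
                        line += "\n"
                    if line.startswith(">"):
                        out.write(f">{sample}{sep}{line[1:]}")
                    else:
                        out.write(line)


def split_hits_by_sample(tsv_lines, sep=SEP):
    """Group tabular hits (query ID in column 1) back into per-sample lists."""
    hits = defaultdict(list)
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query, _, rest = line.partition("\t")
        # Undo the prefix added by concatenate_fastas.
        sample, gene_id = query.split(sep, 1)
        hits[sample].append(f"{gene_id}\t{rest}")
    return dict(hits)
```

Because the sample name travels inside the query ID, the single combined result can be split back into per-sample tables afterwards, and the gene IDs stay the same as those from the annotator.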

cmkobel commented 10 months ago

Idea: I think the GSEA (gene set enrichment analysis) should show, for each sample, what is unique to that sample. That would be helpful.
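As a rough illustration of the "unique to each sample" part only (not the enrichment analysis itself), assuming each sample has already been reduced to a set of KO identifiers. The function name and the KO values in the usage example are made up.

```python
from collections import Counter


def unique_kos(sample_to_kos):
    """Return, per sample, the KOs that occur in no other sample."""
    # Count in how many samples each KO appears (once per sample at most).
    counts = Counter(ko for kos in sample_to_kos.values() for ko in set(kos))
    return {
        sample: {ko for ko in kos if counts[ko] == 1}
        for sample, kos in sample_to_kos.items()
    }


example = {
    "sample_a": {"K00001", "K00002"},
    "sample_b": {"K00002", "K00003"},
    "sample_c": {"K00002"},
}
print(unique_kos(example))
```

Here `sample_a` keeps only `K00001`, `sample_b` keeps only `K00003`, and `sample_c` ends up with an empty set, since its single KO is shared.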

cmkobel commented 8 months ago

I think reusing the checkm2 output is a bad idea, since the annotations then won't follow the genes called by prokka.