NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

Pathogen-only projects #84

Closed ZoePochon closed 1 year ago

ZoePochon commented 2 years ago

Sometimes we just want to look for potential pathogens in side projects where there is no specific interest in the microbiome, or where demographics is more of a focus than microbes. In these cases it would be more efficient if we could create a list of the roughly 100 most interesting pathogens and run MALT only on the individuals of interest, to reduce core-hour and storage usage and make the pipeline faster.

clami66 commented 2 years ago

This could be done by making a smaller KrakenUniq database, but then you would need to make a database for each project.

Or, since the KrakenUniq step is not the most time-consuming, we could add an optional input to filter_krakenuniq so that taxids are kept only if they are also included in a user-defined list. This way, you would always run KrakenUniq on a full microbial database, but you would only run MALT on the microbes of interest.
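
A minimal sketch of that optional filter, only to illustrate the idea. The function name `keep_listed_taxids`, its arguments, and the example taxids are hypothetical; this is not the actual filter_krakenuniq code, and it assumes the detected taxids are already available as plain integers.

```python
from typing import Iterable, List, Optional


def keep_listed_taxids(
    detected: Iterable[int],
    allowed: Optional[Iterable[int]] = None,
) -> List[int]:
    """Return detected taxids, restricted to `allowed` when a list is given.

    Hypothetical helper: with allowed=None the filter is a no-op, so the
    default behaviour of the pipeline would stay unchanged.
    """
    detected = list(detected)
    if allowed is None:
        return detected
    allowed_set = set(allowed)
    # Preserve the original order of the detected taxids.
    return [taxid for taxid in detected if taxid in allowed_set]


# Example taxids are placeholders for whatever passed the KrakenUniq filter.
detected_taxids = [632, 1280, 562]
user_list = [632, 1773]
filtered = keep_listed_taxids(detected_taxids, user_list)
```

Making the list optional means that projects without a user-defined list would run exactly as before.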

@LeandroRitter @percyfal what do you think?

ZoePochon commented 2 years ago

So I think it is essential to keep the full database for the prescreening part with KrakenUniq to avoid false positives. The MALT step is the heaviest, and I thought we could filter for pathogens only at this step, using a large list or something similar, and generate only a smaller custom MALT database.

NikolayOskolkov commented 2 years ago

@ZoePochon I believe you need HOPS for this purpose :) We have a Bowtie2 index for PathoGenome. I could build a MALT database from the PathoGenome FASTA file. I would not "detect" pathogens in side-project samples based on such a MALT DB, but would use it only as a validation step.

ZoePochon commented 2 years ago

I am not sure we mean the same thing. I would like to use the same method as the pipeline uses: first detection with KrakenUniq, but then authentication only for the pathogens if I don't care about other microbes. Basically, I am just trying to save core-hours here.

NikolayOskolkov commented 2 years ago

OK @ZoePochon, I see, I misunderstood you. I would still not use such a MALT DB built on pathogens, because there is a risk that reads originating from other, non-pathogenic species will be forced to map to the pathogenic reference genomes. It would actually be interesting to demonstrate the inflation of "pathogenic" counts when using a MALT DB built on pathogens only. We show this indirectly in Supplementary Figure 3 of our manuscript (when mapping reads to Y. pestis alone), but in addition one could build two MALT DBs, one with all detected species and the other with pathogens only, and demonstrate that the pathogenic counts become inflated in the latter case. I think I will do it when I have time. Pathogen-enriched databases are highly biased and should not be used, in my opinion.

ZoePochon commented 2 years ago

Okay, I see your point and I think you are right. So we would need to build a database encompassing all the microbes found after the KrakenUniq filtering. But potentially we would not need to proceed to MALT and authentication for the samples where no pathogen was found. We could exclude these samples from the downstream analysis to reduce core-hour usage. What do you think? I don't know if that's even possible to do with Snakemake.
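
A rough sketch of that sample selection, assuming the filtered KrakenUniq taxids per sample are available as a dictionary. The function name `samples_with_pathogens`, the sample names, and the taxids are all hypothetical. In Snakemake, a data-dependent choice like this would probably need a checkpoint, because the list of samples is only known after the KrakenUniq step has run.

```python
from typing import Dict, Iterable, List


def samples_with_pathogens(
    detected_by_sample: Dict[str, Iterable[int]],
    pathogen_taxids: Iterable[int],
) -> List[str]:
    """Return the samples in which at least one listed pathogen was detected.

    Hypothetical helper: only the samples are filtered here. The MALT
    database would still be built from all detected species, not from
    pathogens alone.
    """
    pathogens = set(pathogen_taxids)
    return sorted(
        sample
        for sample, taxids in detected_by_sample.items()
        if pathogens.intersection(taxids)
    )


# Placeholder data: sample name -> taxids kept after KrakenUniq filtering.
detected = {
    "sample_A": [632, 562],
    "sample_B": [562, 1280],
    "sample_C": [],
}
to_authenticate = samples_with_pathogens(detected, pathogen_taxids=[632, 1773])
```

Only the samples returned here would go on to MALT and authentication; the others would stop after the KrakenUniq step.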

LeandroRitter commented 1 year ago

I am closing this for now, since we have discussed it a few times and I believe we have reached an agreement.