bede / hostile

Precise host read removal
MIT License

Selection of appropriate reference genomes (indexes) #36

Open DrYoungOG opened 1 month ago

DrYoungOG commented 1 month ago

Hi, the software seems very good, but I am new to metagenome analysis and have questions about selecting appropriate reference genomes (indexes) when using 'hostile'.

1. I read the "Reference genomes (indexes)" part of the README, and my understanding is: compared with using 'human-t2t-hla', additional reads from the '985 reference grade bacterial genomes' will be preserved if 'human-t2t-hla-argos985' is used. Is my understanding right?

2. I have metagenome sequencing data from human stool samples, and I want to analyze the bacteria, archaea, fungi, and viruses in these samples. I have used 'fastp' for quality control, and the next step should be host decontamination to remove reads from the host, i.e. humans (am I right?). Can this step be completed using 'hostile'? If so, how should I select reference genomes (indexes)? Should I select 'human-t2t-hla' for my objective?

Thank you!

bede commented 1 month ago

Hi there,

  1. Hostile offers equal or higher retention of bacterial reads than other host read removal approaches with the default index (human-t2t-hla), but using human-t2t-hla-argos985 can slightly improve this for bacterial samples at the cost of a very small reduction in host read removal performance.
  2. I would suggest performing host decontamination first if convenient, though it is unlikely to matter. Fastp can optionally trim sequences, which might reduce the accuracy of subsequent human read removal. For your use case I would recommend using the index human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 to minimise the number of bacterial, viral and bacteriophage sequences removed. However, you should be wary of using any of the current standard Hostile indexes to study fungi, since many eukaryotic genes will map to the human index and thus be removed.

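For readers following along, here is a minimal sketch of how the index recommended above might be passed to Hostile from a Python script. It only assembles the command line; the `--fastq1`, `--fastq2` and `--index` options are taken from the Hostile README as I understand it, so check `hostile clean --help` for your installed version. The helper `build_hostile_command` and the FASTQ file names are hypothetical and purely illustrative.

```python
import shlex

# Index name copied from the reply above (masked against bacterial,
# viral and phage genomes).
STOOL_INDEX = "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401"


def build_hostile_command(fastq1, fastq2=None, index="human-t2t-hla"):
    """Assemble a `hostile clean` command as a list of arguments.

    Hypothetical helper: the option names are assumed from the Hostile
    README and should be verified with `hostile clean --help`.
    """
    cmd = ["hostile", "clean", "--fastq1", fastq1]
    if fastq2:
        # Paired-end data: pass the second mate file as well.
        cmd += ["--fastq2", fastq2]
    cmd += ["--index", index]
    return cmd


# Hypothetical paired-end stool sample files.
cmd = build_hostile_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", STOOL_INDEX)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once Hostile is installed. Building the command as a list rather than a single string avoids shell quoting problems with unusual file names.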
DrYoungOG commented 1 month ago

Thanks for your reply!

Please forgive me for asking some potentially foolish questions:

  1. The more heavily an index is masked, the more microbial sequences are preserved, right?
  2. In my case, should I perform host decontamination first, and then do quality control with fastp?
  3. For studying the fungi in metagenome sequencing data from fecal samples, do you have any recommendations for methodologies and software?

Thank you!

bede commented 1 month ago

No problem :)

  1. Correct
  2. Ideally yes. If enabled, trimming would reduce the number of human reads removed by Hostile.
  3. I am not sure what is common practice in fungal metagenomics. If possible, you should analyse the data without first removing human reads so that you know what (if anything) is being removed during host decontamination.
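As a rough first check on how much is removed during host decontamination, one could compare read counts before and after running Hostile. The sketch below is not part of Hostile: the functions `count_fastq_reads` and `removal_summary` are hypothetical helpers, and it assumes standard four-line FASTQ records, optionally gzip-compressed. It shows only how many reads were removed, not which organisms they came from, so it complements rather than replaces classifying the unfiltered data.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def removal_summary(reads_in, reads_out):
    """Summarise how many reads were removed between two read counts."""
    removed = reads_in - reads_out
    proportion = removed / reads_in if reads_in else 0.0
    return {
        "reads_in": reads_in,
        "reads_out": reads_out,
        "reads_removed": removed,
        "proportion_removed": proportion,
    }
```

For example, `removal_summary(count_fastq_reads("raw.fastq.gz"), count_fastq_reads("clean.fastq.gz"))` would report the fraction of reads dropped, where both file names are placeholders for your own input and decontaminated output.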

I will consider creating a fungal-masked human index, though I can't promise anything.

DrYoungOG commented 1 month ago

Thanks for your patience. I'm looking forward to the masked fungal human index!