MrOlm / inStrain

Bioinformatics program inStrain
MIT License
134 stars, 33 forks

How should inStrain be tuned for a reference genome set consisting of high relative abundance strains? #157

Closed adobe7105 closed 10 months ago

adobe7105 commented 11 months ago

Hi @MrOlm, I'm a bit confused about inStrain's parameter settings, and any suggestions would be much appreciated. Here is my workflow: first, I calculate the relative abundance of strains using metaphlan4; then I select the strains with an average relative abundance greater than 0.5% for SNV annotation, to ensure high quality. All selected reference genomes were downloaded from NCBI.

I run profile with the following parameters: --min_genome_coverage 10 --min_read_ani 0.95 --skip_plot_generation --skip_mm_profiling

Since my research is based on a small reference genome set (18 species), should I raise min_read_ani to 0.98?

Best, Zemin
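
For readers who want to reproduce the call, here is a minimal sketch of how the full command line might be assembled. The four flags are the ones quoted above; the BAM, FASTA, and output names are hypothetical placeholders, and the sketch only builds and prints the command rather than running it.

```python
import shlex

def build_profile_command(bam_path, fasta_path, out_dir):
    """Assemble an `inStrain profile` call with the flags quoted in this thread."""
    return [
        "inStrain", "profile", bam_path, fasta_path,
        "-o", out_dir,
        "--min_genome_coverage", "10",   # skip genomes below 10x coverage
        "--min_read_ani", "0.95",        # minimum read identity to the reference
        "--skip_plot_generation",
        "--skip_mm_profiling",
    ]

# Hypothetical file names, for illustration only.
cmd = build_profile_command("sample1.sorted.bam", "18_species.fasta", "sample1.IS")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.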

MrOlm commented 11 months ago

Hi Zemin,

Just a couple of questions to make sure I understand-

1) How similar are the genomes in your reference genome database?

2) Are you mapping to all 18 species at the same time in all samples?

3) Are you using metaphlan4 in order to "prescreen" samples, so that you don't have to run inStrain on all samples?

Best, Matt

adobe7105 commented 11 months ago

Hi @MrOlm, my research focuses on finding signature SNV markers in high-abundance strains in disease populations. As I understand it, the number of SNVs detected in a gene is related to the relative abundance and sequencing depth of the strain. I therefore used the following workflow:

  1. metaphlan4 was used to measure relative abundance in the diseased samples, and strains with an average relative abundance greater than 0.5% were selected (listed below).
  2. A reference genome set was built from the selected strains, and SNVs were annotated following the process described in the inStrain usage instructions (mapping to all 18 species at the same time).
  3. The number of SNVs was standardized by strain relative abundance and strain sequencing depth: Standardized SNV number = SNV number / (relative abundance * sequencing depth). Here, relative abundance is the abundance of the individual strain in each sample (whether or not it is expressed as a percentage does not affect the standardization), and sequencing depth is the sequencing depth of the strain = total bases / strain genome size. Reference: https://onlinelibrary.wiley.com/doi/10.1002/imt2.40
  4. A machine learning model was built using the number of SNVs on genes (SNS + SNV) to filter for SNV features.
  5. The test set was used to evaluate model performance.

I hope you can give me some guidance; it would be really valuable to me.

Selected species: Prevotella_copri, Phocaeicola_vulgatus, Phocaeicola_coprocola, Phocaeicola_plebeius, Phocaeicola_massiliensis, Phocaeicola_dorei, Bacteroides_stercoris, Bacteroides_fragilis, Bacteroides_ovatus, Bacteroides_uniformis, Bacteroides_caccae, Bacteroides_thetaiotaomicron, Megamonas_funiformis, Faecalibacterium_prausnitzii, Fusobacterium_mortiferum, Parabacteroides_distasonis, Ruminococcus_gnavus, Alistipes_putredinis

Best, Zemin
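
The standardization described in step 3 can be sketched as follows. This is only an illustration of the formula as written in the comment; the function names and the example numbers are made up, and nothing here comes from inStrain itself.

```python
def strain_sequencing_depth(total_bases, genome_size):
    """Depth of the strain = total bases / strain genome size (as defined in step 3)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size

def standardized_snv_count(snv_count, relative_abundance, total_bases, genome_size):
    """Standardized SNV number = SNV number / (relative abundance * sequencing depth)."""
    depth = strain_sequencing_depth(total_bases, genome_size)
    denominator = relative_abundance * depth
    if denominator <= 0:
        raise ValueError("relative abundance and depth must both be positive")
    return snv_count / denominator

# Invented example: 500 SNVs, 2% relative abundance,
# 50 Mb of sequenced bases over a 5 Mb genome (depth = 10x).
value = standardized_snv_count(500, 0.02, 50_000_000, 5_000_000)
print(value)
```
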
MrOlm commented 11 months ago

Hi Zemin,

OK- I now understand your research question.

With regard to your original question, I think your command is great and there's no need to adjust min_read_ani. In your analysis you want to detect SNVs, and raising min_read_ani will just hamper that goal.

The only other comment I have is to maybe not standardize the number of SNVs detected in that way. The problem is that greater sequencing depth doesn't always lead to more SNVs being detected, so doing that will underestimate the number of SNVs detected in high-coverage genomes. If you set a minimum detection depth of 10x coverage, and only look at SNVs at 20% abundance or higher, that should go a long way toward correcting for biases due to sequencing depth.
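
As a rough sketch of that kind of filtering, assuming the SNV calls have already been loaded as records: the field names `coverage` and `allele_freq` below are placeholders rather than inStrain's actual column names, and applying the 10x cutoff per site is just one possible reading of the suggestion.

```python
MIN_COVERAGE = 10       # minimum depth, per the 10x suggestion above
MIN_ALLELE_FREQ = 0.20  # keep SNVs at 20% abundance or higher

def filter_snvs(snvs, min_coverage=MIN_COVERAGE, min_allele_freq=MIN_ALLELE_FREQ):
    """Keep only records that pass both the depth and the frequency cutoff."""
    return [
        snv for snv in snvs
        if snv["coverage"] >= min_coverage and snv["allele_freq"] >= min_allele_freq
    ]

# Invented records, for illustration only.
example_snvs = [
    {"position": 101, "coverage": 35, "allele_freq": 0.48},   # kept
    {"position": 250, "coverage": 8, "allele_freq": 0.50},    # dropped: depth below 10x
    {"position": 377, "coverage": 120, "allele_freq": 0.03},  # dropped: frequency below 20%
    {"position": 512, "coverage": 10, "allele_freq": 0.20},   # kept: cutoffs are inclusive
]

kept = filter_snvs(example_snvs)
print([snv["position"] for snv in kept])
```

Counting SNVs per gene after a filter like this, instead of dividing by depth, keeps low-frequency calls from inflating the counts for high-coverage genomes.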

Best, Matt