czbiohub-sf / MIDAS

Metagenomic Intra-Species Diversity Analysis (MIDAS)
MIT License
35 stars 10 forks source link

Query regarding specificity of the tool #137

Closed sahilrishav2 closed 3 months ago

sahilrishav2 commented 4 months ago

Hello,

I am performing benchmarking of some tools in which I have included MIDAS2. I have performed analysis on my simulated datasets (10M paired-end reads of 1400 strains) out of which it could detect only 40 strains. Could you explain to me the reason behind this? The database I used was GTDB.

Thank you

zhaoc1 commented 4 months ago

Hi,

Please refer to the single sample SNV for the detailed explanation. In short, MIDAS2 only genotype species meeting the filter specified by --select_by and --select_threhold. Users can adjust these two parameter to decide the list of species they want you genotype.

If you want to genotype all the strains in your simulated reads, you can also follow build you own genome index to achieve this goal.

Thanks, Chunyu

sahilrishav2 commented 4 months ago

Hi,

Actually, In building the own genome index, some error persists, so, I downloaded the GTDB customized by midas2 and those 1400 strains are available in GTDB.

zhaoc1 commented 4 months ago

What is your error message? Is it from MIDAS2?

sahilrishav2 commented 4 months ago

Oh really sorry, I sent you a different error message. Actually, the error i got is not available currently

sahilrishav2 commented 4 months ago

It was days ago, so, i don't have error report. then i decided to download the GTDB customised by MIDAS2 to perform the analysis.

sahilrishav2 commented 4 months ago

I want to ask if I am using GTDB and those strains are available in the database then why the tool can detect only a few strains?

zhaoc1 commented 4 months ago

Sure. Check the first link I sent to you. That answers your question.

sahilrishav2 commented 4 months ago

Thank you so much for your support

sahilrishav2 commented 4 months ago

Hi,

sorry to bother you again. I want to say that even after using this command midas2 run_snps --sample_name synthetic_reads_10M -1 synthetic_reads_10M_R1.fastq -2 synthetic_reads_10M_R2.fastq --midasdb_name gtdb --midasdb_dir /home/rishabh/my_midasdb_gtdb --num_cores 40 --select_by median_marker_coverage,unique_fraction_covered --select_threshold=2,0.5 midas2_output_1355_strains_reads_10M . Only few strains were detected.

cat snps_summary.tsv 
species_id  genome_length   covered_bases   total_depth aligned_reads   mapped_reads    fraction_covered    mean_coverage
100071  7532486 6776814 207976720   1603677 1464870 0.900   30.689
100259  6534206 6330743 47322768    361190  333329  0.969   7.475
100148  6150048 5962769 60054377    607189  422886  0.970   10.072
100272  6028589 5665629 39938776    408227  281253  0.940   7.049
100338  6345256 5342492 32115412    325309  226136  0.842   6.011
100057  5962738 5370418 121980244   961247  859046  0.901   22.713
100250  5076188 4935426 64085233    497484  451340  0.972   12.985
100074  5657766 5330351 37765574    353582  265958  0.942   7.085
100078  5710863 5050473 18761658    180068  132137  0.884   3.715
100048  4990672 4792836 314936606   2388216 2218247 0.960   65.710
100115  3777657 3776607 110315594   783730  776975  1.000   29.210
100153  2607257 2504168 74563049    579688  525211  0.960   29.776
sahilrishav2 commented 4 months ago

@zhaoc1 , Please suggest some solution to it.

zhaoc1 commented 4 months ago

--select_by median_marker_coverage,unique_fraction_covered --select_threshold=2,0.5 This mean you only genotype species meeting these two filters based on the species/species_profile.tsv. And it is expected to be only a few species.

--select_threshold=-1 and --species_list /path/to/the list of species you want to genoytpe This might be what you want.

Refer to this link for more details: https://midas2.readthedocs.io/en/latest/species_module.html#abundant-species-selection

Thanks. Chunyu

sahilrishav2 commented 3 months ago

Hi @zhaoc1 ,

Even after using this command midas2 run_snps --sample_name synthetic_reads_10M -1 synthetic_reads_10M_R1.fastq -2 synthetic_reads_10M_R2.fastq --midasdb_name gtdb --midasdb_dir /home/rishabh/my_midasdb_gtdb --num_cores 40 --select_threshold=-1 midas2_output_1355_strains_reads_10M --species_list 100071 to identify the strains of a single species. It was able to identify only one strain:

cat snps_summary.tsv 
species_id  genome_length   covered_bases   total_depth aligned_reads   mapped_reads    fraction_covered    mean_coverage
100071  7532486 6790769 213971382   1760066 1507106 0.902   31.509
zhaoc1 commented 3 months ago

Hi,

You specify species-list 100071, and this means you only want to run snps for 100071.

Get a txt file with the list of the species ids you want to genotype and pass the file name to species-list.

Chunyu

sahilrishav2 commented 3 months ago

Hi @zhaoc1 ,

I was just checking how many strains of species list 100071, tool was able to detect. In my metagenome reads, I have 236 strains of this species list 100071 and those 236 strains are also available in GTDB but the tool could identify only one strain.

zhaoc1 commented 3 months ago

MIDAS2 only do genotyping the SNPs profile. If you want to infer strains, it is out of scope of MIDAS2. Try https://github.com/bsmith89/StrainFacts

sahilrishav2 commented 3 months ago

Ok, thank you for the clarification. I thought that MIDAS2 could detect strains through snv method as in the paper this statement claims so

MIDAS2 performs strain-level metagenomics analysis through a SNV module and a CNV module

Thank you very much for your time