MrOlm / inStrain

Bioinformatics program inStrain
MIT License
135 stars 33 forks source link

Question about dRep for getting representative genome database? #189

Open akques opened 1 month ago

akques commented 1 month ago

Hi!

Thanks so much for developing tools like inStrain and dRep, they're very user-friendly. I have some questions regarding building the representative (rep) database. Initially, I used de novo assembled bins to create my database. Following your suggestion, I chose dRep with a fastANI threshold of 98%. However, upon reviewing my non-redundant genomes, I noticed that some of my dereplicated representatives have genomes with fastANI-calculated ANI > 98%. This has left me confused because you mentioned that the rep database should not be too similar, and recommended a threshold of 98%.

Should I run dRep on my rep genomes a second time with the fastANI threshold set to 98%, or is there another approach I should consider?

Additionally, I'm concerned whether using parameters such as --min_read_ani 0.92 with a rep database containing inside ANI > 98% could lead to read mis-mapping. How can I check for potential mis-mapping?

My instrain profile parameters: inStrain profile ${bam} ${genome_database} -o instrain_profile/${sample_id}-vs-allbins.IS -p ${task.cpus} -s ${scaffold2bin_file} -g ${gene_level_profiles} --database_mode My instrain compare parameters: inStrain compare -i ./instrain_profile/*.IS -o genomefile.IS.COMPARE -p 60 -s repgenomes.stb --breadth 0.5 -ani 0.99999 -cov 0.5 --database_mode.

I'm also unsure about the -cov parameter, even after checking the user manual. I see your default setting is 0.1. Could you provide more detailed information about its usage?

Any help would be greatly appreciated!

MrOlm commented 1 month ago

Hello!

Thanks for the kind words.

1) Yes- this can happen as a result of hierarchical clustering. Some more info can be found here- https://drep.readthedocs.io/en/latest/choosing_parameters.html#oddities-of-hierarchical-clustering . You could re-run dRep again (called tertiary clustering) and that can help, but it's also usually not a big deal and doesn't impact too much.

2) You could probably set your --min_read_ani to 0.95 if you really want to avoid mis-mapping, but I also think the default will work out fine too. The biggest sign of mis-mapping is having a very bumpy coverage profile across the genome (some areas with very high coverage and some with very low); you could probably check for this by looking at the distribution of gene-level coverage values. However, inStrain compare is robust to mis-mapping, so you might not need to worry about it too much.

3) The -cov parameter sets the limit of how much of the genome needs to be compared it order for inStrain compare to produce a result. At 0.1, it requires that both samples have at least the same 10% of the genome at sufficient coverage to report a result. I wouldn't set it below 0.1, but you could raise if it you want to be more stringent. The more you raise the value, the higher coverage will be required to compare the genomes.

Best, Matt

akques commented 1 month ago

Hi, @MrOlm,

Thank you so much for your quick reply. Regarding your first answer, I also used dRep to dereplicate specific species genomes with mash ANI 99.99% and fastANI 95% to build its SGB (species-level genome catalogue, as I understand it). I'm wondering if re-running dRep would also be applicable in this situation?

Additionally, as you suggested earlier, I thought incorporating a public database into my inStrain analysis would be beneficial. I believe adding a public database could help capture low abundance MAGs that were not initially binned.

I still have a question about determining whether de novo assembled bins are sufficient. You mentioned calculating the percentage of reads that map to genomes in your database. Could you confirm if the following method is correct for this calculation? Since I have many samples, I am also a little confused about how to calculate it:

Mapped percentage of reads = inStrain genome_wide filtered_read_pair_count / total reads in the mapped sample

Moreover, the Mgnify UHGG database only provides species-level representatives with ANI 95%. How can I obtain a sub-species level Mgnify UHGG representative database? Should I run dRep myself with whole genomes using fastANI 98%, or should I re-run dRep on the existing Mgnify rep genomes at fastANI 98%?

Many thanks :)