MrOlm / inStrain

Bioinformatics program inStrain
MIT License

Genome breadth of coverage #142

Closed Liuyuxinn closed 1 year ago

Liuyuxinn commented 1 year ago

Hi.

To determine which genomes are present or absent in my samples, I mapped reads from each sample to reference genomes from a published database.

I infer strain presence from the genome breadth of coverage reported by inStrain quick_profile. If a genome has a breadth of coverage of 30% or above, I consider it "present" in that sample.

Which parameter should I use (--breadth_cutoff 0.3 or --stringent_breadth_cutoff 0.3), and what is the difference between them?

However, when I set --breadth_cutoff 0.3 or --stringent_breadth_cutoff 0.3, inStrain quick_profile still reports results for all genomes in genomesCoverage.csv, even genomes whose breadth of coverage is below 0.3 in a sample.

Does this mean I can use all genomes that appear in genomesCoverage.csv, or do I need to filter out genomes with low breadth (< 0.3) in a sample?

I am also wondering what the inStrain quick_profile output files (genomesCoverage.csv, coverm_raw.tsv, scaffolds.txt) mean.

Thanks.

MrOlm commented 1 year ago

Hi @Liuyuxinn ,

The stringent_breadth_cutoff is used as a pre-filtering step. A quick estimate is run very early in the program to guess the breadth of each scaffold, and only scaffolds that pass this breadth are included. It's really only useful for making the program run a little faster, but I would strongly recommend keeping it at its default value, since changing it will probably only save seconds anyway.

The breadth_cutoff is only useful if you plan on using the scaffolds.txt output file, as it writes all of the scaffolds of genomes with breadth over that threshold to that file.

I recommend leaving both of these values at the default, and then just filtering the output to your preferred breadth cutoff.
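
As a rough sketch of that post-hoc filtering, something like the following could work. It uses only the Python standard library and a small inline table standing in for genomesCoverage.csv. The column names (`genome`, `coverage`, `breadth`) and the helper `filter_by_breadth` are assumptions made for illustration, so check the header of your own file and adjust `breadth_col` if it differs.

```python
import csv
import io


def filter_by_breadth(rows, min_breadth=0.3, breadth_col="breadth"):
    """Keep only rows whose breadth is at or above min_breadth.

    `breadth_col` is an assumed column name; change it to match
    the header of your own genomesCoverage.csv.
    """
    return [row for row in rows if float(row[breadth_col]) >= min_breadth]


# Toy stand-in for genomesCoverage.csv (made-up values, assumed columns).
example_csv = """genome,coverage,breadth
genomeA.fasta,12.5,0.85
genomeB.fasta,0.4,0.12
genomeC.fasta,2.1,0.30
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
present = filter_by_breadth(rows, min_breadth=0.3)
present_genomes = [row["genome"] for row in present]
print(present_genomes)  # ['genomeA.fasta', 'genomeC.fasta']
```

For a real run you would replace `io.StringIO(example_csv)` with `open("genomesCoverage.csv", newline="")` and apply the same filter to each sample's file.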

The files you listed are the correct outputs.

Best, Matt