Allele calling threshold

mlarjim commented 2 years ago

Dear chewBBACA team,

I am wondering how chewBBACA determines that an allele is new. Does it use a threshold? According to the chewBBACA documentation, a new allele is inferred when it does not have an exact match in the schema but is highly similar to loci in the schema. Does this mean that if an allele does not share 100% of its nucleotides with any allele in the schema, it is classified as inferred, even if there is only a single nucleotide difference? If so, how can we modify this threshold? And what does "highly similar" mean in the statement above?

Thank you in advance. Best regards, Maria Lara

ramirma commented 2 years ago

Dear Maria Lara,

Thank you for your interest in chewBBACA. As you indicate, chewBBACA will identify a novel allele if an allele in a query sequence is not 100% identical at the nucleotide level to an allele already in the schema. There is no way to change this threshold. chewBBACA was written with typing in mind, and since even a single SNP allows two isolates to be differentiated, we thought of no use case where "ignoring" potentially discriminating information could be useful. This is such a basic principle that it is embedded in everything chewBBACA does, so it is impossible to change. As for what counts as "highly similar", this is evaluated by the BSR (BLAST score ratio) value of the protein encoded by the allele and by comparing the allele's size to the mode size of the alleles in the locus. The default values are a BSR > 0.6 and a variation in size < 20% of the mode, but these can be changed by the user through parameters passed to chewBBACA.
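To make the decision described above concrete, here is a minimal Python sketch. It is not chewBBACA's actual code: the function names, the return labels and the handling of values exactly at the thresholds are assumptions made for illustration. The BSR is taken here as the alignment score of the candidate protein against a schema protein divided by the self-alignment score of that schema protein, and the scores are assumed to have been obtained beforehand (for example from BLASTp).

```python
from statistics import mode


def blast_score_ratio(query_score, self_score):
    """BSR: score of candidate vs. schema protein over the schema protein's self-score."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return query_score / self_score


def classify_candidate(candidate_dna, locus_alleles, bsr,
                       bsr_threshold=0.6, size_threshold=0.2):
    """Classify a candidate allele for one locus (illustrative labels only)."""
    # Any nucleotide difference, even a single SNP, means "not an exact match".
    if candidate_dna in locus_alleles:
        return "exact"
    # The protein must be sufficiently similar (default: BSR > 0.6).
    if bsr <= bsr_threshold:
        return "not_similar"
    # The allele size must be within 20% of the locus mode size (default).
    locus_mode = mode([len(allele) for allele in locus_alleles])
    if abs(len(candidate_dna) - locus_mode) >= size_threshold * locus_mode:
        return "size_outlier"
    # Similar enough and of acceptable size, but not identical: a new allele.
    return "inferred"


alleles = ["ATGAAATAA", "ATGAAGTAA", "ATGAAATAA"]
print(classify_candidate("ATGAAATAA", alleles, bsr=1.0))
print(classify_candidate("ATGACATAA", alleles, bsr=blast_score_ratio(95, 100)))
```

The first call is an exact match; the second differs by a single nucleotide and is therefore reported as a new (inferred) allele, regardless of how high its BSR is.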

I hope to have clarified your question and that chewBBACA will be useful for your application.

Mario

mlarjim commented 2 years ago

Dear Mario, Thank you so much for your answer. How, then, does chewBBACA deal with sequencing errors? A new allele is inferred whenever a new nucleotide substitution is detected, but that substitution may have been introduced during the sequencing process.

ramirma commented 2 years ago

You raise an interesting point. Of course, simply counting the number of differences would not allow us to distinguish sequencing errors from true variation, so the decision of where to draw the line between these two possibilities is always fraught with uncertainty. Our decision was that the assemblies provided to chewBBACA would be the ground truth, i.e., we would assume that they were accepted by the user to be of sufficient quality (e.g., sequencing depth, error rate and assembly quality) for allele calling. Another source of errors, which you did not specifically address, is contamination of the original sample, something that may occur with significant frequency.

chewBBACA has no additional information besides the contigs, so it could never make decisions regarding the quality of a particular sequence. In chewBBACA, garbage in will mean garbage out. Having said this, a few observations should make you cautious about particular genomes, such as a high number of novel alleles or an unusually low number of loci identified, and the solution is to exclude these genomes from the allele calling. In chewBBACA 3.0 you have the option of running the allele call without committing any alleles to the local database. This allows you to run chewBBACA on genomes of uncertain quality without "polluting" your local database with alleles resulting from the problems described above. As I said, though, the decision to include such assemblies ultimately lies with the user, and chewBBACA does not offer any additional way to assist in that decision.
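The two warning signs mentioned above can be checked with a few lines of Python once per-genome counts are available. This is only a sketch: chewBBACA does not apply these checks itself, the function name is hypothetical, and the cut-off values (`max_novel_fraction`, `min_called_fraction`) are arbitrary placeholders that would have to be tuned to the species and schema in use.

```python
def flag_suspect_genomes(stats, total_loci,
                         max_novel_fraction=0.10, min_called_fraction=0.95):
    """Return {genome: [reasons]} for genomes showing either warning sign.

    stats maps a genome name to (loci_called, novel_alleles).
    Both thresholds are hypothetical defaults, not chewBBACA parameters.
    """
    flagged = {}
    for genome, (called, novel) in stats.items():
        reasons = []
        # Unusually low number of loci identified.
        if called / total_loci < min_called_fraction:
            reasons.append("few loci identified")
        # High proportion of novel alleles among the loci that were called.
        if called and novel / called > max_novel_fraction:
            reasons.append("many novel alleles")
        if reasons:
            flagged[genome] = reasons
    return flagged


stats = {"g1": (990, 5), "g2": (700, 10), "g3": (980, 300)}
print(flag_suspect_genomes(stats, total_loci=1000))
```

Genomes flagged this way are candidates for closer inspection or exclusion; whether to keep them is still a judgment for the user.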

I hope to have clarified your question.

Mario

mlarjim commented 2 years ago

Thank you for your quick and enlightening answer! I find the new option in chewBBACA v3.0 particularly interesting and was not aware of it. Will this version be available to install using conda soon?

ramirma commented 2 years ago

@rfm-targa is working on making a conda release soon. We will also make available new documentation detailing all the new features. In the meantime you can always clone the repository!

B-UMMI / chewBBACA

Allele calling threshold #146