Call of NIPH or NIPHM not reproducible

AleSR13 commented 2 years ago

Hello!

First of all, thanks for building and maintaining this tool! It is very well documented and easy to use!

I have one question/concern about the output. For my work, I need to validate each method we use, which includes evaluating the reproducibility of results. For this reproducibility analysis, I used 278 Salmonella samples and a few small sets (9-15 samples) of other genera. I analyzed each dataset 3 times using chewBBACA. For my large Salmonella dataset and for a small (n=10) Shigella dataset, I see inconsistencies between repeats, all related to NIPH and NIPHEM calls. Basically, some loci are called NIPH(EM) in one repeat, while in another repeat they get an allele ID assigned or (in a few cases) are classified as ASM or ALM. The problem happened more often with NIPHEM calls, but also a couple of times with NIPH ones. I wonder whether you have any insight on this.

I add more information just in case:

chewBBACA 2.8.4, installed through conda on an HPC cluster

My command to run chewbbaca was:

chewBBACA.py AlleleCall --cpu 10 \
            -i "${input_files}" \
            -g "${prepared_scheme}" \
            -o "." \
            --fr

For Salmonella I used the Enterobase scheme and I noticed that, in more than one sample, there were problems with the loci STMMW_13101 and STMMW35021. In both cases, multiple samples were called NIPH or NIPHEM in one repeat and assigned an allele in the other repeats.
For Shigella I used the Enterobase scheme for E. coli and also the one from SeqSphere. In both cases a few loci had the same issue. For the Enterobase scheme they were the loci b0018 and b0597. For the SeqSphere scheme the loci were ECx0636, ECs0637, ECs0919, ECs0979, ECs0981, ECs1263 and ECs1420.
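
For anyone who wants to locate discrepancies like these, here is a minimal Python sketch that compares the allele matrices of two repeats. It assumes a tab-delimited table with the sample name in the first column and one column per locus, and it assumes that newly inferred alleles carry an "INF-" prefix that should be ignored when comparing. The function names and the example file paths are hypothetical.

```python
import csv

def parse_calls(lines):
    """Parse a tab-delimited allele matrix into {sample: {locus: call}}.

    Assumed layout: first column is the sample, remaining columns are loci.
    An assumed "INF-" prefix on newly inferred alleles is stripped so that
    "INF-5" in one repeat and "5" in the next are treated as the same call.
    """
    reader = csv.reader(lines, delimiter="\t")
    loci = next(reader)[1:]
    calls = {}
    for row in reader:
        if not row:
            continue
        values = [c[4:] if c.startswith("INF-") else c for c in row[1:]]
        calls[row[0]] = dict(zip(loci, values))
    return calls

def diff_calls(first, second):
    """List (sample, locus, call_in_first, call_in_second) for each mismatch."""
    mismatches = []
    for sample in sorted(first.keys() & second.keys()):
        for locus, call in first[sample].items():
            other = second[sample].get(locus)
            if other != call:
                mismatches.append((sample, locus, call, other))
    return mismatches
```

Usage would look like `diff_calls(parse_calls(open("repeat1.tsv")), parse_calls(open("repeat2.tsv")))`, where both paths are placeholders for the result tables of two repeats.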

I hope you have a good suggestion/insight. Thanks in advance!

AleSR13 commented 2 years ago

Small update. I was afraid that I had somehow corrupted my database, especially because it seems that, of my 3 repeats, the first one is usually the one that is off. However, after repeating all the analyses with a newly downloaded and prepared scheme, the problem reappears, but in other datasets and with other loci. I don't think this helps much to find the issue, but at least it does not seem to be bound to one particular locus/scheme. Btw, this time, one of the errors was with a PLOT allele, not with a NIPH(EM). I also looked at one of the alleles marked as NIPHEM to see whether I could find it in the assembly. I found it only once.

rfm-targa commented 2 years ago

Hello @AleSR13,

Thank you for your interest in chewBBACA and apologies for such a delayed response. External schemas such as the ones you have used might not follow the same rules enforced by chewBBACA. When the schemas are adapted with the PrepExternalSchema process, some loci and alleles might be excluded because they do not correspond to complete coding sequences (CDSs). The alleles in the external schemas might also be valid, but chewBBACA uses Prodigal to predict CDSs, which might lead to the identification of slightly smaller or larger open reading frames if the start codon chosen by Prodigal is not the same as the one used in the CDSs of the external schema. These two aspects explain why some loci from external schemas are not accepted by chewBBACA and why you might get novel alleles assigned when the schema already contains an allele that is contained in or contains the novel allele.

One important feature of chewBBACA is that it identifies novel alleles in input assemblies and adds them to the schema. This can have side effects that lead to inconsistencies between different allele calls with the same dataset. When chewBBACA adds novel alleles, it must recompute the allele size mode for the loci with novel alleles (the size of each novel allele is added to a list with the sizes of all distinct alleles for a locus). This changes the mode value used to identify ASM and ALM cases, possibly changing the classification of alleles that were previously assigned valid allele identifiers (some alleles that were identified in the first input genome assemblies and assigned valid identifiers might be classified as ASM or ALM in the next allele call because the alleles identified in the other input assemblies changed the mode value).

chewBBACA will also identify new representative alleles, which are added to the "short" directory inside the schema folder. The representative alleles are the most divergent alleles for each locus and are used to search for novel alleles through alignment with BLASTp. Representative alleles that are identified in input assemblies are not used to search for novel alleles in input assemblies that have been processed in previous iterations of the same AlleleCall process. In the next allele call with the same dataset, the new representative alleles are aligned against the whole dataset and may identify further novel alleles. This can lead to inconsistent results between allele calls with the same dataset, especially NIPH and NIPHEM cases, when the representative alleles added to the schema make it possible to identify multiple alleles for the same locus in a single genome assembly.

You will stop getting slightly different results between allele calls with the same dataset once you have identified all the novel alleles in that dataset (that is, when chewBBACA no longer infers any novel alleles, reported as INF classifications). This means that to get stable results with a dataset you might need to perform allele calling more than once (when it stops identifying novel alleles, the mode value and the set of representative alleles will not change). I hope that my explanation about how chewBBACA works has provided the answers you were looking for.
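
The effect of a recomputed mode can be illustrated with a small Python sketch. It assumes that an allele is flagged when its length deviates from the locus mode by more than 20%; that threshold, the function name, and the allele sizes are illustrative assumptions, not taken from the chewBBACA source code.

```python
from statistics import mode

def classify_by_size(length, allele_sizes, threshold=0.2):
    """Classify an allele length against the mode of known allele sizes.

    Assumption for illustration: lengths below mode - mode*threshold are ASM,
    lengths above mode + mode*threshold are ALM, everything else is valid.
    """
    locus_mode = mode(allele_sizes)
    if length < locus_mode - locus_mode * threshold:
        return "ASM"
    if length > locus_mode + locus_mode * threshold:
        return "ALM"
    return "valid"

# Sizes of the distinct alleles known for a locus before the first run.
before = [300, 300, 300, 372, 372]
# Two novel alleles of length 372 are added during the run, so the mode
# moves from 300 to 372.
after = before + [372, 372]

print(classify_by_size(290, before))  # valid (accepted range 240-360)
print(classify_by_size(290, after))   # ASM (accepted range 297.6-446.4)
```

The same 290 bp allele is accepted in the first call and flagged in the second, only because the list of allele sizes for the locus grew between the two calls.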

Best regards,

Rafael

ramirma commented 2 years ago

Indeed, so sorry for the belated response. I had told Rafael that I would respond and never did, so it is really my fault. Just to add to Rafael's excellent clarification: when you use well-established and frequently used schemas, the issues Rafael discussed should be greatly minimized or not occur at all, since there is already a great diversity of identified alleles. In these circumstances, it is unlikely that allele calling on a set of strains will cause dramatic changes in the representative alleles or in the mode size. This occurs mostly when you are defining a schema de novo, and our suggestion is always to perform allele calling on as many strains as possible with a new schema before employing it in new typing analyses. We thank you for your interest in chewBBACA. Mario

AleSR13 commented 2 years ago

Thanks to you both for the clear explanation! I had already figured out the inconsistencies with ALM and ASM due to the recalculation of the mode, but I had not thought about the consequences that a change in the representative alleles would have for the allele calls. Your explanation is very clear and it has helped a lot already. I do get that this could cause new alleles to be identified (and labelled as NIPH). However, they would not be NIPHEMs, right? I assume those would be identified every time, even if the representative changes. Also, I am not sure I understand how they could be identified as PLOT. Am I maybe failing to see how this would happen? I guess my view of your algorithm is too naïve, so I might just be missing some insight into what happens and how.

ramirma commented 2 years ago

Glad that you had already made some progress and that our responses shed some further light on the issue. It is hard to comment on the changes and issues you raise without having the actual data. What I suspect may be happening with both NIPH and NIPHEM is that a change in locus mode is causing some new loci to be considered and, as new alleles are included in the database, occasionally a recently identified new allele will be matched to something that had been classified as NIPH before. But I am really hand-waving here without carefully looking at the data. The PLOTs are simply alleles that are too close to the contig edges when compared to the mode of that locus. In these cases we are uncomfortable that chewBBACA could be calling smaller alleles than it should because of that, and hence we tag them as PLOT. Again, if the mode changes, PLOTs could also change... I hope to have given you some clues as to what may be happening. As overall advice for consistency, it is best to use a well-curated and well-populated schema, which should result in much more robust allele calls. However, due to the addition of alleles during a chewBBACA run, there is always an element of potential variability. We are implementing some changes in chewBBACA 3.0 to minimize this, but it is an intrinsic aspect of the process which will be impossible to eliminate completely.
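
One practical way to judge whether a dataset has reached the stable state discussed earlier in this thread (no more INF classifications) is to count the inferred alleles in the result table of each run. The Python sketch below assumes a tab-delimited table with the sample in the first column and inferred alleles marked with an "INF-" prefix; the function name is hypothetical.

```python
import csv

def count_inferred(lines):
    """Count cells that start with the assumed "INF-" prefix.

    `lines` is any iterable of tab-delimited text lines, for example an
    open file handle. The header row and the sample column are skipped.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    return sum(cell.startswith("INF-") for row in reader for cell in row[1:])
```

If a rerun on the same dataset reports a count of zero, the schema did not gain alleles during that run, which is the condition under which the mode and the representative set are expected to stay unchanged.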

B-UMMI / chewBBACA

Call of NIPH or NIPHM not reproducible #115