The difference prediction results between the Protein and DNA input

limeng849 commented 3 years ago

Hello, this is a very convenient tool for detecting cas cassettes. I have run into a problem: when I use the tool to predict cas cassettes in MAGs, the number of predicted cas genes differs depending on the input format (protein or DNA). Is that because the software assumes that an input protein fasta file contains only one cassette?

Also, I can't get the DNA sequence or sequence ID of the predictions for the DNA input. Meanwhile, I wonder why different regressors and classifiers produce the same prediction results. Or should I use different parameters, like -r CART and -c ERT? (I currently use the same model for both, like -r CART -c CART.)

Hope to get your reply, thanks a lot!!

padilha commented 3 years ago

Hello @limeng849

When you provide a protein fasta file as input, CRISPRCasIdentifier assumes that the file consists of a single cassette, i.e., it will not try to find the cassette inside a larger set of proteins. Thus, an input file should contain a small number of proteins, typically somewhere between 3 and 15 or 20. The tool labels the proteins based on the HMM models and then proceeds with the prediction of potentially missing proteins and the classification.
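As a rough illustration of the size expectation above, here is a minimal sketch of a pre-check you could run on a protein fasta file before passing it to the tool. It is not part of CRISPRCasIdentifier; the function names and the bounds (3 and 20, taken from the approximate range mentioned above) are assumptions made for this example.

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count sequences in FASTA-formatted text (one '>' header per record)."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def looks_like_single_cassette(fasta_text: str,
                               min_proteins: int = 3,
                               max_proteins: int = 20) -> bool:
    """Heuristic only: True if the record count is in a cassette-sized range.

    The default bounds are assumptions based on the rough range given in
    this thread; they are not thresholds enforced by the tool itself.
    """
    n_records = count_fasta_records(fasta_text)
    return min_proteins <= n_records <= max_proteins
```

If the check fails because the file holds hundreds or thousands of proteins (for example, all proteins of a MAG), the file is probably not a single cassette and the protein-input mode is not the right entry point.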

When you provide a DNA fasta file as input, CRISPRCasIdentifier will try to find the cassettes inside the organism. The tool performs this task based on a very simple procedure. It extracts the proteins with Prodigal and builds the cassettes based on a minimum number of proteins next to each other, a maximum number of unknown proteins between two Cas proteins (which is called a "gap") and a maximum nucleotide distance between neighboring genes.
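The grouping step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual CRISPRCasIdentifier code: the `Gene` class, the `build_cassettes` function, and the default thresholds are hypothetical, and the `is_cas` flag stands in for the result of the HMM-based labeling of the Prodigal-predicted proteins.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    start: int      # start coordinate on the contig (nucleotides)
    end: int        # end coordinate on the contig (nucleotides)
    is_cas: bool    # assumed to come from an HMM hit; hypothetical field


def build_cassettes(genes: List[Gene],
                    min_cas: int = 2,
                    max_gap: int = 1,
                    max_distance: int = 500) -> List[List[Gene]]:
    """Group neighboring genes into candidate cassettes.

    min_cas      -- minimum number of Cas proteins in a cassette
    max_gap      -- maximum number of unknown proteins between two Cas proteins
    max_distance -- maximum nucleotide distance between neighboring genes
    All default values are illustrative assumptions.
    """
    cassettes: List[List[Gene]] = []
    current: List[Gene] = []   # genes from the first to the latest Cas hit
    pending: List[Gene] = []   # unknown genes seen since the latest Cas hit
    prev = None

    def keep(candidate: List[Gene]) -> None:
        # Store the candidate only if it has enough Cas proteins.
        if sum(g.is_cas for g in candidate) >= min_cas:
            cassettes.append(list(candidate))

    for gene in sorted(genes, key=lambda g: g.start):
        too_far = prev is not None and gene.start - prev.end > max_distance
        if current and too_far:
            keep(current)
            current, pending = [], []
        if gene.is_cas:
            # Unknown genes inside the gap become part of the cassette.
            current.extend(pending)
            current.append(gene)
            pending = []
        elif current:
            pending.append(gene)
            if len(pending) > max_gap:
                # Gap too long: close the cassette, drop trailing unknowns.
                keep(current)
                current, pending = [], []
        prev = gene

    keep(current)
    return cassettes
```

Because a cassette is only closed once the gap or distance limit is exceeded, the same set of proteins can end up in a different number of cassettes than when those proteins are supplied directly as one protein fasta file, which is one way the two input modes can disagree.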

Therefore, note that CRISPRCasIdentifier was not designed to search for cassettes inside a set of proteins. If you want to connect the DNA ID with the protein ID, you may try using Casboundary. This tool searches for cassettes inside a DNA input file, based on HMM searches and an optimization-based Machine Learning procedure. It can be easily integrated with CRISPRCasIdentifier (look at the end of the README file).

If you still want to search for cassettes inside a large protein fasta file, you may try using CRISPRloci. We implemented a modified version of Casboundary in it to deal with protein fasta files. In addition, CRISPRCasIdentifier is also integrated as one of the last steps of the pipeline. There is even a standalone version available here.

Even though the ML models have different biases and are based on different assumptions, they are not supposed to always return different classification results. Since the classification of a cassette usually relies on a subset of key genes, the predictions won't vary much in many cases. This is especially true because we have some evidence that our classifiers are driven by signature genes, even though we didn't provide any information during training about which genes are considered signatures of which CRISPR subtypes (we discuss this in the paper). On the other hand, for very noisy cases, some variation between different models may occur.

Finally, about the -r and -c parameters, you can vary them as you prefer. It is not mandatory to use the same model for -r and -c.

limeng849 commented 3 years ago

Thank you for your kind suggestions👍🏻👍🏻👍🏻👍🏻👍🏻

padilha commented 3 years ago

You're welcome! If you have any problems, feel free to open another issue. :-)

In addition, as a reminder, based on our results we recommend using the ERT ML model, since it achieved the best results in general.

BackofenLab / CRISPRcasIdentifier