BackofenLab / CRISPRcasIdentifier

Machine learning for accurate identification and classification of CRISPR-Cas systems
GNU General Public License v3.0

Interpretation of the prediction.csv #1

Closed yjiakang closed 4 years ago

yjiakang commented 4 years ago

Thanks for your useful tool. I am confused about the meaning of prediction.csv. Hoping for your reply. Best

padilha commented 4 years ago

Hello @yjiakang. Sorry for taking some time to respond.

prediction.csv has 5 columns (HMM, cassette_id, classifier, regressor and predicted_label).

Each row corresponds to one combination of HMM set, classifier and regressor, together with its output for one cassette. For example, consider the file below (which was produced using a DNA fasta file as input):

HMM,cassette_id,classifier,regressor,predicted_label
HMM1,1,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM1,2,ERT,ERT,"[('CAS-I-F', 1.0)]"
HMM3,1,ERT,ERT,"[('CAS-II-C', 0.98), ('CAS-III-D', 0.01), ('CAS-V-A', 0.01)]"
HMM3,2,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM3,3,ERT,ERT,"[('CAS-I-F', 1.0)]"
HMM5,1,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM5,2,ERT,ERT,"[('CAS-I-F', 1.0)]"

From this example, we can see that our tool found 2 cassettes when using HMM1, 3 cassettes when using HMM3 and 2 cassettes when using HMM5.

Let's take the fourth line of the file (the third data row after the header, i.e. HMM3 with cassette_id 1) as an example. It means that HMM3 + ERT regressor + ERT classifier predicted that cassette 1 may belong to CAS-II-C, CAS-III-D or CAS-V-A with probabilities of 0.98, 0.01 and 0.01, respectively. Also note that a cassette id is only meaningful together with the HMM set that generated it. Thus, cassette 1 from HMM1 may not be the same as cassette 1 from HMM3, for example.
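For anyone who wants to read the file programmatically, here is a minimal Python sketch of the interpretation described above. It is not part of CRISPRcasIdentifier: the helper names `read_predictions` and `top_label` are made up for illustration, and it assumes that `predicted_label` is always a Python-style list of `(label, probability)` tuples, as in the example.

```python
import ast
import csv
import io

# Sample copied from the example in this thread. A real run would instead open
# the prediction.csv written to the output directory (path not assumed here).
SAMPLE = '''HMM,cassette_id,classifier,regressor,predicted_label
HMM1,1,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM1,2,ERT,ERT,"[('CAS-I-F', 1.0)]"
HMM3,1,ERT,ERT,"[('CAS-II-C', 0.98), ('CAS-III-D', 0.01), ('CAS-V-A', 0.01)]"
HMM3,2,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM3,3,ERT,ERT,"[('CAS-I-F', 1.0)]"
HMM5,1,ERT,ERT,"[('CAS-I-C', 1.0)]"
HMM5,2,ERT,ERT,"[('CAS-I-F', 1.0)]"
'''


def read_predictions(handle):
    """Hypothetical helper: parse prediction.csv rows into dictionaries."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "hmm": row["HMM"],
            "cassette_id": int(row["cassette_id"]),
            "classifier": row["classifier"],
            "regressor": row["regressor"],
            # Assumption: the column holds a list literal of (label, prob) tuples.
            "labels": ast.literal_eval(row["predicted_label"]),
        })
    return rows


def top_label(prediction):
    """Hypothetical helper: return the (label, probability) pair with the highest probability."""
    return max(prediction["labels"], key=lambda pair: pair[1])


predictions = read_predictions(io.StringIO(SAMPLE))
for p in predictions:
    label, prob = top_label(p)
    # The cassette id is reported together with its HMM set, since ids are
    # not comparable across HMM sets.
    print(f"{p['hmm']} cassette {p['cassette_id']}: {label} ({prob})")
```

`ast.literal_eval` is used rather than `eval` so that only literal values are parsed from the column.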

The cassettes can be found in output/cassettes (if you are using the default output path).

yjiakang commented 4 years ago

Thanks for your useful reply. In this example, does that mean we got 2+2+3=7 CRISPR-Cas cassettes in total?

padilha commented 4 years ago

In this case, yes. Note that we obtained 2 cassettes for the set HMM1, 3 cassettes for the set HMM3 and 2 cassettes for the set HMM5. Then, you can check output/cassettes (or the directory that you are using as output) to see which HMM set is better.
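The per-set counts can also be tallied with a few lines of Python. This is only a sketch: the `pairs` list is copied by hand from the example file, and the deduplication step assumes that a run with several classifier/regressor combinations could list the same cassette more than once per HMM set.

```python
from collections import Counter

# (HMM set, cassette_id) pairs copied from the example prediction.csv.
pairs = [
    ("HMM1", 1), ("HMM1", 2),
    ("HMM3", 1), ("HMM3", 2), ("HMM3", 3),
    ("HMM5", 1), ("HMM5", 2),
]

# Assumption: deduplicate so that each cassette is counted once per HMM set,
# even if several classifier/regressor combinations report it.
unique_pairs = set(pairs)

per_hmm = Counter(hmm for hmm, _ in unique_pairs)
total = sum(per_hmm.values())

for hmm in sorted(per_hmm):
    print(hmm, per_hmm[hmm])
print("total", total)
```

The total adds up cassettes across HMM sets, so the same genomic region may be counted more than once if different HMM sets detect it.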