aldro61 / kover

Learn interpretable computational phenotyping models from k-merized genomic data
http://aldro61.github.io/kover/
GNU General Public License v3.0

How to predict new isolates with saved fasta model? #65

Closed augustkx closed 2 years ago

augustkx commented 2 years ago

Hi, is it possible?

Thanks very much!

aldro61 commented 2 years ago

Hi Kaixin,

Yes, this is possible. Kover doesn't have this feature built in, but you can do it with the saved model files, which contain all the information you need to apply the model to new data. Given the model.fasta, you can use the k-mer counting software of your choice to test the presence/absence of the k-mers used by the model in each new genome, then apply the decision logic specified in the file to obtain the final prediction.

For conjunctions/disjunctions

You can see an example model file here. This file tells you which k-mers the model uses and whether it is their presence or their absence that is used to compute predictions. Unfortunately, for conjunction/disjunction models, the fasta file doesn't specify the nature of the model, so you need to check the "best_hp" key in the results.json produced by Kover to determine whether it was a conjunction or a disjunction.
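As a rough illustration of the logic described above, here is a minimal Python sketch. It is not part of Kover. It assumes the rules have already been read by hand from model.fasta into `(kmer, "presence" | "absence")` pairs and that the model type has been looked up in results.json; the names `genome_kmers` and `predict` are hypothetical. It also assumes that a k-mer counts as present if it occurs on either strand, and that a return value of 1 means the model's logical expression is satisfied.

```python
# Hypothetical helper names; this is a sketch, not Kover's own code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def genome_kmers(contigs, k):
    """Collect every k-mer found on either strand of the given contigs."""
    kmers = set()
    for contig in contigs:
        contig = contig.upper()
        for strand in (contig, reverse_complement(contig)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def predict(rules, model_type, contigs):
    """Apply a conjunction or disjunction of presence/absence rules.

    rules: list of (kmer, "presence" or "absence") pairs, transcribed
           from model.fasta (assumption: all k-mers share one length).
    model_type: "conjunction" or "disjunction", taken from results.json.
    """
    k = len(rules[0][0])
    present = genome_kmers(contigs, k)
    # A rule is satisfied when the observed presence matches its type.
    outcomes = [
        (kmer.upper() in present) == (rule_type == "presence")
        for kmer, rule_type in rules
    ]
    if model_type == "conjunction":
        return int(all(outcomes))
    if model_type == "disjunction":
        return int(any(outcomes))
    raise ValueError(f"unknown model type: {model_type!r}")
```

For real genomes, a dedicated k-mer counter would replace `genome_kmers`; only the final `all`/`any` step is specific to the model.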

For decision trees

You can refer to this tutorial. The header in the fasta file is a bit complex, but basically it tells you, for each rule (which corresponds to a decision node in the tree), what its children are (either a rule ID or the configuration of a leaf, depending on the child's nature). You can see visually what this corresponds to further down in the tutorial.
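To make the tree traversal concrete, here is a minimal Python sketch. The in-memory layout is purely hypothetical and does not mirror the fasta header syntax: each rule is a dict with a `kmer` and two children, `if_present` and `if_absent`, where a child is either another rule ID (a string) or a leaf dict carrying a `label`. Which branch Kover associates with presence or absence must be read from the actual header.

```python
def predict_tree(rules, root_id, present_kmers):
    """Walk a decision tree of k-mer presence tests down to a leaf.

    rules: dict mapping rule ID -> {"kmer", "if_present", "if_absent"}
           (hypothetical layout, transcribed by hand from the model file).
    root_id: ID of the rule at the root of the tree.
    present_kmers: set of k-mers found in the genome to classify.
    """
    child = root_id
    # Keep descending while the current child is a rule ID, not a leaf.
    while not isinstance(child, dict):
        rule = rules[child]
        if rule["kmer"] in present_kmers:
            child = rule["if_present"]
        else:
            child = rule["if_absent"]
    return child["label"]


# Toy two-node tree with made-up k-mers and labels.
toy_rules = {
    "r1": {"kmer": "ACG", "if_present": "r2", "if_absent": {"label": 0}},
    "r2": {"kmer": "TTT", "if_present": {"label": 1}, "if_absent": {"label": 0}},
}
print(predict_tree(toy_rules, "r1", {"ACG", "TTT"}))
```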

I realize that this is complicated. Please don't hesitate to reach out if you need more guidance!

Alex

augustkx commented 2 years ago

Hi Alex,

Thanks very much! Very clear, and I think I understand it now (hope correctly).

But this reminds me of one thing regarding the “kover dataset create” commands used to prepare training data. What I used to do was set one pseudo-sample as the testing set. It seems that, when using the bound version, I can learn a model without any testing sample by leaving the test-ids file empty, right?

Best, Kaixin

aldro61 commented 2 years ago

That's correct. You can either specify an empty list of testing IDs or simply use --train-size 1 in the kover dataset split command. No need to use pseudo-samples to simulate a testing set in this case.

augustkx commented 2 years ago

I see. Thanks!

aldro61 commented 2 years ago

You're welcome! Closing this issue now. Don't hesitate to reach out if you have any other questions.

augustkx commented 2 years ago

Just a note on the tool used for generating k-mers: DSK converts all k-mers to their canonical representation with respect to reverse-complementation. A canonical k-mer is not necessarily the lexicographically smallest one! DSK uses a different ordering for faster performance.

So users who generate k-mers with other tools should be aware of this.

Hi Alex, I would appreciate it if you could confirm this note. Thanks!

dawnmy commented 2 years ago

I have the same question: does Kover use the canonical k-mer as defined in DSK, i.e. with the ordering A < C < T < G?
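To show why the ordering matters, here is a small Python sketch that computes the canonical form of a k-mer under two base orderings. The ordering A < C < T < G is the one mentioned in this thread and is treated here as an assumption, not as confirmed behaviour of the DSK version bundled with Kover; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer, order="ACGT"):
    """Return the smaller of a k-mer and its reverse complement.

    order: the base ranking used for comparison. "ACGT" is plain
    lexicographic order; "ACTG" is the ordering discussed in this
    thread (an assumption about DSK, not verified here).
    """
    rank = {base: i for i, base in enumerate(order)}
    rc = reverse_complement(kmer)
    return min(kmer, rc, key=lambda s: [rank[b] for b in s])


# The two orderings disagree whenever the comparison hinges on G vs T.
print(canonical("GAA"))                # lexicographic: GAA
print(canonical("GAA", order="ACTG"))  # A < C < T < G: TTC
```

One way to sidestep the question when applying a model to new genomes is to test each model k-mer and its reverse complement, rather than relying on either tool's canonical form.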

aldro61 commented 2 years ago

Hello @augustkx and @dawnmy,

Thanks for the comment. I am not deeply familiar with the inner workings of k-mer counting tools. We use the version of DSK included in GATB 1.2.2 (see here), so it may be possible to confirm this by reading about that specific version. Whichever ordering it uses, we use its output as-is. Please let me know if you find an answer there.

Best, Alex