OndrejSladky / kmercamel

KmerCamel🐫 provides implementations of several algorithms for efficiently representing a set of k-mers as a masked superstring.
MIT License

CLI: Better subcommand structure #78

Open karel-brinda opened 1 week ago

karel-brinda commented 1 week ago

I'm testing the latest version, and it's increasingly obvious to me that we'll need to restructure the CLI. Currently it's confusing, and the present layout is likely not permanent anyway.

Specifically, we need a good structure of well-separated subcommands. It might also be useful to have a special MS-specific file suffix, e.g. .msfa.

The usage should be simpler, e.g.:

kmercamel ms -k 31 genome.fa > genome.msfa
kmercamel optimize genome.msfa > genome_maskopt.msfa
kmercamel reformat -m mask.txt -s superstring.txt genome_maskopt.msfa
kmercamel reformat -P mask.txt -S superstring.txt > glued.msfa
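
A minimal sketch of how the subcommand layout in these examples could be wired up, written in Python with `argparse` purely for illustration (it is not KmerCamel's actual code). The subcommand names and the `-k`, `-m`, `-s`, `-P`, `-S` options come from the examples; the `dest` names, help strings, and the reading of `-m`/`-s` as outputs and `-P`/`-S` as inputs are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subparser per proposed subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="kmercamel")
    sub = parser.add_subparsers(dest="command", required=True)

    # kmercamel ms -k 31 genome.fa
    ms = sub.add_parser("ms", help="compute a masked superstring")
    ms.add_argument("-k", type=int, required=True, help="k-mer length")
    ms.add_argument("fasta", help="input FASTA file")

    # kmercamel optimize genome.msfa
    opt = sub.add_parser("optimize", help="re-optimize the mask")
    opt.add_argument("msfa", help="input masked superstring")

    # kmercamel reformat -m mask.txt -s superstring.txt genome_maskopt.msfa  (split)
    # kmercamel reformat -P mask.txt -S superstring.txt                      (glue)
    ref = sub.add_parser("reformat", help="split or glue mask and superstring")
    ref.add_argument("-m", dest="mask_out")         # assumed: mask output path
    ref.add_argument("-s", dest="superstring_out")  # assumed: superstring output path
    ref.add_argument("-P", dest="mask_in")          # assumed: mask input path
    ref.add_argument("-S", dest="superstring_in")   # assumed: superstring input path
    ref.add_argument("msfa", nargs="?", help="input file, only given when splitting")
    return parser


args = build_parser().parse_args(["ms", "-k", "31", "genome.fa"])
print(args.command, args.k, args.fasta)
```

With one subparser per step, each subcommand only exposes the options that make sense for it, which is the separation the proposal asks for.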

Notes:

What do you think @OndrejSladky @PavelVesely ?

karel-brinda commented 1 week ago

OK, immediately after writing this ticket, I completely forgot the -c param in the command I was running, which likely made the computation much slower. This really needs to be fixed :) (I guess ~90% of users forget this as well.)
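
One way to make such a flag hard to forget is to enable the behaviour by default and require an explicit opt-out. A minimal Python `argparse` sketch, for illustration only: the long option names `--complements` and `--no-complements` are hypothetical, and the meaning given to `-c` in the help string is an assumption, not something stated in this thread.

```python
import argparse

parser = argparse.ArgumentParser(prog="kmercamel ms")
# Hypothetical: the behaviour behind -c is enabled unless the user opts out.
# BooleanOptionalAction (Python 3.9+) also generates --no-complements.
parser.add_argument(
    "-c", "--complements",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="assumed meaning: treat a k-mer and its reverse complement as equal",
)

print(parser.parse_args([]).complements)                     # flag omitted
print(parser.parse_args(["--no-complements"]).complements)   # explicit opt-out
```

Existing invocations that already pass `-c` keep working, while users who omit it get the fast path instead of silently losing it.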

PavelVesely commented 5 days ago

One more thing: it'd be great to compute the MS and optimize the mask with a single command. We would avoid storing and loading the non-optimized MS from disk, and it would be simpler to measure the time and memory requirements of both steps if they were executed by a single command.

karel-brinda commented 5 days ago

We actually discussed this at the previous meeting. This was the issue: while it's simple for max ones, there are many things that can go wrong for min int, and there's a risk that the whole MS computation is lost due to an error in the final optimization part.

karel-brinda commented 5 days ago

Maybe a solution could be, once we have the MS command, to implement default / greedy zero / max ones as a param, and allow min int only in the reoptimization subcommand?
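
A minimal sketch of that split, in Python `argparse` for illustration only. The option name `--mask` and the strategy identifiers are hypothetical; only the grouping (cheap strategies on `ms`, min int reserved for the reoptimization subcommand) comes from the suggestion above.

```python
import argparse

# Hypothetical identifiers for the cheap strategies named in the suggestion.
CHEAP = ["default", "greedy-zero", "max-ones"]

parser = argparse.ArgumentParser(prog="kmercamel")
sub = parser.add_subparsers(dest="command", required=True)

# The ms subcommand only offers strategies that are unlikely to fail.
ms = sub.add_parser("ms")
ms.add_argument("--mask", choices=CHEAP, default="default")

# Only the reoptimization subcommand accepts the expensive strategy.
opt = sub.add_parser("optimize")
opt.add_argument("--mask", choices=CHEAP + ["min-int"], required=True)

print(parser.parse_args(["ms", "--mask", "max-ones"]).mask)
print(parser.parse_args(["optimize", "--mask", "min-int"]).mask)
```

Passing `--mask min-int` to `ms` is rejected at argument-parsing time, before any computation starts.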

PavelVesely commented 5 days ago

> We actually discussed this at the previous meeting. This was the issue: while it's simple for max ones, there are many things that can go wrong for min int, and there's a risk that the whole MS computation is lost due to an error in the final optimization part.

Makes sense: optimizing the number of runs can fail due to large memory consumption, or might just run for a very long time (say, a few days, even if the default MS is computed in hours).
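
One possible way to keep a combined command while limiting this risk is to write the default MS to disk before the expensive step starts, so that a failed optimization loses nothing. A minimal Python sketch under that assumption; `compute_ms` and `optimize_mask` are hypothetical stand-ins passed in as callables, not KmerCamel functions, and the toy string is not real output.

```python
import os
import tempfile
from typing import Callable


def ms_then_optimize(
    compute_ms: Callable[[], str],
    optimize_mask: Callable[[str], str],
    out_path: str,
) -> bool:
    """Write the default MS first, then try to replace it with the optimized one.

    Returns True if the optimized MS was written, False if the default was kept.
    """
    ms = compute_ms()
    with open(out_path, "w") as fh:  # checkpoint: the expensive result is now on disk
        fh.write(ms)
    try:
        optimized = optimize_mask(ms)
    except (MemoryError, RuntimeError):
        return False  # optimization failed; the default MS stays on disk
    tmp = out_path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(optimized)
    os.replace(tmp, out_path)  # swap in the optimized file in one step
    return True


def failing_optimizer(ms: str) -> str:
    """Simulates an optimizer that runs out of memory."""
    raise MemoryError("simulated failure")


path = os.path.join(tempfile.mkdtemp(), "genome.msfa")
print(ms_then_optimize(lambda: "ACgT", failing_optimizer, path))
with open(path) as fh:
    print(fh.read())
```

The MS is written once but never read back, so the load step is still avoided. A run that simply takes days raises no exception, but the checkpoint means it can be interrupted without losing the default MS.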

> Maybe a solution could be, once we have the MS command, to implement default / greedy zero / max ones as a param, and allow min int only in the reoptimization subcommand?

This would still be useful, but it's not critical.