c-blake / cligen

Nim library to infer/generate command-line-interfaces / option / argument parsing; Docs at
https://c-blake.github.io/cligen/
ISC License

Allow flag truncation #99

pb-cdunn closed this issue 5 years ago

pb-cdunn commented 5 years ago

You can label this as a feature request.

proc main*(ref_fn: string) =
  ...
./pb phasr --ref=subreads1.fasta
Unknown long option: "ref"

Maybe you meant one of:
        ref-fn help

Run with --help for full usage.
make: *** [default] Error 1
utils:nim% ./pb phasr --help
phasr [required&optional-params]
  Options(opt-arg sep :|=|spc):
  -h, --help                        print this cligen-erated help
  --help-syntax                     advanced: prepend,plurals,..
  -r=, --ref-fn=  string  REQUIRED  set ref_fn

Allowing flag-truncation would be a nice feature.
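
For concreteness, a hypothetical session once truncation lands (output imagined, not from any released cligen):

./pb phasr --ref=subreads1.fasta    # unique prefix of --ref-fn: accepted
./pb phasr --re=subreads1.fasta     # still unique here, so also accepted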

See discussion in #97.

c-blake commented 5 years ago

As a slight filling out of this issue: besides using something like the stdlib's critbits.itemsWithPrefix to prefix-match user input against long option keys, we will also eventually want it for Nim enum values as well as for dispatchMulti subcommand names.

This feature is quite involved in that it alters much core logic in generated code from case statements to other constructs. The suggestion machinery still has a role for total mismatches, but it will also need to handle more than one prefix match, not just the current zero-match case; i.e., report an "ambiguous prefix" error rather than suggestions for a possibly mis-keyed entry. It is actually probably the single most complex new feature implemented since the original idea. Given its complexity, it's best to start with one domain and expand from there; enum values are probably the simplest such domain.
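
For illustration, a minimal sketch of that matching logic using just the stdlib's critbits (prefixMatch and the MatchKind names are made up here, not cligen internals):

import critbits

type MatchKind = enum mkNone, mkUnique, mkAmbiguous

proc prefixMatch(keys: CritBitTree[void]; prefix: string):
    tuple[kind: MatchKind, hits: seq[string]] =
  ## Collect every long-option key sharing `prefix`; an exact match wins.
  if prefix in keys:
    return (mkUnique, @[prefix])
  for k in keys.itemsWithPrefix(prefix):
    result.hits.add k
  result.kind =
    case result.hits.len
    of 0: mkNone        # total mismatch: run the suggestion machinery
    of 1: mkUnique      # unambiguous abbreviation: accept it
    else: mkAmbiguous   # the new "ambiguous prefix" error

when isMainModule:
  var keys: CritBitTree[void]
  for k in ["ref-fn", "help", "help-syntax"]: keys.incl k
  echo prefixMatch(keys, "r")     # unique: ref-fn
  echo prefixMatch(keys, "help")  # exact match wins over help-syntax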

The value is also potentially great, though. It can ease ergonomics for end-users substantially, often shrinking common key-strokes to one- or few-letter abbreviations and constituting a pretty intuitive "built-in aliasing" system. When keying in commands hot & heavy one can be brief, and invocations can be made more verbose/self-documenting when finalized in a script or documentation.

c-blake commented 5 years ago

Ok. I did enum values, which was easy and seems to work fine in the context of the new dups duplicate detector example program [ which is also the fastest such program I have been able to find...about 1.25 to 2x faster and more flexible on my systems than jdupes, a C program that has been performance-focused for 20 years, all in more than 10x less source code. Goooo, Nim ;-) ]. In dups, both enum value sets happen to allow every key to be abbreviated to a single letter without me even planning it that way, though I suppose some folks might have written "SHA1" as "sha1".

Anyway, just as an FYI, that's probably as much as will happen on this before the next cligen release. I do like this feature, though. Another way to describe it: it's a lot like TAB-autocompletion in shells/other editing environments, but automatic and built into the CLI engine, without the need to even hit the TAB key (though also without the feedback of TAB fully spelling out the word before you hit ENTER, so not quite perfect, but very often nicer than not).

c-blake commented 5 years ago

Ok. So, two of three cases are done here. All that really remains is the subcommand name matching (on the input side..that also needs helpCase attention on the output side). Let me know if you have any trouble.

c-blake commented 5 years ago

The subcommand case, arguably the one with most precedent (Mercurial's CLI), is the trickiest in that parseopt3 stopWords checking also needs a CritBitTree as well as the usual machinery in cligen. Not sure when I'll get to that, but with kebab-case now recognized for subcommands things are still much better than they were a few days ago.

pb-cdunn commented 5 years ago

I don't understand optionNormalize(). The normalized string should retain some word separators. Otherwise, foo_bar and foobar are the same. This is the biggest mistake that Araq made, IMO. Nim's niche is as a replacement for Python, a high-productivity but fast language. But with style-insensitive identifier equality he has made C/C++ interop difficult, and now you're making it nearly impossible to replicate an existing CLI.
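
For readers unfamiliar with the rule being objected to, a simplified sketch of this kind of normalization (not cligen's exact optionNormalize, whose details may differ):

import strutils

proc optNorm(s: string): string =
  ## Roughly: case, '_', and '-' are all ignored, so foo_bar, fooBar,
  ## foo-bar, and foobar all normalize identically.
  for c in s:
    if c notin {'_', '-'}:
      result.add c.toLowerAscii

when isMainModule:
  assert optNorm("foo_bar") == optNorm("foobar")
  assert optNorm("fooBar") == optNorm("foo-bar")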

I think cligen should Do The Right Thing by default, but it should allow configuration for special cases. Suppose I have an existing tool which takes both --foo-bar and --foobar. I think you've made it impossible for me to rewrite that in Nim using cligen, right?

Maybe I'm wrong. Can I cause --foo-bar to delegate to foo_bar(), but --foobar to delegate to foobar_aux()? Well, if not, I guess I can live with it. I certainly wouldn't stop using cligen. But this is a nettlesome area of Nim.

c-blake commented 5 years ago

No, you're not wrong, and I actually agree with you in spirit and would wholeheartedly prefer everything to be case and underscore sensitive. If Nim ever becomes sensitive, I'd be ecstatic to track that. In practice, though, since I'm inferring a CLI from a style-insensitive set of identifiers, I don't see any real choice but to "follow along" with Nim's convention (with an extra squashing of dashes). Otherwise things that are distinct in the CLI name space would collide in the Nim space and there'd have to be some whole disambiguation rigamarole.

c-blake commented 5 years ago

(and you do give an example of what part of the rigamarole would look like..some kind of table mapping sensitive idents to insensitive).

The idea of cligen, as I see it, was never to "port" some existing syntax, but to synthesize one as intuitively as possible given the programming-language context. You can always just use parseopt/parseopt3 to create whatever syntax you want. If you have firm constraints/are really picky about syntax, that's always a possible eventuality.

c-blake commented 5 years ago

And I would qualify it as "nearly impossible to replicate some existing" CLIs, but only through the whole auto-inference mechanism. I doubt there are very many that have --foobar and --foo-bar do different things. Such would probably be pretty confusing to many CLI users, at least.

c-blake commented 5 years ago

The 3rd subcommand case now works. I.e., cligenerated parsers can now accept very abbreviated user input, e.g. ./test/MultMultMult a c y --ho 3 instead of the wordier ./test/MultMultMult apple cobbler yikes --hooves 3. So, closing this issue as resolved.

As to not drinking a fatal dose of the style-insensitive Nim Kool-Aid, I'm not dead-set against some optional way to specify a mapping..Probably lengthen and optionNormalize and maybe helpCase all grow an object parameter and allow CLI author configuration as desired with some default that is the most "Nim-esque". Personally, this strikes me as a very rare corner case of low priority, though, more theoretical than practical much like arbitrary back-ticked identifier names. Araq seems dead set against sensitivity, making dramatic claims like he "literally cannot wrap his head around how anyone with half a brain would want sensitivity".

I get the conflict - different devs/teams want different ident conventions, there will always be disagreements, and you want to provide a way to use libraries according to the local convention. I think a better way than insensitivity would be to have a sub-syntax/Nim DSL that let you "declare conventions" and have them checked/enforced by the compiler to whatever extent practical. E.g., "global vars begin or end with '_'" could be easily checked by the compiler. Type names, proc names, enum values, and other symbol kinds could all similarly be checked. Then import could grow a generic parameter to specify the style being imported...as in import[snake] lib1, lib2 or import[camel] lib3, lib4. Then internally things can be sensitive, externally they can be whatever, and the ability/inability to translate between them can be adjudicated at the time of style declaration. Nim could even provide 3 or 4 fixed styles that cover 99% of the disagreement space so that most code would never need an explicit import[camel]-style usage. This approach would be more like a scalpel instead of the chainsaw of normalizing idents. Anyway, I mentioned variants of this on the most recent "can we fix this aspect of Nim" forum thread ( https://forum.nim-lang.org/t/4388 ) to thunderous silence/non-response.

For cligen, though, it is fundamentally a Nim library re-purposing Nim proc definition syntax via getImpl as a CLI-specification language that users basically don't have to learn because they already know it by circumstance. So, if the definition syntax is insensitive (and they know that, even if they hate it), it seems to me the generated interface should also be, at least by default.

Also, to the extent that insensitivity is about not fretting over details of well-known tokens while keying in "other things" fast & furious, and that actual interactive command running in a shell or shell-like REPL is generally much more likely to be fast & furious, insensitivity probably matters more for cligen than for a language read/edited with an editor on a file. This is surely why gdb and gnuplot had the unambiguous prefix/truncation rules of this issue back in the 1980s. Sure, those can also both be scripted (and are), but much of the gdb/gnuplot use was/is dynamically interactive. I think dynamic interactivity is also the crux of whether case-insensitivity is boon or bane in filesystem path usage, though that's not usually the focus of the sensitivity discussion. Honestly, it all comes down to "when do people want to be token-sloppy and when not", and that mostly comes down to REPL-like dynamic interactivity vs. more careful editing (in my experience).

pb-cdunn commented 5 years ago

Glad to read your thoughts, and I largely agree.

Did some testing. Everything seems to be working well!

pb-cdunn commented 5 years ago

A colleague just coded this up in C++:

const CLI_v2::Option MinSnr{
R"({
    "names" : ["min-snr"],
    "names.hidden" : ["minSnr"],  # backward-compatible with camelCase
    "description" : "Minimum SNR of input subreads.",
    "type" : "float",
    "default" : 2.5
})"
};

I wanted to tell him how simple this is in Nim+cligen, but he loves C++.
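
For comparison, a sketch of roughly the same option in Nim+cligen (the proc name and body are made up for illustration; dispatch's help parameter is real cligen API):

import cligen

proc phasr(minSnr: float = 2.5) =
  ## Hypothetical body; only the signature matters for the CLI.
  echo "min SNR: ", minSnr

when isMainModule:
  dispatch(phasr,
           help = {"minSnr": "Minimum SNR of input subreads."})

And style-insensitivity means --minSnr, --min-snr, and --min_snr all already work, covering the "names.hidden" backward-compatibility case above for free.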

c-blake commented 5 years ago

Ack. That's probably only part of what you have to learn to use CLI_v2.

As I think I've mentioned before, the argh/cligen auto-code generation approach is feasible, in principle, in any language with named parameters with default values where the parameter type can be inferred from the default value (as in Python) or is also available (as in Nim). That includes C++ and many other popular languages. Besides argh there are a few other Python attempts like http://micheles.github.io/plac/ which may have been the "first".

If you don't have a nice introspection API like Python or compile-time static macros like Nim and its getImpl, then you do have to write a parser for at least the function declaration sub-language, and "enough" to point that partial parser at some functions in some files..so maybe a way to bracket the function. There are a few widely used languages like pure C & Java that this might not work for. However, there are many it could work for. Of course, you won't find me signing up to write a C++ partial parser, but there are people who do crazy things like that, surprisingly enough.

My motivation for writing cligen was just something to have for myself since I got used to argh. My motivation for releasing cligen, dealing with support requests, etc. was to hope to lead by example a little, if not just in Nim, then perhaps elsewhere. You don't need a dynamic language to have great CLI developer (& user) ergonomics. I think it sells itself well, but thanks for promoting it to your colleague. :-)

The recent unambiguous prefix/flag truncation feature makes the user experience the best of any CLI framework I know. It is making me want to re-write everything I use a lot in Nim just to get it (but I may be an outlier in being good at remembering abbreviations). gdb/gnuplot have had the feature forever, but as far as I know it is unique among CLI frameworks. Why, if your users get used to this convenient feature, they may pester your colleague to support it, and he might wind up doing his own version. It's not hard. I'm not sure why it's so rare. While critbits is fancier than the average stdlib has, the "set sizes" of each namespace (option keys, enums, subcommands) are structurally so small that almost any dumb quadratic algorithm would also work fine.

c-blake commented 5 years ago

By the way, I put a little thing about unambiguous prefix matching in the BASIC CHEAT SHEET as just the 4th item available under anycommand --helps (with the new feature :-).

Not sure if it would really help someone who didn't know how it worked. Let me know if you have any suggested update to that help text. I try to keep that --help-syntax brief enough to mostly fit on one screen, but expansive enough to help people with advanced capabilities/trickier corners of cli-generated interfaces. It's a tough line to walk. In a documentation setting, a reader's attention can be both precious and fleeting.

jbruchon commented 5 years ago

Where can I find the dups program mentioned in this thread? I am curious as to how it is faster. Is it doing full file byte-for-byte safety checks, or is it hashing and comparing hashes?

c-blake commented 5 years ago

The examples directory.

c-blake commented 5 years ago

Oh, and it can do the byte-by-byte checks if you want, or just hashing if you want that, as with jdupes. You can pick from a few hash functions (from the near-zero-cost file-size hash to the crazy-slow SHA1), and tell it to cmp or not. Wang Yi's hash is pretty good as non-crypto hashes go, but my Nim impl does not have the same short-string optimizations. For larger buffers it seemed much faster than the hash jdupes uses.

If you are trying to test timings, you should note that with Nim you usually need to compile with -d:release (and now maybe -d:danger depending on your Nim version) to get good performance. I did not do any sort of "constructed case" benchmarking - I just had 3..4 large file trees I tried them both on. I expect you probably have a more interesting variety of test cases that you've collected.

To use the parallel hashing feature (probably not a win unless you are either using SHA1 or have really large but identically sized files) you need to activate OpenMP at the C compiler level as well, with -fopenmp in your cflags in your nim.cfg. You can usually tell you got it working by running with -HSHA1 and seeing >100% CPU. In truth, at least in my experiments, parallelism is probably rarely useful. I can mentally construct some situations where it would pay off, but they probably aren't very organic.
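
Concretely, something like this should do it (assuming gcc-style OpenMP flags; the -H option is from above):

nim c -d:release --passC:-fopenmp --passL:-fopenmp dups.nim
./dups -HSHA1 ...    # >100% CPU in top suggests OpenMP kicked in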

Besides parallelism, there are probably a few other features in there jdupes doesn't have (and vice versa!), like doing mmap'd IO (this may be a real source of speed advantage for fully buffered cases, besides the faster hash), sorting by any of the file timestamps, and being able to consider "near duplicates" by only looking at slices of a file if header or footer sizes are at all predictable.

I delegate path name generation to find (or Zsh ** extended globbing). I realize some people might want a more "turnkey" system, but find has a whole mini-language for such stuff. As long as you can use -print0 and take a file of paths, that seems the best answer for path name generation.
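
E.g., something along these lines (the exact dups option for consuming the path file isn't shown in this thread):

find /data -type f -print0 > /tmp/paths0    # NUL-terminated to survive odd filenames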

Anyway, what I most like about that little 160-line Nim program is that the algorithms are not buried under much chatter. Indeed, I considered ripping out the parallelism work-list complexity when measurements showed it not really helping except for SHA1. If your aim is to speed up jdupes, I think a faster hash and mmap'd IO would probably do the trick. Of course, if off-magnetic-platter IO is the bottleneck then these won't help much, but off an NVMe card I think they do.

Oh, and at some level of hash function speed, parallel RAM IO is actually faster than single-core RAM IO. So, that could kick in. The only hash function I've played with that is fast enough for that aspect is the one from

Daniel Lemire, Owen Kaser, Faster 64-bit universal hashing using carry-less multiplications, Journal of Cryptographic Engineering

That one I've clocked at quite a bit faster than single-core RAM bandwidth (at least any RAM I've personally had access to). I've never tried it in the context of this dups business, though.

jbruchon commented 5 years ago

FWIW, jdupes has a -Q option that compares by hash without doing full file checks. A faster hash probably won't help much, as the primary bottleneck at this point is waiting on I/O operations to complete (I just fired off a test on an AMD A8-9600 with a big md RAID-5 and CPU usage never exceeded 3%). From what I've read, mmap is not going to be faster for the sequential full-file reads I'm doing. On Linux the fread() buffer size auto-tunes to half of the L1 D-cache size because that results in the lowest number of cache misses, which grow massively once past that size.

I'm looking into ways to get some parallel action going, but I'm not ready to crowbar any in just yet. The next major optimization I'm going to do is probably multi-threaded parallel file hashing for files on different physical devices, though I'm currently grappling with the difficulty in reliably finding out if two files share a physical volume. Perhaps on fast SSDs it would make sense to read multiple files at once from the same device, but most of my use cases involve spinning rust.
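
A rough sketch in Nim of the naive starting point (naive precisely because, as noted, st_dev alone can lie about the physical volume under md/LVM and friends):

import posix, tables

proc groupByDevice(paths: openArray[string]): Table[uint64, seq[string]] =
  ## Bucket paths by st_dev; files in different buckets are at least
  ## candidates for parallel reads. md/LVM/etc. can fuse many physical
  ## devices behind one st_dev, which is the hard part.
  for p in paths:
    var st: Stat
    if stat(p.cstring, st) == 0:
      result.mgetOrPut(uint64(st.st_dev), @[]).add p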

I do have access to large and diverse data sets, including servers with well over a million files, and usually ranging from a few blocks to hundreds of gigabytes all sitting on well-aged XFS and NTFS filesystems. /usr/src/linux-* is my favorite "large but still fits in the buffer cache" set to test with. Sadly, all the CPU in the world won't make the disks spin any faster.

c-blake commented 5 years ago

I was mostly comparing jdupes -mnr . with dups -cbls, or adding the -Q to jdupes and dropping the c from dups, in 100% RAM-cached trials. So, probably not the point in design space that matters most to you (or anyone, arguably, since it's the fast point, not the slow pain point). Cold-cache tests are much more involved to run: you have to echo 3 > /proc/sys/vm/drop_caches in there, and they're slower, and they're noisier, meaning bigger multi-run sample sizes. So, it all multiplies out to probably 100+X more time to run them, spinning rust even more so, but you know all that.

Doing a hot cache test just now on my /usr/src/linux I just got times where jdupes was 7% faster (410 ms jdupes vs 439 ms dups). So, dups is definitely not "faster in all circumstances". I never meant to suggest that, of course - users could always turn on parallelism and compete for disk head location on spinning rust, etc. It was just a side comment about Nim elegance/performance in the context of my little CLI generation package, not a full analysis (especially for your 100% slow IO bound use case).

Collision distributions and specifics vary a great deal. During development I had a little report to print out the collision size distros pre- and post- hash splitting, but I ripped that out of the example code as a distracting 30-40 lines of code. I would recommend it for a more supported utility program like jdupes, for performance anomaly forensics if nothing else.

The "main event" of that dups.nim is just the iterator dupSets. One idea it uses which may or may not be in jdupes is always doing byte-by-byte compares on 2-way size collision sets (the most common collision set size in my organic cases) since hashing both files always takes at least as long (in both IO and CPU) as comparing, and sometimes much longer if the files differ in early bytes. I did try doing the analogous 3-way optimization but it didn't help in my (admittedly fairly superficial) benchmark cases. That could help more in your spinning rust scenarios, though.

And speaking of making sure you saturate however many ~700 MB/s SATA buses or whatever SCSI/etc. buses you have, parallelizing only across but never within disk-head competition zones, and of your StackOverflow question: given both the ever-growing number of ways to alias/indirect across IO layers and portability concerns, I think the best answer is not interrogating st_dev, bus IDs, and the like, but actually timing things. On the one hand that is very slow, but on the other hand, the sets of st_dev values compatible with parallel access change even more rarely than such timing runs would hurt. So, it's an eminently cacheable answer, and a timing approach could both A) be portable and B) elegantly handle a mixture of Optane, NVMe, SSD, spinning rust, etc. Only the user knows what file trees they'll be running jdupes against and how time-varying the backing stores may be; jdupes only cares about "compatibly parallel st_dev", and jdupes is not alone. So, I feel those concerns should be well separated.

So, the measurement should even be an external program producing just a file of parallel-IO-compatible st_dev values. jdupes could even offload whether/when to run said external program onto the user. Optimizing that parallel-compatibility measurement program is, of course, its own whole topic. I have not searched, but that seems like the kind of problem someone has probably spent a lot of time on, and maybe even produced a useful tool for, which jdupes could simply recommend to create a file it can use.

Honestly, if you ask me, it would not be crazy to say this cached answer is something a good sysadmin setup would update only as required and provide to application programs in /etc. It may not be as static as /etc/fstab, but it's probably close to that static on 99+% of installations. There is, of course, always the issue of other users/totally unrelated processes competing for disk head locations. On multi-user IO systems that is fundamentally too dynamic to resolve simply { and hostile users can always put a wrench in the works with, e.g., while(1) sync(); among other things, and even do so somewhat anonymously: there are no block-dev IO quotas, 80-90% of IO is anonymized by the time it gets to block devs, and /proc/PID/io only gives an operation breakdown, not a block-dev breakdown. Ack. }.

Anyway, some new /etc file would at least allow mutually cooperating threads/users a micro-database to aid their orchestration, which seems like very useful and mostly static metadata. If it turned out to be more dynamic than I suspect, Linux might add some /proc or /sys file to spit it out, like it did with /proc/mounts, but those are probably far-future days. Maybe Kubernetes has something like this in it. As I mentioned, I haven't looked, but it's obviously of general concern. The kernel is also in the best position to both know about and measure such parallel IO compatibility.

c-blake commented 5 years ago

By the way, it sounds like you deal with large file hierarchies ("millions"). So, the examples/newest.nim program might also be of interest to you. Of course, if you can identify a sentinel file then find -newer (-cnewer, etc.) may work fine. If you don't know a sentinel, though, ls -ltr **(.) blows up on the argument list being too long, and really one doesn't need to keep more than a small heap of the wanted entries in memory anyway. My newest example even supports a max(ctime, mtime) thing I call vtime (for version time). I'm unaware of any similar program in any programming language, but I use it all the time now.
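
The heap idea is small enough to sketch (this assumes, as Nim's os module does on POSIX, that FileInfo.creationTime carries st_ctime; newest.nim itself may differ):

import os, times, heapqueue

proc newestN(root: string; n = 10): seq[(Time, string)] =
  ## Keep just the n newest files by "vtime" = max(mtime, ctime),
  ## in a bounded min-heap instead of sorting everything.
  var heap = initHeapQueue[(Time, string)]()
  for path in walkDirRec(root):
    let fi = getFileInfo(path, followSymlink = false)
    heap.push((max(fi.lastWriteTime, fi.creationTime), path))
    if heap.len > n:
      discard heap.pop()      # evict the oldest survivor
  while heap.len > 0:
    result.add heap.pop()     # emits in ascending vtime order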

c-blake commented 5 years ago

(Oh, and I know find -not -type d -printf '%C@ %T@ %P\n' | awk '{ if ($1 > $2) print $1,$3; else print $2,$3 }' | sort -g | tail -n10 | awk '{print $2}' works for e.g. "vtime", but this kind of sort|tail construct is common enough in my experience that I think there should be a most -nX utility that just uses a heap, but I don't know of one..or a sort -NX option. Also, newest supports the new "btimes" that find does not -- yet. Aaaanyway...)