ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0

Standard-compliant wc and split implementation (Issues #97, #98) #99

Closed ghazariann closed 6 months ago

ghazariann commented 6 months ago

I used argparse to handle the arguments and flags, mirroring those of the system's split and wc commands to keep the user experience consistent. The wc functionality replicates the behavior of the system (GNU) command. To keep usage simple, the split implementation omits some flags (such as -a, -b, -C, and -d) and focuses on the essentials: -t (separator), -n (chunk size), -l (line size), and standard input handling. Any suggestions for further improvements?
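
For readers following along, here is a minimal sketch of how such a flag set could be declared with argparse. It is not the code from this PR: the function names `build_parser` and `open_input`, the defaults, and the integer type for -n are assumptions made for illustration. The long option names are the ones GNU split uses.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Declare a split-like interface (illustrative, not the PR's parser)."""
    parser = argparse.ArgumentParser(
        prog="split", description="Split a file into pieces (sketch)."
    )
    # Positional arguments are optional, as in GNU split; "-" means stdin.
    parser.add_argument("file", nargs="?", default="-", help="input file, or '-' for stdin")
    parser.add_argument("prefix", nargs="?", default="x", help="output file name prefix")
    parser.add_argument("-l", "--lines", type=int, default=None, help="lines per output file")
    # Assumption: a plain integer; GNU split accepts richer CHUNKS syntax here.
    parser.add_argument("-n", "--number", type=int, default=None, help="chunking parameter")
    parser.add_argument("-t", "--separator", default="\n", help="record separator")
    return parser


def open_input(name: str):
    """Return a binary stream for the given name, treating '-' as standard input."""
    return sys.stdin.buffer if name == "-" else open(name, "rb")


# Parse an explicit argument list instead of sys.argv to keep the demo deterministic.
args = build_parser().parse_args(["-l", "100", "-t", ";", "input.txt"])
defaults = build_parser().parse_args([])
```

Flags left out of the parser (such as -a, -b, -C, -d) are rejected by argparse with a usage error, which keeps the reduced interface explicit.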

ashvardanian commented 6 months ago

Thanks for the patch! Looks good at first glance; I will look deeper in a few hours. Can your wc variant handle directories and log stats for the many files inside them?

ashvardanian commented 6 months ago

Also, as the functionality is maturing, it would be great to add tests. Any chance you can start the scripts/test_cli_wc.py and scripts/test_cli_split.py, @ghazariann? Thanks again!

ghazariann commented 6 months ago

Sounds good, @ashvardanian. Currently I handle multiple input files with a simple for loop. Do we need a parallel approach for directories? (There might be a lot of files inside.) Maybe threading? Could you also clarify what kind of log stats you mean? I agree that we need tests. Will work on it!
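
One possible shape for the threaded variant, sketched with the standard library only. This is an illustration of the idea being discussed, not code from the PR: `Counts`, `count_file`, and `count_directory` are hypothetical names, and the counting uses plain Python `bytes` methods rather than StringZilla.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    lines: int  # number of newline bytes
    words: int  # runs of non-whitespace bytes (ASCII whitespace only)
    size: int   # total bytes


def count_file(path: Path) -> Counts:
    """Compute wc-style counts for one file, reading it fully into memory."""
    data = path.read_bytes()
    return Counts(data.count(b"\n"), len(data.split()), len(data))


def count_directory(root: Path, max_workers: int = 8) -> Dict[Path, Counts]:
    """Walk a directory recursively and count every regular file in a thread pool."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `files`.
        results = list(pool.map(count_file, files))
    return dict(zip(files, results))


# Small self-contained demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello world\n")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_bytes(b"one\ntwo three\n")
    report = {
        p.relative_to(root).as_posix(): c for p, c in count_directory(root).items()
    }
```

Whether threads actually help depends on where the time goes: they overlap file I/O well, but CPU-bound counting in pure Python is limited by the interpreter lock, so a simple loop may be just as fast for small directories.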

ashvardanian commented 6 months ago

@ghazariann, I was also thinking about parallelism, but I'm not sure how to implement it. Let's start by patching what I've described in #97 and adding tests.

The best test on Linux would be to compare the output of the default CLI tools against the StringZilla variants. I suspect we may have to use the locale metadata to determine what is considered whitespace/newline in each region to get the same results.
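
A rough sketch of what such a comparison could look like. The names `system_wc`, `reference_wc`, and `mismatches` are hypothetical, and `reference_wc` is a pure-Python stand-in: a real test would call the StringZilla CLI script there instead. Pinning `LC_ALL=C` is an assumption that sidesteps the locale question rather than answering it.

```python
import os
import shutil
import subprocess
from typing import List, Tuple

Triple = Tuple[int, int, int]  # (lines, words, bytes)


def system_wc(data: bytes) -> Triple:
    """Run the system wc on stdin and parse its 'lines words bytes' output."""
    env = dict(os.environ, LC_ALL="C")  # fix the locale so whitespace rules are stable
    out = subprocess.run(
        ["wc"], input=data, capture_output=True, check=True, env=env
    ).stdout
    lines, words, size = (int(field) for field in out.split()[:3])
    return lines, words, size


def reference_wc(data: bytes) -> Triple:
    """Stand-in for the implementation under test (pure Python here)."""
    return data.count(b"\n"), len(data.split()), len(data)


SAMPLES = [
    b"",
    b"one line\n",
    b"no trailing newline",
    b"tabs\tand  spaces\n\nblank line above\n",
]


def mismatches() -> List[bytes]:
    """Return the samples on which the two implementations disagree."""
    if shutil.which("wc") is None:
        return []  # no system wc available, nothing to compare against
    return [s for s in SAMPLES if system_wc(s) != reference_wc(s)]
```

Feeding data through stdin keeps the file name out of wc's output, so the first three fields can be parsed directly.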

ashvardanian commented 6 months ago

:tada: This PR is included in version 3.3.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: