allenai / wimbd

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Apache License 2.0

Passing in a directory does not work for me #5

Closed: revbucket closed this issue 6 months ago

revbucket commented 6 months ago

I have a directory with a large number of files (~216k), so the standard wimbd stats dir/* invocation doesn't work. I also can't quite get this pull request to work on my machine.

Helpful tracebacks:

$ ls /raid/mgj528/base_dedup_bff/english50/ | wc -l
216937

$ wimbd --version
wimbd v0.1.1, dc7976b

$ wimbd stats /raid/mgj528/base_dedup_bff/english50/
ERROR [wimbd] at least one path is required
$ wimbd stats /raid/mgj528/base_dedup_bff/english50
ERROR [wimbd] at least one path is required
$ wimbd stats /raid/mgj528/base_dedup_bff/english50/*
-bash: /home/mgj528/.cargo/bin/wimbd: Argument list too long

revbucket commented 6 months ago

Closing my own issue:

The problem was that my files have the extension .jsonl.gz, while wimbd was only looking for files ending in .json.gz. AFAIK this is fixed now.
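
For anyone hitting the same thing, here is a minimal Python sketch of how a suffix filter can silently skip every file in a directory. It is not wimbd's actual code (wimbd is a Rust tool). The function name expand_path and the suffix list are hypothetical, chosen only for illustration. With a filter that accepts only .json.gz, a directory full of .jsonl.gz files expands to an empty list, which would be consistent with the "at least one path is required" error above.

```python
from pathlib import Path
import tempfile

# Hypothetical suffix list for illustration; the exact set wimbd accepts
# is not stated in this thread.
ACCEPTED_SUFFIXES = (".json.gz", ".jsonl.gz")


def expand_path(path, suffixes=ACCEPTED_SUFFIXES):
    """Expand a directory into the files whose names end with one of `suffixes`.

    A path that is not a directory is returned unchanged as a one-item list.
    """
    p = Path(path)
    if p.is_dir():
        return sorted(
            str(f)
            for f in p.rglob("*")
            if f.is_file() and f.name.endswith(suffixes)
        )
    return [str(p)]


def demo():
    """Compare a .json.gz-only filter with one that also accepts .jsonl.gz."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("a.jsonl.gz", "b.json.gz", "notes.txt"):
            (Path(d) / name).touch()
        narrow = expand_path(d, suffixes=(".json.gz",))
        wide = expand_path(d)
        return [Path(f).name for f in narrow], [Path(f).name for f in wide]


if __name__ == "__main__":
    narrow, wide = demo()
    print("narrow filter:", narrow)
    print("wide filter:  ", wide)
```

Expanding the directory inside the program also avoids the shell glob, so the ~216k file names never have to fit into a single command line.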