allenai / wimbd

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Apache License 2.0

Added S3 Support #11

Open revbucket opened 4 months ago

revbucket commented 4 months ago

Added S3 support for wimbd.

Some notes:

As for profiling: on an EC2 instance (a c6g.8xl, 64 vCPUs) with a pool of 10 GiB of .jsonl.gz data (481 files, roughly 20 MiB each):

I'd prefer to keep discussion and reviews of this PR public and on this forum, but please ping me on Slack so I get a notification :D

epwalsh commented 4 months ago

At the moment I'm getting stuck with this error:

Collecting: files 0/1 [00:00:08] [----------------------------------------]
thread 'wimbd-worker' panicked at src/s3.rs:180:14:
Failed to read data from "s3://ai2-llm/pretraining-data/sources/c4/raw/en/train/c4-train.00000-of-01024.json.gz": Custom { kind: Other, error: Error { kind: StreamingError(ThroughputBelowMinimum { expected: Throughput { bytes_read: 1, per_time_elapsed: 1s }, actual: Throughput { bytes_read: 0, per_time_elapsed: 1s } }) } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ERROR [wimbd] Thread worker(s) finished with errors

Any idea?