Open revbucket opened 4 months ago
At the moment I'm getting stuck with this error:
Collecting: files 0/1 [00:00:08] [---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
thread 'wimbd-worker' panicked at src/s3.rs:180:14:
Failed to read data from "s3://ai2-llm/pretraining-data/sources/c4/raw/en/train/c4-train.00000-of-01024.json.gz": Custom { kind: Other, error: Error { kind: StreamingError(ThroughputBelowMinimum { expected: Throughput { bytes_read: 1, per_time_elapsed: 1s }, actual: Throughput { bytes_read: 0, per_time_elapsed: 1s } }) } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ERROR [wimbd] Thread worker(s) finished with errors
Any idea?
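The log itself only says that the stream delivered 0 bytes in a 1s window against an expected minimum of 1 byte, after which the worker panicked instead of recovering. If the stall is transient (an assumption, not a diagnosis), one possible stopgap is to retry the read with a growing delay rather than panicking on the first failure. The sketch below uses only the standard library; `retry_with_backoff` and its parameters are hypothetical names and are not part of wimbd or the AWS SDK.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `op` until it succeeds or `max_attempts` is reached,
/// doubling the sleep between attempts. `op` always runs at least once.
fn retry_with_backoff<T, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

The read at `src/s3.rs:180` could then be wrapped in such a helper so that a stalled stream is re-requested a few times before the worker gives up. Whether a retry is the right fix depends on why the throughput dropped to zero in the first place.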
Added S3 support for wimbd.
Some notes:
warning: function `expand_s3_dir` is never used
but this function is totally used. It's used within a separate tokio runtime, though, so maybe that's it? Silencing this warning at compile time would probably be better.
As far as profiling goes: on an EC2 instance (a c6g.8xl, 64 vCPUs) with a pool of 10 GiB of .jsonl.gz data (481 files, roughly 20 MiB each):
download -> run-wimbd flow, which I'll call a win in my book.
I'd prefer to keep discussion of this PR and its reviews public and on this forum, but please ping me on Slack so I get a notification :D
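On the `expand_s3_dir` warning mentioned in the notes: if the function really is reachable and the lint is a false alarm, one way to silence it is the standard `#[allow(dead_code)]` attribute on that single item. The function below is a hypothetical stand-in, since the real signature and body of `expand_s3_dir` are not shown in this thread.

```rust
/// Hypothetical stand-in for the PR's `expand_s3_dir`: it just joins a prefix
/// with a list of keys. The real function presumably lists objects on S3.
#[allow(dead_code)] // suppresses the "never used" warning for this item only
fn expand_s3_dir(prefix: &str, keys: &[&str]) -> Vec<String> {
    // Drop any trailing slash so the joined paths contain exactly one separator.
    let base = prefix.trim_end_matches('/');
    keys.iter().map(|key| format!("{base}/{key}")).collect()
}
```

Scoping the attribute to the one function keeps the lint active for the rest of the crate, so genuinely unused code elsewhere still gets reported.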