sanjinhub opened 1 week ago
Thanks for the feedback.
In case you haven't seen it, the Encoding Pipeline documentation has some tips on how to scale the pre-processing.
See also the dataset processing instructions, which give some suggested sharding parameters. For larger datasets, you may wish to load only a subset of the shards into a single server.
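As a rough illustration of loading only a subset of shards on a single server, here is a minimal sketch. The shard directory, the `embeddings-*-of-*.npy` naming pattern, and the round-robin assignment are assumptions made for the example, not the project's actual layout or API.

```python
# Minimal sketch: assign shard files round-robin across servers so each
# server only loads ~1/num_servers of the data. File names and paths
# below are hypothetical placeholders.
import glob
import numpy as np

def load_shard_subset(shard_dir, server_index, num_servers):
    """Load only the shards assigned to this server."""
    shard_paths = sorted(glob.glob(f"{shard_dir}/embeddings-*-of-*.npy"))
    assigned = shard_paths[server_index::num_servers]  # this server's slice
    return [np.load(path) for path in assigned]

# Example: server 0 of 4 loads shards 0, 4, 8, ...
subset = load_shard_subset("/data/shards", server_index=0, num_servers=4)
```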
If the dataset reaches one hundred million records, several problems arise:

- `block.binpb`: even when there is no block data, it still has to be provided as input.

Suggestions: