allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Mixer validator #215

Closed: mariia-iureva closed this 1 week ago

mariia-iureva commented 2 weeks ago

This validator ensures that the Mixer job configuration is correct, data is properly aligned, and filters are valid before starting the actual Mixer job.

Configuration Validation

S3 Path and Permission Validation

Stream Filter Validation

Document and Attribute Alignment

Filter Execution Simulation

Attribute Name Validation

File Sampling and Analysis

Reporting and Logging

Error Handling and Cleanup
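For illustration, two of the checks listed above (configuration validation and document/attribute alignment) could be sketched roughly as follows. This is not the code from this PR. The function names, the required keys (`name`, `documents`, `output`), and the assumption that documents and attributes are JSONL rows sharing an `id` field are all hypothetical choices made for the example.

```python
import json
from typing import Iterable, List

# Assumed set of keys each stream entry must carry (hypothetical, for illustration).
REQUIRED_STREAM_KEYS = ("name", "documents", "output")


def validate_config(config: dict) -> List[str]:
    """Return a list of problems found in the config; an empty list means it passed."""
    streams = config.get("streams")
    if not isinstance(streams, list) or not streams:
        return ["config must contain a non-empty 'streams' list"]

    errors: List[str] = []
    for i, stream in enumerate(streams):
        if not isinstance(stream, dict):
            errors.append(f"streams[{i}] must be a mapping")
            continue
        for key in REQUIRED_STREAM_KEYS:
            if key not in stream:
                errors.append(f"streams[{i}] is missing '{key}'")
    return errors


def check_alignment(doc_lines: Iterable[str], attr_lines: Iterable[str]) -> List[str]:
    """Compare the 'id' of paired JSONL rows from a document file and an attribute file."""
    docs = list(doc_lines)
    attrs = list(attr_lines)

    errors: List[str] = []
    if len(docs) != len(attrs):
        errors.append(
            f"line count mismatch: {len(docs)} documents vs {len(attrs)} attribute rows"
        )
    # zip stops at the shorter input, so only rows present in both are compared.
    for n, (doc_row, attr_row) in enumerate(zip(docs, attrs), start=1):
        doc_id = json.loads(doc_row).get("id")
        attr_id = json.loads(attr_row).get("id")
        if doc_id != attr_id:
            errors.append(f"line {n}: document id {doc_id!r} != attribute id {attr_id!r}")
    return errors
```

Returning a list of messages instead of raising on the first failure lets a validator report every problem in one pass, which fits the goal of catching issues before the actual Mixer job starts.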

Whattabatt commented 2 weeks ago

You're going to need to add this file's dependencies to pyproject.toml in order for it to run in a clean environment

mariia-iureva commented 2 weeks ago

> You're going to need to add this file's dependencies to pyproject.toml in order for it to run in a clean environment

Addressed this one and added dependencies

Whattabatt commented 2 weeks ago

The warnings produce a lot of noise, it'd be good to accept a 'verbose' flag and only log the warnings if it's set to true.

mariia-iureva commented 2 weeks ago

> The warnings produce a lot of noise, it'd be good to accept a 'verbose' flag and only log the warnings if it's set to true.

Added a --verbose flag and put most of the print statements behind it
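As a rough sketch of the pattern discussed here (not the code in this PR), a verbose flag can gate warnings through the standard `argparse` and `logging` modules. The logger name `mixer_validator`, the positional `config` argument, and the helper names are assumptions made for the example.

```python
import argparse
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("mixer_validator")


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an optional --verbose switch."""
    parser = argparse.ArgumentParser(description="Validate a Mixer config (illustrative).")
    parser.add_argument("config", help="Path to the Mixer config file.")
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Also log warnings and progress details.",
    )
    return parser


def configure_logging(verbose: bool) -> None:
    """Show only errors by default; include warnings and info when verbose is set."""
    logger.setLevel(logging.INFO if verbose else logging.ERROR)


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    logging.basicConfig(format="%(levelname)s: %(message)s")
    configure_logging(args.verbose)

    # Emitted only when --verbose is passed.
    logger.warning("attribute file sampled, not fully scanned: %s", args.config)
    # Always emitted, regardless of the flag.
    logger.error("example error that is shown in both modes")
    return 0
```

Calling `main(["mixer.yaml"])` would print only the error line, while `main(["mixer.yaml", "--verbose"])` would print the warning as well. Routing messages through a logger level, instead of wrapping each print statement in an `if verbose:` check, keeps the gating in one place.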