allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Dolma stat crashes because number of bins overflows python integer #70

Closed peterbjorgensen closed 11 months ago

peterbjorgensen commented 1 year ago

The NUM_BINS constant in python/dolma/core/analyzer.py is 100k by default, and with this value the 10**NUM_BINS expression in FixedBucketsValTracker produces an integer that is too large to convert to a float (see the traceback below). The _make_tracker function does not use the number of bins from the config but uses the constant value instead. I guess this is because the counts are then summarised to the correct number of bins at the end?
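A minimal sketch of the failure, independent of dolma. It assumes the failing line multiplies a float mantissa by an integer n = 10**NUM_BINS; the use of math.frexp and the names bucket_key and scaling_overflows are my own assumptions for illustration, not the actual code in binning.py.

```python
import math

NUM_BINS = 100_000  # default value reported above


def bucket_key(value: float, n: int) -> tuple[int, int]:
    """Scale the mantissa of `value` by `n`, mirroring `int(m * self.n), e`."""
    m, e = math.frexp(value)  # assumption: mantissa/exponent split of the value
    return int(m * n), e      # float * int converts n to a float first


def scaling_overflows(n: int) -> bool:
    """Return True if multiplying a float by `n` raises OverflowError."""
    try:
        bucket_key(0.5, n)
    except OverflowError:
        return True
    return False


print(bucket_key(0.75, 10**3))          # small n works
print(scaling_overflows(10**308))       # still representable as a float
print(scaling_overflows(10**309))       # just past the float range
print(scaling_overflows(10**NUM_BINS))  # the case from this report
```

Python floats top out near 1.8e308, so any n above that range fails in this sketch, no matter how small the score being added is.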

I have tried using the InferBucketsValTracker instead, and it seems to work. However, the bins array in the results is sometimes one element longer than the counts array, which is expected if the bins array represents the edges of the bins, but sometimes bins and counts have the same length, so I am not sure what the bins in the final result represent.
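To make the two cases concrete, here is a small sketch of how I am telling them apart. The helper name bins_kind and the sample arrays are hypothetical, and the labels only describe the array shapes; they say nothing about what dolma intends the bins to mean.

```python
def bins_kind(bins, counts) -> str:
    """Classify a (bins, counts) pair purely by the relation of their lengths."""
    if len(bins) == len(counts) + 1:
        # N counts with N + 1 boundaries: consistent with bin edges
        return "edges"
    if len(bins) == len(counts):
        # one value per count: could be a centre, lower bound, or upper bound
        return "per-bin values"
    return "unknown"


def edges_to_intervals(bins):
    """Pair consecutive edges into (left, right) intervals."""
    return list(zip(bins[:-1], bins[1:]))


# Hypothetical sample data, not taken from a real report
print(bins_kind([0.0, 0.5, 1.0], [3, 7]))
print(bins_kind([0.25, 0.75], [3, 7]))
print(edges_to_intervals([0.0, 0.5, 1.0]))
```

In the "edges" case each count maps to one interval from edges_to_intervals; in the equal-length case the lengths alone do not tell me which point of the bin each value refers to.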

dolma stat --attributes "mC4_da/attributes/v0tags/*.json.gz" --bins 100 --processes 12 --report v0tags_report2
attributes:
- mC4_da/attributes/v0tags/*.json.gz
bins: 100
debug: false
processes: 12
regex: null
report: v0tags_report2
seed: 0
work_dir:
  input: null
  output: null
Found 1,024 files to process
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/peter/kode/dolma_clean/python/dolma/core/parallel.py", line 174, in _process_single_and_save_status
    cls.process_single(
  File "/home/peter/kode/dolma_clean/python/dolma/core/analyzer.py", line 120, in process_single
    trackers.setdefault(f"{attr_name}/score", _make_tracker()).add(score)
  File "/home/peter/kode/dolma_clean/python/dolma/core/binning.py", line 245, in add
    k = int(m * self.n), e
            ~~^~~~~~~~
OverflowError: int too large to convert to float
"""