The NUM_BINS constant in python/dolma/core/analyzer.py is 100k by default, and this value overflows the 10**NUM_BINS expression in FixedBucketsValTracker (see the OverflowError in the traceback below). The _make_tracker function does not use the number of bins from the config; it uses the constant value instead. I guess this is because the counts are then summarised to the correct number of bins at the end?
I have tried using InferBucketsValTracker instead, and it seems to work. However, the bins array in the results is sometimes one element longer than the counts array, which is expected if the bins array represents the edges of the bins. At other times bins and counts have the same length, so I am not sure what the bins in the final result represent.
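To make the two conventions concrete, here is a minimal pure-Python sketch. It is not dolma's implementation; the `histogram` helper and the sample values are hypothetical. With N bins, an edges array has N + 1 entries, while a one-value-per-bin array (for example bin centers) has N entries. Whether the equal-length case in the report corresponds to something like centers is only a guess.

```python
def histogram(values, num_bins):
    """Hypothetical equal-width histogram; returns (edges, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all values are equal
    # N bins are delimited by N + 1 edges.
    edges = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = histogram([0.1, 0.2, 0.2, 0.7, 0.9], num_bins=4)

# Convention 1: bins are edges, so there is one more edge than counts.
print(len(edges), len(counts))

# Convention 2 (assumed, not confirmed for dolma): one representative
# value per bin, here the midpoint, so both arrays have the same length.
centers = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
print(len(centers), len(counts))
```

If the report mixes both shapes, knowing which convention each attribute uses would be needed to plot the results correctly.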
dolma stat --attributes "mC4_da/attributes/v0tags/*.json.gz" --bins 100 --processes 12 --report v0tags_report2
attributes:
- mC4_da/attributes/v0tags/*.json.gz
bins: 100
debug: false
processes: 12
regex: null
report: v0tags_report2
seed: 0
work_dir:
  input: null
  output: null
Found 1,024 files to process
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/peter/kode/dolma_clean/python/dolma/core/parallel.py", line 174, in _process_single_and_save_status
    cls.process_single(
  File "/home/peter/kode/dolma_clean/python/dolma/core/analyzer.py", line 120, in process_single
    trackers.setdefault(f"{attr_name}/score", _make_tracker()).add(score)
  File "/home/peter/kode/dolma_clean/python/dolma/core/binning.py", line 245, in add
    k = int(m * self.n), e
            ~~^~~~~~~~
OverflowError: int too large to convert to float
"""
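The OverflowError above can be reproduced outside dolma with a few lines. This is only a sketch: `to_key` is a hypothetical stand-in for the failing line in binning.py, and the use of `math.frexp` to obtain `m` and `e` is an assumption based on the variable names in the traceback. The point is that multiplying a float by the Python integer 10**100000 forces that integer to be converted to a float, which fails because it is far beyond the largest representable double (about 1.8e308).

```python
import math

NUM_BINS = 100_000  # default value reported above for analyzer.py


def to_key(value: float, n: int):
    """Hypothetical stand-in for `k = int(m * self.n), e` in binning.py."""
    # Assumption: m and e are the mantissa and exponent of the value.
    m, e = math.frexp(value)
    # float * int converts the int to a float first; this raises
    # OverflowError when n does not fit in a double.
    return int(m * n), e


# A small multiplier works as expected.
print(to_key(0.75, 10**3))

# The 10**NUM_BINS multiplier reproduces the error from the traceback.
try:
    to_key(0.75, 10**NUM_BINS)
except OverflowError as exc:
    print("OverflowError:", exc)
```

Under these assumptions, any multiplier above roughly 10**308 would fail the same way, which is consistent with InferBucketsValTracker working where FixedBucketsValTracker does not.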