IanMagnusson opened 1 year ago
Auth S3.
Install dolma on the soldni/warc branch, using
maturin build -r
pip install dolma/target/wheels/dolma-*.whl
instead of make develop:
Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and use Anaconda to create the environment.
conda create -n dolma-baselines python=3.10
After creating the environment, activate it and install necessary tools using the included makefile.
conda activate dolma-baselines
make setup
and restart your shell. Finally, to begin development, install the repository in editable mode using maturin.
maturin build -r
pip install target/wheels/dolma-0.9.0-*.whl
Follow the steps in this readme to decontaminate
Step 1.1: copy data locally
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents
Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)
python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
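For reference, a minimal sketch of what an ID-type fix like this might do (an assumption about fix_ids_type.py's behavior, not its actual contents): cast every document's "id" field to a string in each gzipped JSONL file passed on the command line.

import gzip
import json
import sys

# Hypothetical sketch: rewrite each gzipped JSONL file so the "id"
# field is always a string rather than an integer.
for path in sys.argv[1:]:
    with gzip.open(path, "rt") as f:
        docs = [json.loads(line) for line in f]
    for doc in docs:
        doc["id"] = str(doc["id"])  # e.g. 12345 -> "12345"
    with gzip.open(path, "wt") as f:
        f.writelines(json.dumps(doc) + "\n" for doc in docs)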
Step 1.2: tag out paragraphs by uniseg length
dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
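As a rough illustration of what measuring paragraph length with uniseg means here (a sketch, not the tagger's actual implementation), word segments can be counted with the uniseg package:

from uniseg.wordbreak import words

# Count Unicode word segments per paragraph, the kind of length
# measure a uniseg-based tagger computes before length filtering.
text = "First paragraph.\nSecond, slightly longer paragraph."
for paragraph in text.split("\n"):
    n_words = sum(1 for w in words(paragraph) if w.strip())
    print(repr(paragraph), n_words)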
Step 1.3: filter out paragraphs that are too short
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix
Step 1.4: create bloom filter
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe
dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe
Note: we got errors like the following during the previous command, but Rodney assures us this is fine:
[2023-10-13T01:20:49Z ERROR dolma::deduper] Failed to process "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/262.jsonl.gz": streaming error
[2023-10-13T01:20:49Z ERROR dolma::deduper] Failed to process "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/25.jsonl.gz": streaming error
We follow section 3 here, as well as the existing dolma v1.5 configs (such as the one for books), and make new configs for mixing these datasets.
dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224
Unfortunately, it seems this currently removes all of the data except for empty strings and newlines. The log from the last command looks like this:
...
[2023-10-20T03:38:52Z INFO dolma::shard] Dropped 1935962 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/97.jsonl.gz
[2023-10-20T03:38:54Z INFO dolma::shard] Dropped 1935962 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/96.jsonl.gz
[2023-10-20T03:39:09Z INFO dolma::shard] Dropped 1935970 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/58.jsonl.gz
[2023-10-20T03:39:09Z INFO dolma::mixer] Done!
To fix the issue with all the data getting removed by the decon, we tried deleting the bloom filter in S3 before rerunning, as the existing filter gets read in and added to rather than started fresh. It is unclear why this should change the filter (the data it's being run on should be identical), unless something is causing the bloom filter indexing to shift such that the old filter is hashed differently.
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
and then we tried rerunning everything after this step:
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe
However, this still had the same issue of removing almost everything in the dedup.
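As an aside on the hypothesis above: re-adding identical data to a bloom filter is a no-op, since it only re-sets bits that are already set, so rebuilding from the same data should not change anything. A toy hand-rolled demonstration (not dolma's implementation):

import hashlib

def bloom_add(bits, item, k=4):
    # Set k bit positions derived from hashes of the item.
    for i in range(k):
        h = int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
        bits[h % len(bits)] = 1

bits = [0] * 1024
for item in ("doc a", "doc b"):
    bloom_add(bits, item)
snapshot = list(bits)
for item in ("doc a", "doc b"):  # add the exact same items again
    bloom_add(bits, item)
assert bits == snapshot  # filter is byte-for-byte unchanged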
We tried this approach again, restarting from the step below where the eval data used to build the bloom filter is created, after first removing its output directory in case something about how the bloom filter creation step adds attributes to it is the problem:
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix
Additionally, we changed the bloom filter byte size in configs/baselines/decontamination/falcon-refinedweb.yaml to actually reflect the value reported during bloom filter creation (i.e., size_in_bytes: 33554432).
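As a sanity check on that size (a back-of-the-envelope estimate, not dolma's exact accounting): for a filter of m bits holding n items with k hash functions, the classic false-positive rate is p ≈ (1 − e^(−kn/m))^k. With size_in_bytes: 33554432 and assumed values for n and k:

import math

m = 33554432 * 8   # bits in the filter (size_in_bytes * 8)
n = 10_000_000     # assumed number of inserted items; the real count differs
k = 4              # assumed number of hash functions

p = (1 - math.exp(-k * n / m)) ** k
print(f"estimated false-positive rate: {p:.2e}")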
After this, I am unfortunately still seeing the behavior where nearly all documents are removed.
To get some more info for debugging the decon issues, I tried something I just thought of: running the decon pipeline using a saved copy of the option 1 bloom filter that I hadn't accidentally overwritten, so that filter should have been created correctly. However, when I run it on Falcon, it starts removing almost all documents the same way as when I remade the bloom filter. This implies to me that the issue isn't with the bloom filter creation but rather with how we're using it.
Issues should have been fixed with #66.
Starting over from the top now with new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):
conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and use Anaconda to create the environment.
conda create -n dolma-baselines python=3.10
After creating the environment, activate it and install necessary tools using the included makefile.
conda activate dolma-baselines
make setup
and restart your shell. Finally, to begin development, install the repository in editable mode using maturin.
maturin build -r
pip install target/wheels/dolma-0.9.0-*.whl
Follow the steps in this readme to decontaminate
Step 1.1: copy data locally
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents
Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)
python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
Step 1.2: tag out paragraphs by uniseg length
dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
Step 1.3: filter out paragraphs that are too short
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix
Step 1.4: create bloom filter
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe
Now let's do this with Pile since we want to train on it first. So we mark contamination:
dolma -c configs/baselines/decontamination/pile.yaml dedupe
Then we remove contamination:
dolma -c configs/baselines/mixing/pile.json mix --processes 224
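For context, the mix step conceptually reads the attribute spans that dedupe wrote and cuts them out of each document, dropping documents with nothing left. A simplified sketch of that logic (assuming (start, end, score) spans; not the actual Rust mixer):

def filter_document(doc, spans):
    # Remove flagged character spans; drop the document entirely if
    # nothing non-whitespace survives. Spans are (start, end, score).
    text = doc["text"]
    pieces, cursor = [], 0
    for start, end, _score in sorted(spans):
        pieces.append(text[cursor:start])
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    cleaned = "".join(pieces)
    return {**doc, "text": cleaned} if cleaned.strip() else None

doc = {"id": "1", "text": "keep this REMOVE THIS keep that"}
print(filter_document(doc, [(10, 22, 1.0)]))
# -> {'id': '1', 'text': 'keep this keep that'}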
Unfortunately, this still results in near-total removal:
[2023-10-26T17:13:50Z INFO dolma::shard] Dropped 1403904 of 1403954 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/08_0.json.gz
[2023-10-26T17:13:52Z INFO dolma::shard] Dropped 1404592 of 1404658 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/02_0.json.gz
[2023-10-26T17:13:56Z INFO dolma::shard] Dropped 1402981 of 1404511 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_4.json.gz
[2023-10-26T17:13:57Z INFO dolma::shard] Dropped 1403542 of 1403597 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/28_1.json.gz
[2023-10-26T17:14:04Z INFO dolma::shard] Dropped 1403859 of 1404028 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/21_3.json.gz
Overall we retain only 145725 / 210607728 = 0.0006919261766 of documents.
Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, now with the right wheels.
Starting over from the top now with new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):
conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
rm -r ~/perplexity/*
rm target/wheels/*
rm -r /mnt/tank/dolma_tmp/pile_*
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0/attributes/perplexity_suite_v3_option2/
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/
Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and use Anaconda to create the environment.
conda create -n dolma-baselines python=3.10
After creating the environment, activate it and install necessary tools using the included makefile.
conda activate dolma-baselines
make setup
and restart your shell. Finally, to begin development, install the repository in editable mode using maturin.
maturin build -r
pip install target/wheels/dolma-0.9.1-*.whl
Follow the steps in this readme to decontaminate
Step 1.1: copy data locally
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents
Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)
python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
Step 1.2: tag out paragraphs by uniseg length
dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
Step 1.3: filter out paragraphs that are too short
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix
Step 1.4: create bloom filter
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe
Now let's do this with Pile since we want to train on it first. So we mark contamination:
dolma -c configs/baselines/decontamination/pile.yaml dedupe
Then we remove contamination:
dolma -c configs/baselines/mixing/pile.json mix --processes 224
This initially errored out like this:
[2023-10-26T18:24:37Z INFO dolma::shard] Dropped 38520 of 1404145 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/07_0.json.gz
[2023-10-26T18:30:51Z ERROR dolma::mixer] 1 shards failed to process.
Traceback (most recent call last):
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 25, in mixer
_dolma.mixer_entrypoint(json.dumps(config))
RuntimeError: Failed with 1 errors
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
sys.exit(main())
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
return cls.run(parsed_config)
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/mixer.py", line 141, in run
mixer(dict_config)
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 27, in mixer
raise DolmaRustPipelineError(f"Error running mixer: {e}") from e
dolma.core.errors.DolmaRustPipelineError: Error running mixer: Failed with 1 errors
Rerunning the command didn't seem to reuse any of the already completed results, but it did finish without errors this time.
Removal is more moderate this time, though surprisingly consistent from file to file:
[2023-10-26T18:42:13Z INFO dolma::shard] Dropped 38466 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_2.json.gz
[2023-10-26T18:42:16Z INFO dolma::shard] Dropped 38337 of 1403669 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/13_3.json.gz
[2023-10-26T18:42:17Z INFO dolma::shard] Dropped 38748 of 1404080 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/03_1.json.gz
[2023-10-26T18:42:17Z INFO dolma::shard] Dropped 38472 of 1403675 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/12_4.json.gz
[2023-10-26T18:42:18Z INFO dolma::shard] Dropped 38918 of 1403475 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/15_1.json.gz
[2023-10-26T18:42:18Z INFO dolma::shard] Dropped 38708 of 1404626 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/10_4.json.gz
[2023-10-26T18:42:20Z INFO dolma::shard] Dropped 38391 of 1403446 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/05_2.json.gz
[2023-10-26T18:42:21Z INFO dolma::shard] Dropped 38592 of 1404508 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_3.json.gz
[2023-10-26T18:42:21Z INFO dolma::shard] Dropped 38782 of 1404000 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/16_2.json.gz
[2023-10-26T18:42:30Z INFO dolma::shard] Dropped 38647 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_3.json.gz
Overall we now retain 204809882 / 210607728 = 0.9724708772 of documents.
Next we're trying to tokenize:
dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920
But this gets the following error:
Traceback (most recent call last):
File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
sys.exit(main())
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
return cls.run(parsed_config)
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/tokenizer.py", line 103, in run
tokenize_in_parallel(
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
multiprocessing.set_start_method("spawn")
File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
raise RuntimeError('context has already been set')
RuntimeError: context has already been set
Luca says to just remove the offending line. So we rebuild after removing:
dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
multiprocessing.set_start_method("spawn")
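A more defensive version of that line, instead of deleting it outright, would be to set the start method only when none has been chosen yet (a sketch of an alternative fix, not what was actually committed):

import multiprocessing

# Avoid RuntimeError("context has already been set") by only forcing
# "spawn" when no start method has been configured yet.
if multiprocessing.get_start_method(allow_none=True) is None:
    multiprocessing.set_start_method("spawn")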
Rebuild env
conda create -n dolma-baselines-fixed python=3.10
conda activate dolma-baselines-fixed
rm target/wheels/dolma-0.9.1-*.whl
maturin build -r
pip install target/wheels/dolma-0.9.1-*.whl
Then try again:
dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920
This works and we upload the results to s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special/
Now applying all this to RedPajama we get:
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | awk '{sum += $1} END {print sum}'
900799243 / 901687943 = 0.999014404 documents retained
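For clarity, here is the same retained/total computation in Python (equivalent in spirit to the jq pipeline above; a document counts as retained when its bff_duplicate_paragraph_spans_decontamination attribute is an empty list):

import glob
import gzip
import json

retained = total = 0
pattern = "/mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz"
for path in glob.glob(pattern):
    with gzip.open(path, "rt") as f:
        for line in f:
            total += 1
            attrs = json.loads(line)["attributes"]
            if attrs.get("bff_duplicate_paragraph_spans_decontamination") == []:
                retained += 1  # no contaminated spans -> document kept
print(retained, total, retained / total)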
And tokenize
dolma -c configs/baselines/tokenization/redpajama.yaml tokens
And now falcon:
decon
dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe
mix
dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224
check doc removal
aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | awk '{sum += $1} END {print sum}'
912114192 / 918848690 = 0.9926707214 docs retained
Tokenize
dolma -c configs/baselines/tokenization/falcon-refinedweb.yaml tokens
We're redoing Pile tokenization now because of a bug when tokenizing with more parallel processes than files in the dataset. We push a new config and run:
dolma -c configs/baselines/tokenization/pile.yaml tokens
resulting in:
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3_fixed/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 150
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
input: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_input
output: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_output
memmaps: 300m [2:45:43, 33.1s/m]
tokens: 307Gt [2:45:43, 30.9Mt/s]
documents: 205Md [2:45:43, 20.6kd/s]
files: 150f [2:45:43, 66.3s/f]
Now let's do c4:
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
This data is already deconned for Dolma, so we go straight to checking removal:
aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | awk '{sum += $1} END {print sum}'
364156258 / 364156258 = 100% of documents retained
This seems unlikely, so we try deconning again.
dolma -c configs/baselines/decontamination/c4.yaml dedupe
check again:
aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2_redo/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | awk '{sum += $1} END {print sum}'
364121142 / 364156258 = 0.9999035689 doc retention rate
Mix
dolma -c configs/baselines/mixing/c4.json mix --processes 224
check the number of files to make sure it's > 224 (the number of CPUs on this machine)
aws s3 ls s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/ | grep .json.gz | wc -l
496 files
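The reason for this check (per the bug above) is that each file is handled by a single worker, so utilization depends on having at least as many files as processes. A toy round-robin illustration of the assignment (an illustration, not dolma's actual scheduler):

# Round-robin assignment of 496 shard files to 224 worker processes.
files = [f"{i:03d}.json.gz" for i in range(496)]
processes = 224
assignment = {p: files[p::processes] for p in range(processes)}
# Every worker gets at least floor(496 / 224) = 2 files to process.
print(min(len(v) for v in assignment.values()))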
Tokenize
dolma -c configs/baselines/tokenization/c4.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/c4/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 224
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
input: /mnt/tank/dolma_tmp/c4_input_tokenized
output: /mnt/tank/dolma_tmp/c4_output_tokenized
memmaps: 233m [1:17:23, 19.9s/m]
tokens: 174Gt [1:17:23, 37.6Mt/s]
documents: 364Md [1:17:23, 78.4kd/s]
files: 496f [1:17:23, 9.36s/f]
Now mc4:
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
dedup
dolma -c configs/baselines/decontamination/mc4.yaml dedupe
Check removal
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz | awk '{sum += $1} END {print sum}'
3928652800 / 3928733374 = 0.9999794911 docs retained
Mix
dolma -c configs/baselines/mixing/mc4.json mix --processes 224
Tokenize
dolma -c configs/baselines/tokenization/mc4.yaml tokens
Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it requires the code at main as of commit afab18c9b4f48be9a4df27552afb79b6e2a2a745:
conda create -n dolma-main-latest python=3.10
conda activate dolma-main-latest
mv target/wheels/ target/wheels_bak
make setup
maturin build -r
pip install target/wheels/dolma-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl
Then tokenize:
dolma -c configs/baselines/tokenization/dolma_v1_5_cc_only.yaml tokens
Working on creating data with dolma v1.5-style decontamination from baseline datasets. Progress so far is commented below.