allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Baseline data #61

Open IanMagnusson opened 1 year ago

IanMagnusson commented 1 year ago

Working on creating data from baseline datasets with dolma v1.5-style decontamination. Progress so far is in the comments below.

IanMagnusson commented 1 year ago

Process so far

Setup Environment

Auth s3

  1. Clone dolma on the soldni/warc branch (see the sketch just below this list).
  2. Follow the instructions for building Dolma, but use maturin build -r and then pip install the wheel from target/wheels/ instead of make develop; those steps are spelled out below.
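
Concretely, the clone looks something like this (a sketch; the aws sts call is just one way to confirm which AWS credentials the s3 commands below will use):

aws sts get-caller-identity
git clone https://github.com/allenai/dolma.git
cd dolma
git checkout soldni/warc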

Create a conda environment with Python >= 3.8; in this case, we use Python 3.10.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build the Rust extension with maturin and install the resulting wheel:

maturin build -r 
pip install target/wheels/dolma-0.9.0-*.whl
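
As a quick sanity check that the right build is installed (the dolma entrypoint is what every command below uses):

pip show dolma
dolma --help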

Decon

Follow the steps in this readme to decontaminate

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
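
The script itself isn't reproduced here, but the step presumably casts each document's id to the string type the rest of the pipeline expects. An illustrative jq version of that transformation on a single hypothetical file (not the actual fix_ids_type.py, which rewrites the files in place):

zcat some_shard.jsonl.gz | jq -c '.id = (.id | tostring)' | gzip > some_shard.fixed.jsonl.gz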

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
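
To spot-check what the taggers produced, the attribute files can be inspected with the same zcat/jq pattern used for the retention checks later in this thread (the attributes/ layout mirroring documents/ is an assumption about where dolma tag writes its output):

zcat ${HOME}/perplexity/v2_small/attributes/uniseg_length_paragraphs_with_empty_v1/*/*/*.gz | head -n 1 | jq .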

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now use new configs to tag contamination in baseline data

dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe
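
For reference, a rough sketch of what a decontamination config like this might contain, using key names from dolma's deduper config format. Only the document glob, the attribute name, the bloom filter path, and size_in_bytes are taken from elsewhere in this thread; the rest is illustrative, and the real configs/baselines/decontamination/falcon-refinedweb.yaml may differ:

documents:
  - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*.jsonl.gz
dedupe:
  name: perplexity_suite_v3_option2
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true
bloom_filter:
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
  read_only: true
  size_in_bytes: 33554432
  estimated_doc_count: 1000000          # illustrative
  desired_false_positive_rate: 0.00001  # illustrative
processes: 188                          # illustrative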

Note: we got errors like the following during the previous command, but Rodney assures us this is fine:

[2023-10-13T01:20:49Z ERROR dolma::deduper] Failed to process "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/262.jsonl.gz": streaming error
[2023-10-13T01:20:49Z ERROR dolma::deduper] Failed to process "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/25.jsonl.gz": streaming error

Run the Mixer to remove contamination

We follow section 3 here, as well as the existing dolma v1.5 configs (such as the books one), and make new configs for mixing these datasets.

dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224
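
For context, a sketch of roughly what such a mixing config might look like, following dolma's mixer format: each stream lists documents, loads the decontamination attributes, and replaces flagged paragraph spans with empty strings (the mixer can also drop whole documents via a filter instead). The output path, max size, and min_score here are illustrative; the real configs/baselines/mixing/falcon-refinedweb.json may differ:

{
  "streams": [
    {
      "name": "falcon-refinedweb-decon",
      "documents": ["s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*.jsonl.gz"],
      "output": {
        "path": "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0_decon_ppl_suite_v3/documents",
        "max_size_in_bytes": 4294967296
      },
      "attributes": ["perplexity_suite_v3_option2"],
      "span_replacement": [
        {
          "span": "$.attributes.bff_duplicate_paragraph_spans_decontamination",
          "min_score": 0.5,
          "replacement": ""
        }
      ]
    }
  ],
  "processes": 224
}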

Unfortunately, it seems this currently removes all of the data except empty strings and newlines. The log from the last command looks like this:

...
[2023-10-20T03:38:52Z INFO  dolma::shard] Dropped 1935962 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/97.jsonl.gz
[2023-10-20T03:38:54Z INFO  dolma::shard] Dropped 1935962 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/96.jsonl.gz
[2023-10-20T03:39:09Z INFO  dolma::shard] Dropped 1935970 of 1936000 documents from s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/58.jsonl.gz
[2023-10-20T03:39:09Z INFO  dolma::mixer] Done!

IanMagnusson commented 1 year ago

To fix the issue with all the data getting removed by the decon, we tried deleting the bloom filter in s3 before rerunning, since the existing filter gets read in and added to rather than started fresh. It is unclear why this should change the filter (the data it is built from should be identical) unless something is causing the bloom filter indexing to shift such that the old filter hashes entries differently.

aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

and then we tried rerunning everything from this step onward:

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

However, this still had the same issue of removing almost everything in the dedupe.

IanMagnusson commented 1 year ago

Tried this approach again, but this time restarting from the step below, where the eval data used to build the bloom filter is created, after first removing the output directory for that step in case the way the bloom filter creation step adds attributes to it is the problem:

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Additionally, we changed the bloom filter byte size in configs/baselines/decontamination/falcon-refinedweb.yaml to reflect the value actually reported during bloom filter creation (i.e. size_in_bytes: 33554432).

After this I am unfortunately still seeing nearly all documents being removed.

IanMagnusson commented 1 year ago

I tried something to get more information for debugging the decon issues: I ran the decon pipeline using a saved copy of the option 1 bloom filter that I hadn't accidentally overwritten, so that bloom filter should have been created correctly. However, when I run it on Falcon it removes almost all documents, the same way as when I remade the bloom filter. This implies to me that the issue isn't with the bloom filter creation but rather with how we're using it.

soldni commented 1 year ago

Issues should have been fixed with #66.

IanMagnusson commented 1 year ago

Starting over from the top now with new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

Setup Environment

Create a conda environment with Python >= 3.8; in this case, we use Python 3.10.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build the Rust extension with maturin and install the resulting wheel:

maturin build -r 
pip install target/wheels/dolma-0.9.0-*.whl

Decon

Follow the steps in this readme to decontaminate

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's do this with Pile since we want to train on it first. So we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

Unfortunately this still results in near total removal:

[2023-10-26T17:13:50Z INFO  dolma::shard] Dropped 1403904 of 1403954 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/08_0.json.gz              
[2023-10-26T17:13:52Z INFO  dolma::shard] Dropped 1404592 of 1404658 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/02_0.json.gz              
[2023-10-26T17:13:56Z INFO  dolma::shard] Dropped 1402981 of 1404511 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_4.json.gz              
[2023-10-26T17:13:57Z INFO  dolma::shard] Dropped 1403542 of 1403597 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/28_1.json.gz              
[2023-10-26T17:14:04Z INFO  dolma::shard] Dropped 1403859 of 1404028 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/21_3.json.gz 

Overall we have only 145725 / 210607728 = 0.0006919261766 of documents retained.

IanMagnusson commented 1 year ago

Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, now with the right wheels.

Starting over from the top now with new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
rm -r ~/perplexity/*
rm target/wheels/*
rm -r /mnt/tank/dolma_tmp/pile_*
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0/attributes/perplexity_suite_v3_option2/
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/

Setup Environment

Create a conda environment with Python >= 3.8; in this case, we use Python 3.10.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build the Rust extension with maturin and install the resulting wheel:

maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Decon

Follow the steps in this readme to decontaminate

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's do this with Pile since we want to train on it first. So we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

This initially errored out like this:

[2023-10-26T18:24:37Z INFO  dolma::shard] Dropped 38520 of 1404145 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/07_0.json.gz                
[2023-10-26T18:30:51Z ERROR dolma::mixer] 1 shards failed to process.                                                                                                       
Traceback (most recent call last):                                                                                                                                          
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 25, in mixer                                                       
    _dolma.mixer_entrypoint(json.dumps(config))                                                                                                                             
RuntimeError: Failed with 1 errors

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/mixer.py", line 141, in run
    mixer(dict_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 27, in mixer
    raise DolmaRustPipelineError(f"Error running mixer: {e}") from e
dolma.core.errors.DolmaRustPipelineError: Error running mixer: Failed with 1 errors

Rerunning the command didn't seem to reuse any of the already completed results, but it did finish without errors this time.

Removal is more moderate this time, though surprisingly consistent from file to file:

[2023-10-26T18:42:13Z INFO  dolma::shard] Dropped 38466 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_2.json.gz
[2023-10-26T18:42:16Z INFO  dolma::shard] Dropped 38337 of 1403669 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/13_3.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38748 of 1404080 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/03_1.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38472 of 1403675 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/12_4.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38918 of 1403475 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/15_1.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38708 of 1404626 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/10_4.json.gz
[2023-10-26T18:42:20Z INFO  dolma::shard] Dropped 38391 of 1403446 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/05_2.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38592 of 1404508 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_3.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38782 of 1404000 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/16_2.json.gz
[2023-10-26T18:42:30Z INFO  dolma::shard] Dropped 38647 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_3.json.gz

Overall we have 204809882 / 210607728 = 0.9724708772 of documents retained.

IanMagnusson commented 1 year ago

Next we're trying to tokenize

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

But this gets the following error:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/tokenizer.py", line 103, in run
    tokenize_in_parallel(
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Luca says to just remove the offending line. So we rebuild after removing:

dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
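
One way to apply that removal before rebuilding, as a sketch (the source path inside the repo checkout is an assumption; commenting the call out by hand works just as well):

grep -rn 'set_start_method("spawn")' python/dolma/tokenizer/executor.py
sed -i 's/multiprocessing\.set_start_method("spawn")/pass  # spawn start method removed, context already set/' python/dolma/tokenizer/executor.py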

Rebuild env

conda create -n dolma-baselines-fixed python=3.10
conda activate dolma-baselines-fixed
rm target/wheels/dolma-0.9.1-*.whl
maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Then try again:

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

This works and we upload the results to s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special/

IanMagnusson commented 1 year ago

Now applying all this to RedPajama we get:

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | awk '{sum += $1} END {print sum}'

900799243 / 901687943 = 0.999014404 documents retained

And tokenize

dolma -c configs/baselines/tokenization/redpajama.yaml tokens

IanMagnusson commented 12 months ago

And now falcon:

decon

dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe

mix

dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224

check doc removal

aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | awk '{sum += $1} END {print sum}'

912114192 / 918848690 = 0.9926707214 docs retained

Tokenize

dolma -c configs/baselines/tokenization/falcon-refinedweb.yaml tokens

IanMagnusson commented 11 months ago

We're redoing Pile tokenization now because of a bug when tokenizing with more parallel processes than there are files in the dataset. We push a new config and run:

dolma -c configs/baselines/tokenization/pile.yaml tokens

resulting in:

dolma -c configs/baselines/tokenization/pile.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3_fixed/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 150
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_input
  output: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_output
memmaps: 300m [2:45:43, 33.1s/m]
tokens: 307Gt [2:45:43, 30.9Mt/s]
documents: 205Md [2:45:43, 20.6kd/s]
files: 150f [2:45:43, 66.3s/f]

IanMagnusson commented 11 months ago

Now let's do c4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

This data is already deconned for Dolma, so we go right to checking removal:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | awk '{sum += $1} END {print sum}'

364156258 / 364156258 = 100% documents retained

This seems unlikely so we try deconning again.

dolma -c configs/baselines/decontamination/c4.yaml dedupe

check again:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2_redo/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | awk '{sum += $1} END {print sum}'

364121142 / 364156258 = 0.9999035689 doc retention rate

Mix

dolma -c configs/baselines/mixing/c4.json mix --processes 224

check the number of files to make sure it's > 224 (the number of CPUs on this machine)

aws s3 ls s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/ | grep .json.gz | wc -l

496 files

Tokenize

dolma -c configs/baselines/tokenization/c4.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/c4/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 224
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/c4_input_tokenized
  output: /mnt/tank/dolma_tmp/c4_output_tokenized
memmaps: 233m [1:17:23, 19.9s/m]
tokens: 174Gt [1:17:23, 37.6Mt/s]
documents: 364Md [1:17:23, 78.4kd/s]
files: 496f [1:17:23, 9.36s/f]

IanMagnusson commented 11 months ago

Now mc4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

dedup

dolma -c configs/baselines/decontamination/mc4.yaml dedupe

Check removal

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz  | awk '{sum += $1} END {print sum}'

3928652800 / 3928733374 = 0.9999794911 documents retained

Mix

dolma -c configs/baselines/mixing/mc4.json mix --processes 224

Tokenize

dolma -c configs/baselines/tokenization/mc4.yaml tokens

IanMagnusson commented 11 months ago

Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it needs the code at main as of commit afab18c9b4f48be9a4df27552afb79b6e2a2a745.

conda create -n dolma-main-latest python=3.10
conda activate dolma-main-latest
mv target/wheels/ target/wheels_bak
make setup
maturin build -r
pip install target/wheels/dolma-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl

Then tokenize:

dolma -c configs/baselines/tokenization/dolma_v1_5_cc_only.yaml tokens