cognitivecomputations / laserRMT

This is our own implementation of 'Layer-Selective Rank Reduction'.
Apache License 2.0

allen c4 dataset problem (using rmt_laser.py) #12

Closed: Undi95 closed this issue 5 months ago

Undi95 commented 7 months ago
Traceback (most recent call last):
  File "/workspace/laserRMT/rmt_laser.py", line 199, in <module>
    loop_check, min_loss = modifier.search_optimal_layer_modification(layer_types=['mlp.down_proj', 'mlp.up_proj', 'self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj'],
  File "/workspace/laserRMT/rmt_laser.py", line 132, in search_optimal_layer_modification
    initial_perplexity = self.calculate_model_perplexity()
  File "/workspace/laserRMT/rmt_laser.py", line 101, in calculate_model_perplexity
    input_tok = gptq_data_utils.get_test_tokens(dataset, seed=0, seqlen=seqlen, model=model_str)
  File "/workspace/laserRMT/lib/utils/gptq_data_utils.py", line 196, in get_test_tokens
    return get_c4_new(train_samples, seed, seqlen, model)[1].input_ids
  File "/workspace/laserRMT/lib/utils/gptq_data_utils.py", line 134, in get_c4_new
    traindata = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 539, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']

Can you fix it? It's impossible to use the laserRMT method, and I don't know what I need to modify in the lib/utils folder.

I see that allenai--c4 is called here in gptq_data_utils:

@lru_cache
def get_c4(nsamples, seed, seqlen, model):
    from datasets import load_dataset
    traindata = load_dataset(
        'allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train'
    )
    valdata = load_dataset(
        'allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation'
    )

But I don't know what I need to modify to fix it.

fernando-neto-ai commented 7 months ago

It seems to be a problem on AllenAI's side, not in our code; it has stopped working somehow from their end. I suggest you comment out the c4 dataset in the perplexity calculation. I'm reluctant to change it, considering they might get it working again.
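
For anyone applying that workaround, a minimal sketch of what it might look like, assuming the dataset list sits inside calculate_model_perplexity() in rmt_laser.py (the actual variable name and contents may differ; 'wikitext2' and 'c4' are taken from the log later in this thread):

# Hypothetical excerpt from calculate_model_perplexity() in rmt_laser.py;
# dropping 'c4' skips the broken allenai/c4 loader entirely.
datasets = ['wikitext2']          # 'c4' temporarily commented out
# datasets = ['wikitext2', 'c4']  # original list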

l4b4r4b4b4 commented 7 months ago

The following should work. However, it's a huge dataset, and the current implementation loads and tokenizes the training set, which isn't even used... Will do a PR in a sec that fixes this!

data_files = {"validation": "en/c4-validation.00000-of-00008.json.gz"}
val_data = load_dataset("allenai/c4", data_files=data_files, split="validation")

Edit: @fernando-indrema added it to this PR.
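
For reference, the corresponding patch inside lib/utils/gptq_data_utils.py could look roughly like the sketch below: it drops the obsolete 'allenai--c4' config argument (the BuilderConfig the ValueError complains about) and loads only the validation shard, since the training split is never used. The tokenization step that follows in the real function is elided; treat this as a sketch, not the exact merged code.

from functools import lru_cache

@lru_cache
def get_c4_new(nsamples, seed, seqlen, model):
    from datasets import load_dataset
    # The second positional argument ('allenai--c4') named a BuilderConfig that
    # no longer exists on the Hub. Passing only data_files plus the split
    # selects the single validation shard that is actually needed.
    valdata = load_dataset(
        'allenai/c4',
        data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
        split='validation',
    )
    # ... tokenize valdata with the model's tokenizer as before and return it ...
    return valdata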

l4b4r4b4b4 commented 7 months ago

This should be fixed with the merged PR. Can you confirm @Undi95 ?

Undi95 commented 7 months ago

Going to try right now and tell you! EDIT: I let the tool run for 6h with a 70B and it was stuck here with no CPU/GPU usage (but VRAM full):

Loading checkpoint shards: 100%|████████████████| 29/29 [01:16<00:00,  2.63s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Reconstructing layer: model.layers.25.mlp.down_proj
Reduced from torch.Size([8192]) to 7113
Layer mlp.down_proj_25 has already been modified. Skipping.
Restored original weights for layer: model.layers.25.mlp.down_proj
Reconstructing layer: model.layers.25.mlp.down_proj
Reduced from torch.Size([8192]) to 7113
Restored original weights for layer: model.layers.25.mlp.down_proj
['.79.', '.78.', '.77.', '.76.', '.75.', '.74.', '.73.', '.72.', '.71.', '.70.', '.69.', '.68.', '.67.', '.66.', '.65.', '.64.', '.63.', '.62.', '.61.', '.60.', '.59.', '.58.', '.57.', '.56.', '.55.', '.54.', '.53.', '.52.', '.51.', '.50.', '.49.', '.48.', '.47.', '.46.', '.45.', '.44.', '.43.', '.42.', '.41.', '.40.', '.39.', '.38.', '.37.', '.36.', '.35.', '.34.', '.33.', '.32.', '.31.', '.30.', '.29.', '.28.', '.27.', '.26.', '.25.', '.24.', '.23.', '.22.', '.21.', '.20.', '.19.', '.18.', '.17.', '.16.', '.15.', '.14.', '.13.', '.12.', '.11.', '.10.', '.9.', '.8.', '.7.', '.6.', '.5.', '.4.', '.3.', '.2.', '.1.', '.0.']
get_wikitext2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
get_wikitext2 testenc
get_wikitext2 test_enc {'input_ids': tensor([[    1, 29871,    13,  ...,    13,    13,    13]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}
avg_loss = 1.6654343110235792: 100%|██████████| 889/889 [04:57<00:00,  2.99it/s]
get_c4
Resolving data files: 100%|█████████████████| 1024/1024 [00:30<00:00, 33.45it/s]
Resolving data files: 100%|████████████████| 1024/1024 [00:02<00:00, 360.85it/s]
Resolving data files: 100%|█████████████| 7168/7168 [00:00<00:00, 201327.04it/s]
[... roughly 80 more "Resolving data files" progress lines trimmed ...]
Downloading data: 100%|████████████████████| 40.5M/40.5M [00:01<00:00, 25.0MB/s]
Downloading data files: 100%|█████████████████████| 1/1 [00:01<00:00,  1.62s/it]
Extracting data files: 100%|██████████████████████| 1/1 [00:00<00:00,  1.94it/s]
Generating validation split: 45576 examples [00:00, 185250.12 examples/s]
get_c4 testenc downloaded Dataset({
    features: ['text', 'timestamp', 'url'],
    num_rows: 45576
})
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

So no more errors, but it freezes. EDIT2: After more hours I can confirm it doesn't work. I changed 31 to 79 in the file for the layers, and it works for the two other scripts; I don't know what the problem is.

l4b4r4b4b4 commented 7 months ago

Haha, that freeze is either:

  1. the Hugging Face download endpoint being EXTREMELY slow, or
  2. the tokenizer taking a long time to tokenize the whole text (a quick timing check is sketched below).
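
To tell the two apart, a quick self-contained timing check could look like the hypothetical sketch below (substitute your own model path; the "\n\n" join mirrors the usual GPTQ data-utils style):

import time
from datasets import load_dataset
from transformers import AutoTokenizer

# Load just the single C4 validation shard, as in the patched loader above.
val = load_dataset(
    'allenai/c4',
    data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
    split='validation',
)

tok = AutoTokenizer.from_pretrained('your-70b-model-path')  # placeholder path

t0 = time.time()
enc = tok('\n\n'.join(val['text']), return_tensors='pt')  # ~45k docs in one go
print(f'tokenized {enc.input_ids.numel()} tokens in {time.time() - t0:.1f}s')

If the download finishes quickly but this step sits for a long while, the tokenizer is the bottleneck.
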
Undi95 commented 7 months ago

Haha, alright, I'll retry as soon as I can!