Caiyun-AI / DCFormer

MIT License

Inquiry about downstream task evaluation #9

Open szrrr04 opened 2 weeks ago

szrrr04 commented 2 weeks ago

Hello! I have already trained DCFormer on the Pile dataset, and now I would like to evaluate it on downstream tasks. I noticed that the Lambada evaluation set is mentioned in your paper, and it seems to allow direct evaluation after loading the pre-trained model, without further fine-tuning, right? I would like to ask how to run evaluation on the trained model. Is it mainly a matter of modifying the configuration file: setting only_eval to true, specifying the correct dataset_path, and, crucially, setting load_parameters_path? Additionally, do I need to add a function to handle the Lambada data, modify the input pipeline script, and then adjust the evaluation metrics in the eval_step of the training script? Is that generally how it should be done?

hilbertmeng commented 2 weeks ago

@szrrr04 In practice, we first convert MaxText models to PyTorch models, then evaluate downstream tasks with lm-evaluation-harness. However, the whole process is a bit complicated.

szrrr04 commented 2 weeks ago

Thank you for sharing all this information with me! I will take a closer look at it. However, I still have a question. I noticed that your code has a config item called load_parameters_path. By pointing this path at the checkpoints of a model I previously trained and making some modifications to the input_pipeline_interface script, I managed to fine-tune and evaluate that pre-trained model on the Lambada dataset. After comparing, I confirmed that it indeed loads the parameters of the previously trained model. Is it okay to do it this way, or is it necessary to convert to PyTorch first? Thank you very much!

hilbertmeng commented 2 weeks ago

@szrrr04 Sure. It is okay to fine-tune and evaluate the trained model in JAX, just as you have done. You can manually download any dataset and write code to evaluate the performance. But if you want to evaluate a variety of downstream tasks or compare performance with other language models, converting it to PyTorch and using lm-evaluation-harness is more convenient and efficient, because it lets you test and compare models on a large number of different evaluation tasks within the same framework.
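For reference, here is a minimal sketch of driving lm-evaluation-harness from Python once a converted, Hugging Face-compatible checkpoint is available (this assumes a recent harness version that exposes lm_eval.simple_evaluate; the model path, task name, and batch size below are placeholders, not values from this thread):

import lm_eval

# Evaluate a converted checkpoint on a single prefill-style task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/converted_dcformer",  # local folder or hub id
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])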

szrrr04 commented 2 weeks ago

Ok, I got it. Thank you!

szrrr04 commented 2 weeks ago

Hello, I noticed that the synthetic dataset you provided is only a single dataset, and based on the processing in the input_pipeline_interface script, it only returns the training data iterator without returning an evaluation data iterator. How should I evaluate the performance of a pre-trained model on this dataset? Is it by looking at the rate of decline in training loss? Or is there another approach?

hilbertmeng commented 2 weeks ago

@szrrr04 The synthetic dataset class in input_pipeline_interface.py is inherited from the MaxText repo. I think it's just a "placeholder" for debugging, so you can ignore it. The synthetic dataset in our paper is provided here, and we evaluated it with converted PyTorch models.

szrrr04 commented 2 weeks ago

Thank you!


szrrr04 commented 1 week ago

Hello, I encountered an issue while converting JAX to PyTorch using the maxtext2torch.py you provided. The specific error messages are as follows:

My training was conducted exactly as you instructed, and the model checkpoints were saved and loaded correctly. Why is this issue occurring?

WARNING:absl:Configured CheckpointManager using deprecated legacy API. Please follow the instructions at https://orbax.readthedocs.io/en/latest/api_refactor.html to migrate by August 1st, 2024.
WARNING:absl:CheckpointMetadata file does not exist: /home/u2022212091/u2022212091/DcFormer/jax/output/checkpoints/0/_CHECKPOINT_METADATA
... (the same warning repeats for every saved step up to 39000) ...
WARNING:absl:CheckpointMetadata file does not exist: /home/u2022212091/u2022212091/DcFormer/jax/output/checkpoints/39000/_CHECKPOINT_METADATA
/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py:1407: UserWarning: Couldn't find sharding info under RestoreArgs. Populating sharding info from sharding file. Please note restoration time will be slightly increased due to reading from file instead of directly from RestoreArgs. Note also that this option is unsafe when restoring on a different topology than the checkpoint was saved with.
  warnings.warn(
ERROR:root:cannot reshape array of size 1 into shape (1,4,1,1,1,1)
... (the same error repeats many times) ...
read_dir /home/u2022212091/u2022212091/DcFormer/jax/output/checkpoints load_step 39000
read_dir here /home/u2022212091/u2022212091/DcFormer/jax/output/checkpoints load_step 39000
Traceback (most recent call last):
  File "/home/u2022212091/u2022212091/DCFormer-pytorch/maxtext2torch.py", line 109, in <module>
    weights = load_model(read_dir, load_step=load_step)
  File "/home/u2022212091/u2022212091/DCFormer-pytorch/maxtext2torch.py", line 36, in load_model
    weights = mngr.restore(load_step, items=item)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 1207, in restore
    restored = self._checkpointer.restore(restore_directory, args=args)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/checkpointer.py", line 211, in restore
    restored = self._handler.restore(directory, args=ckpt_args)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/composite_checkpoint_handler.py", line 471, in restore
    restored[item_name] = handler.restore(
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 642, in restore
    return self._handler_impl.restore(directory, args=args)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/base_pytree_checkpoint_handler.py", line 796, in restore
    restored_item = asyncio.run(
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/base_pytree_checkpoint_handler.py", line 660, in _maybe_deserialize
    deserialized_batches += await asyncio.gather(*deserialized_batches_ops)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1461, in deserialize
    ret = await asyncio.gather(*deserialize_ops)
  File "/home/u2022212091/.conda/envs/20240816/lib/python3.10/site-packages/orbax/checkpoint/serialization.py", line 315, in async_deserialize
    raise ValueError(
ValueError: sharding passed to deserialization should be specified, concrete and an instance of jax.sharding.Sharding. Got None

Lisennlp commented 1 week ago

Because the machine where you load the model is different from the machine where you trained it, the sharding strategies differ, so you need to manually delete the _sharding file in the model folder.
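For instance, a minimal sketch of removing those files (assuming the checkpoint layout shown in the logs above; it may be safer to move or back them up rather than delete them outright):

from pathlib import Path

ckpt_root = Path("/path/to/output/checkpoints")  # placeholder path

# Orbax stores a _sharding file alongside each saved step; removing it lets
# the sharding be re-derived on the machine that loads the checkpoint.
for sharding_file in ckpt_root.rglob("_sharding"):
    print(f"removing {sharding_file}")
    sharding_file.unlink()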

szrrr04 commented 1 week ago

I followed your suggestion and deleted the _sharding file, but it still doesn't work and the following error occurred (screenshots attached). Does it have anything to do with this warning?

Lisennlp commented 1 week ago

You can try changing the orbax-checkpoint version.

szrrr04 commented 1 week ago

Thank you for your suggestion! I tried upgrading and downgrading the orbax-checkpoint version, but the issue wasn't resolved. After that, I created a new virtual environment and reinstalled the requirements (still using orbax-checkpoint=5.2.0). The previous issue was resolved, but now I'm encountering the following problem, and upgrading or downgrading orbax-checkpoint afterward made no difference. May I ask whether your Python version is 3.10? I just want to confirm that this is an environment compatibility issue, not a problem with the script or the checkpoints I trained. Thank you! (screenshot attached)

Lisennlp commented 1 week ago


It's not a Python problem (python==3.10 is right). Because the orbax-checkpoint package has been updated, the older model-loading code sometimes needs slight adjustments for compatibility. I've just updated the script that converts MaxText checkpoints to the PyTorch format; you can try the updated version. Before doing so, you'll need to downgrade orbax-checkpoint to version 0.2.6.

szrrr04 commented 1 week ago

Thank you very much! I followed your new requirements.txt and the updated maxtext2torch.py, and some of the previous warnings have disappeared. However, the loop-lock error still persists. It then occurred to me that this might be because I installed jax[cuda11_pip]==0.4.30, while during training I used jax[cuda12_pip]==0.4.30, and my compiler is also CUDA 12.2. So I switched to jax[cuda12_pip]==0.4.30, but ran into package conflicts (mainly with torch). I therefore left the torch version unpinned and let pip resolve it, which installed torch 2.4.0. After running it again, the loop-lock issue still occurred, which is really strange. (screenshot attached)

I believe I followed your instructions completely. I'm running this on a supercomputing platform: I created a brand-new conda virtual environment with Python 3.10, installed the requirements from the requirements.txt you provided with pip, and then ran the latest version of the maxtext2torch.py script (my sbatch script is attached as a screenshot). Everything should be fine, so why does this loop-lock error keep occurring? /(ㄒoㄒ)/~~ My compiler is CUDA 12.2, but I don't think that should matter, since this conversion script runs on the CPU, and my compiler version is the same one I used during training.

May I ask what else could be causing this issue?

Lisennlp commented 1 week ago

I looked into this error and found it in an orbax issue. Someone said it was a Python problem, which is strange because the problem was supposedly fixed after Python 3.10; another commenter said upgrading Python to 3.11 fixed it. You can try that. By the way, the Python version we use is 3.10.10; I don't know whether the patch version after 3.10 makes a difference.

szrrr04 commented 1 week ago

Thank you so much! I've upgraded the Python version to 3.10.10, and the compatibility issue is finally resolved! However, a new minor problem has come up, which seems to be related to shape transformation in the script. Does the script need to be modified to fix this? It looks like there's a minor issue in the update_weight_from_maxtext function. (screenshot attached)

hilbertmeng commented 1 week ago

@szrrr04 This error occurs because the product of head_dim and num_heads is not equal to model_dim. As the screenshot above shows, head_dim in your case is 128, not 64 (1024 // 16). You can temporarily modify this line to specify head_dim manually as follows.

N, E, H, D = vocab_size, model_dim, num_heads, 128

Note that this is not a common head_dim setting, so be careful when comparing against other same-sized models.

szrrr04 commented 1 week ago

Thank you! I followed your suggestions, but I encountered the following errors. I haven't made any modifications to the model configuration or attention mechanism in my DCFormer-JAX training code; I used it directly for training (with dcformer_405m.yml). Based on the error, it seems that the weight dimensions of the initialized PyTorch model don't match the dimensions of my pre-trained weights.

Traceback (most recent call last):
  File "/home/u2022212091/u2022212091/DCFormer-pytorch/maxtext2torch.py", line 116, in <module>
    model = update_weight_from_maxtext(model, weights, vocab_size=50256, num_blocks=2, model_dim=1024, num_heads=16)
  File "/home/u2022212091/u2022212091/DCFormer-pytorch/maxtext2torch.py", line 84, in update_weight_from_maxtext
    model.load_state_dict(state_dict, strict=False)
  File "/home/u2022212091/.conda/envs/20240902/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DCFormer:
    size mismatch for layers.0.attention.wqkv.weight: copying a param with shape torch.Size([6144, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 1024]).
    size mismatch for layers.0.attention.wo.weight: copying a param with shape torch.Size([1024, 2048]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for layers.0.attention.q_norm.scale: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for layers.0.attention.k_norm.scale: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
    ... (the same four size mismatches repeat for layers 1 through 23) ...

There was also an additional warning:

/home/u2022212091/u2022212091/DCFormer-pytorch/maxtext2torch.py:82: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  state_dict[k] = torch.tensor(v)

Lisennlp commented 1 week ago

Sorry! head_dim was set to 128 in dcformer_405m.yml, which is a mistake. It should actually be 64, because model_dim is 1024 and base_num_query_heads is 16, and 1024 // 16 equals 64.

szrrr04 commented 1 week ago

No problem, so I need to first change the head_dim in dcformer_405m.yml to 64, then retrain the model before converting it to PyTorch and proceeding with the subsequent evaluation, right?

Lisennlp commented 1 week ago

Yes, the problem you're facing now is just the model shape. After changing it, you should be able to move on.

szrrr04 commented 1 week ago

Ok, thank you!

Lisennlp commented 1 week ago

No need to change anything else. Do you mean that it runs with head_dim set to 128, but runs out of memory when you change it to 64?

szrrr04 commented 1 week ago

I just realized that after changing head_dim back to 128, the same issue still occurred. I'll try again on a different partition of the cluster to see if that helps. Sorry!

Update: after I switched to a different partition, everything worked fine again.

szrrr04 commented 1 week ago

Hello! After looking into it, I was wondering: is the simplest way to evaluate my model with the lm-evaluation library after training to first upload the model to Hugging Face and then use the command-line evaluation the library provides? I noticed that you've uploaded your models to Hugging Face, but since I've made some modifications to the base DCFormer (I have two trained models, one unmodified and one modified), I can't use yours and will need to upload new ones myself.

I want to ask, is the pytorch_model.bin file the same as the one generated after I converted the weights using the maxtext2torch.py script? Do I just need to rename the file? Also, if I made some modifications to the dc_attention script in the JAX training framework, do I also need to make the same modifications to the modeling_dcformer script in the PyTorch framework? Other scripts shouldn't need much modification, right?

Thank you! (screenshot attached)

hilbertmeng commented 1 week ago

@szrrr04

  1. You don't have to upload your trained model to Hugging Face for evaluation. Instead, you can load your local models by modifying the lm-evaluation-harness repo, which is more convenient in my view (e.g. the model-loading and tokenizer-creation code).
  2. The pytorch_model.bin file is just the weight file produced by the maxtext2torch.py script, apart from the name; renaming it is only needed to meet Hugging Face's requirements. After converting to PyTorch, you can load the model as follows and insert this snippet into the lm-evaluation-harness repo without renaming the file.
    from modeling_dcformer import DCFormer
    dcformer = DCFormer.from_pretrained("the_local_model_folder_path")
  3. If you modify dc_attention.py in JAX, you need to change modeling_dcformer.py accordingly to match the model architecture. In addition, if you add new parameters or change existing ones, you have to map those modified parameters between JAX and PyTorch in maxtext2torch.py. To verify the alignment, compare the loss of the JAX model and the PyTorch model on a mini-dataset (10k tokens is empirically enough); the loss difference should be around 0.001. A sketch of such a check follows this list.
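Here is a rough sketch of such an alignment check on the PyTorch side (the model folder path and the token file are placeholders; the JAX-side loss would come from the training code on the same mini-dataset, and the cache setup discussed later in this thread may also be needed before the forward pass):

import torch
import torch.nn.functional as F
from modeling_dcformer import DCFormer

model = DCFormer.from_pretrained("the_local_model_folder_path")
model.eval()

# A small batch of token ids from the same mini-dataset fed to the JAX model,
# shape (batch, seq_len); the file name here is a placeholder.
token_ids = torch.load("mini_dataset_tokens.pt")

with torch.no_grad():
    logits = model(token_ids[:, :-1]).logits
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
print(f"PyTorch loss: {loss.item():.4f}")  # compare against the JAX loss on the same batch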
szrrr04 commented 1 week ago

Thank you! My model conversion has now been successfully completed! I carefully read your response, but I still didn’t quite understand it. I’m really sorry about that. This is my first time training and evaluating a large model, so I might be a bit slow in grasping some concepts. Here’s my current understanding:

First, I fork this repository to my own account and install it in editable mode:

After forking...

git clone https://github.com/<username>/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e ".[dev]"

Then I can modify the scripts on my side and the changes take effect immediately, right?

Next, do I need to modify the file lm-eval/models/huggingface.py? According to the section you pointed out:

self._model = self.AUTO_MODEL_CLASS.from_pretrained(
    pretrained,
    revision=revision,
    torch_dtype=get_dtype(dtype),
    trust_remote_code=trust_remote_code,
    **model_kwargs,
)

Should I replace pretrained with the path to my own folder here? Or how should this be done?

from modeling_dcformer import DCFormer
dcformer = DCFormer.from_pretrained("the_local_model_folder_path")

And where exactly should this code be added? Is it also in the huggingface.py file?

Additionally, I noticed that in the folder generated by maxtext2torch.py, there are three files: config.json, generation_config.json, and pytorch_model.bin. So, should the_local_model_folder_path just point directly to this folder containing these three files?

Lastly, can I use a command like this to evaluate? I saw an example in the examples folder that looks like this (screenshot attached); do I just change the pretrained parameter to my folder path?

hilbertmeng commented 5 days ago

@szrrr04

So, should the_local_model_folder_path just point directly to this folder containing these three files?

Yes, exactly.

You can modify lm-eval/models/huggingface.py in this way.

  1. Copy modeling_dcformer.py and configuration_dcformer.py to lm-eval/models/
  2. Add from modeling_dcformer import DCFormer at the beginning of lm-eval/models/huggingface.py.
  3. Modify the parts related to loading the model and the tokenizer. Here we assume the model folder path contains the string "DCFormer".

For the model-loading part, we can load it locally.

if 'DCFormer' in pretrained:
    self._model = DCFormer.from_pretrained("the_local_model_folder_path") 
    # you can also do some custom modification to _model here 
else:
    self._model = self.AUTO_MODEL_CLASS.from_pretrained(
        pretrained,
        revision=revision,
        torch_dtype=get_dtype(dtype),
        trust_remote_code=trust_remote_code,
        **model_kwargs,
    )

For the tokenizer-loading part, we assume that the tokenizer does not change.

if 'DCFormer' in pretrained:
    self.tokenizer = transformers.AutoTokenizer.from_pretrained("Caiyun-AI/DCFormer-2.8B")
else:
    self._create_tokenizer(
        pretrained,
        tokenizer,
        revision=revision,
        trust_remote_code=trust_remote_code,
        use_fast_tokenizer=use_fast_tokenizer,
    )

Finally, you can evaluate with the following command.

lm_eval --model hf --model_args pretrained=DCFormer --tasks demo_boolq
szrrr04 commented 5 days ago

Thank you so much! I have made all the changes exactly as you suggested, adding the necessary code and files. This is my command line when running the script (screenshot attached). However, I encountered an error saying that transformers couldn't recognize dcformer. Do I need to modify something else in the Hugging Face scripts?

hilbertmeng commented 5 days ago

@szrrr04 Can you provide the full error stack?

szrrr04 commented 5 days ago

Sorry, I just realized that the image upload failed; here it is again. (screenshot attached)

hilbertmeng commented 5 days ago

@szrrr04 You need to modify get_config.

from configuration_dcformer import DCFormerConfig # also place it at the beginning

if 'DCFormer' in pretrained:
    self._config = DCFormerConfig.from_pretrained("the_local_model_folder_path")
else:
    self._get_config(
        pretrained,
        revision=revision,
        trust_remote_code=trust_remote_code,
    )
szrrr04 commented 5 days ago

Thank you, all the previous issues have been resolved! However, I found that the cluster I am using cannot access Hugging Face, so there was an error when loading the dataset (the tokenizer issue has been fixed; screenshot attached). Should I first download the dataset I want to evaluate, and then modify self.dataset = datasets.load_dataset in lm-evaluation-harness/lm_eval/api/task.py to use something like from datasets import load_from_disk and dataset = load_from_disk("/path/to/lambada") to load it from a local path?
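For what it's worth, a minimal sketch of that workaround (the dataset id and paths are placeholders; the download step runs on a machine with internet access, and the saved folder is then copied to the cluster):

from datasets import load_dataset, load_from_disk

# On a machine that can reach Hugging Face:
load_dataset("EleutherAI/lambada_openai").save_to_disk("/path/to/lambada")

# On the offline cluster, in place of datasets.load_dataset(...) inside lm_eval/api/task.py:
dataset = load_from_disk("/path/to/lambada")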

hilbertmeng commented 5 days ago

That looks like a good workaround; you can try it. I didn't run into this error myself.

szrrr04 commented 4 days ago

I took a different approach by running the evaluation in a Kaggle Jupyter notebook, which lets me avoid the network issues. However, I'm now encountering this error: it says an assertion was raised in modeling_dcformer.py. Since I haven't made any modifications to modeling_dcformer.py or any other files related to the evaluation process (this is the unmodified version of my model), I would like to ask why this is happening. Could you provide any insights? (screenshot attached)

hilbertmeng commented 4 days ago

Generally, there are two types of evaluations for LLMs: prefill (comparing the log probability of a sentence) and generation (generating an answer from a given prompt). Currently, in lm-evaluation-harness, we can only evaluate DCFormer on prefill tasks (e.g. LAMBADA). For generation tasks (e.g. GSM8K), further modification is needed due to the custom kv_cache. However, most tasks are prefill tasks, so you can still evaluate many of them.

To fix the above error, you need to set up caches for the forward pass after loading the model. Of these caches, freqs_cis is used in both prefill and generation, while kv_cache is used only in generation. Here we take max_batch_size as 1 and max_seq_length as 2048. max_seq_length should stay the same as in training; max_batch_size is only used in generation and has no effect on prefill.

if 'DCFormer' in pretrained:
    self._model = DCFormer.from_pretrained("the_local_model_folder_path") 
    with torch.device(self.device):
        self._model.setup_caches(max_batch_size=1, max_seq_length=2048, set_kv_cache=True)
else:
    self._model = self.AUTO_MODEL_CLASS.from_pretrained(
        pretrained,
        revision=revision,
        torch_dtype=get_dtype(dtype),
        trust_remote_code=trust_remote_code,
        **model_kwargs,
    )
szrrr04 commented 4 days ago

Thank you! I've resolved the previous issue, but now I'm encountering the following problem. It seems that my indices and related tensors are not on the same device. Should I continue modifying the Hugging Face script to fix this?

  File "/opt/conda/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/kaggle/working/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/kaggle/working/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
  File "/kaggle/working/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/evaluator.py", line 476, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1124, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/huggingface.py", line 836, in _model_call
    return self.model(inps).logits
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/modeling_dcformer.py", line 160, in forward
    freqs_cis = self.freqs_cis[input_pos][:idx.shape[-1]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Running loglikelihood requests:   0%| | 0/10306 [00:00<?, ?it/s]

hilbertmeng commented 3 days ago

For simplicity, you can make sure inputs and the model share the same device in the forward function.

# add code below the line https://github.com/Caiyun-AI/DCFormer/blob/main/pytorch/dcformer/modeling_dcformer.py#L153
idx = idx.to(self.device)

If it doesn't work, print the devices of the model, freqs_cis, and the inputs, then make sure they are all on the same device; a small sketch of such debug prints follows.
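For example, debug lines of the kind meant here might look like this (placed near the top of DCFormer.forward; the attribute names follow the modeling_dcformer.py discussed in this thread):

# Inside DCFormer.forward, before freqs_cis is indexed
print("model device:    ", next(self.parameters()).device)
print("freqs_cis device:", self.freqs_cis.device)
print("input device:    ", idx.device)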

szrrr04 commented 3 days ago

Thank you so much! I printed the devices of the model, freqs_cis, and the inputs, then made sure they were on the same device, which solved the issue. After that, I also resolved some issues with float32 and float16 not being able to compute together. Now, suddenly, this issue has appeared:

Traceback (most recent call last):
  File "/opt/conda/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/kaggle/working/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/kaggle/working/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
  File "/kaggle/working/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/evaluator.py", line 476, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1124, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/huggingface.py", line 836, in _model_call
    return self.model(inps).logits
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/working/lm-evaluation-harness/lm_eval/models/modeling_dcformer.py", line 168, in forward
    x = self.tok_embeddings(idx)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

After this, I added some debug print statements in the modeling_dcformer script and found the following (screenshot attached): it seems that the model's token index might be exceeding the vocabulary range. How can I solve this?

Here is the latest version of the modeling_dcformer script after fixing some bugs. https://gist.github.com/szrrr04/0e0241ce2c2e65b62c50fd78e89584d8

I suddenly thought, could the issue be related to the tokenizer? I loaded the tokenizer for dcformer-2.8B that you uploaded to Huggingface, but the model I’m using is dcformer-405M. Do you think this is the problem? Could you provide the tokenizer for dcformer-405M? Thank you!

hilbertmeng commented 3 days ago

Sorry, there is a mistake in the vocab_size setting in the training config. The best solution is to change vocab_size in the training config to 50432, then retrain and evaluate the model. Also, dcformer-2.8B and dcformer-405M use the same tokenizer, so the tokenizer is unrelated.

The other solution is to clamp out-of-bound token indices to the maximum index, because those tokens are just different numbers of spaces (https://huggingface.co/Caiyun-AI/DCFormer-2.8B/raw/main/tokenizer.json). Moreover, tokens exceeding the vocab size are infrequent and will have a very minor effect on evaluation. You can temporarily modify the forward function to bypass the error as follows.

# below this line https://gist.github.com/szrrr04/0e0241ce2c2e65b62c50fd78e89584d8#file-gistfile1-txt-L159
idx = idx.clamp(max=self.tok_embeddings.num_embeddings-1)
szrrr04 commented 2 days ago

Thank you! I used the second method and successfully completed the evaluation on the LAMBADA dataset! However, I noticed that I couldn't evaluate on other datasets (I tried RACE, PIQA, etc.), and similar errors like the one below regarding out-of-bounds idx were thrown.

So why does the second method only work for certain datasets after the modification? (screenshot attached)

hilbertmeng commented 2 days ago

You should first identify the actual error by setting an environment variable, then fix it.

CUDA_LAUNCH_BLOCKING=1 lm_eval --model hf ....

Because PyTorch executes CUDA kernels asynchronously, the error stack above can be misleading.

szrrr04 commented 2 days ago

I had already set CUDA_LAUNCH_BLOCKING=1 in the command line above, so this is the correct traceback. However, I have fixed it myself: I handled the error in the Hugging Face script in the same way, by adding this line: cont_toks = cont_toks.clamp(max=logits.size(-1) - 1), which allows the evaluation to run correctly. May I ask whether this will also have only a minor impact on the evaluation results? (screenshot attached)

hilbertmeng commented 2 days ago

Yes, I think clamping cont_toks will also have only a minor impact on the final result. In the code comments, cont_toks refers to continuation tokens, which may contain out-of-bound tokens due to padding or runs of spaces.

You can also print the token index lists that contain out-of-bound tokens, then use the tokenizer to decode those sentences and see what is really happening. I checked the last token in the vocabulary and reproduced a case where out-of-bound tokens occur in the screenshot below; a small sketch of this kind of check follows it.

(screenshot attached)
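A small sketch of that kind of inspection (the tokenizer id is the one used earlier in this thread, 50256 is the embedding size used when converting the model earlier in this thread, and the example token ids are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Caiyun-AI/DCFormer-2.8B")
model_vocab_size = 50256  # embedding size of the converted model

# Token ids of one evaluated continuation (placeholder example).
token_ids = [2061, 318, 50303]
out_of_bound = [t for t in token_ids if t >= model_vocab_size]
print("out-of-bound ids:", out_of_bound)
# Decoding shows such ids are typically runs of spaces near the end of the vocabulary.
print("decoded:", tokenizer.decode(out_of_bound))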
szrrr04 commented 1 day ago

Thank you! I got it.
