As an update, while continuing to poke around to try to understand what was wrong, I attempted to get the TRLX example that inspired the above code to run. Because I'm running on an HPC with no internet access that only has V100s available, I had to make some modifications. I set `tracker=None` in the `TrainConfig` inside the `TRLConfig` to fully disable wandb. I tried some set of environment variables, but I could not seem to get that to work. I also tried to investigate setting wandb to offline mode as they recommend, but in digging into the TRLX source code, I think I would have to modify the source of some of the trainers to make offline mode a possibility, so I decided not to go that far for this test.
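For reference, this is roughly the change I made (a minimal sketch; the `tracker` field is what I actually set, but using the `default_ppo_config` helper here instead of the full config from the example is just for brevity):

```python
# Minimal sketch: disable experiment tracking entirely instead of relying on
# wandb offline mode. Assumes this trlx version exposes default_ppo_config().
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.train.tracker = None  # turn off wandb reporting completely
```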
a possibility. I decided not to go this far for this test.NousResearch/Llama-2-7b-hf
in the original file to facebook/opt-1.3b
from the notebook I was trying to recreate) which still nearly maxed out the GPU memory (nearly 30 [GB] which seemed larger than I expected).This seemed to work except for a few things.
```
[2023-10-05 02:35:13,743] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using the latest cached version of the module from <the base of the set cache directory>/huggingface/modules/datasets_modules/datasets/imdb/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0 (last modified on Thu Oct 5 02:01:00 2023) since it couldn't be found locally at imdb.
[RANK 0] Initializing model: facebook/opt-1.3b
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[RANK 0] Starting training
[RANK 0] Collecting rollouts
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
```
Note that the `HF_DATASETS_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` environment variables have been set (although only the first one is relevant here), and, as you can see, I don't get that cached-version warning for the models. I also see this warning:

```
<the base of the home directory>/.local/lib/python3.9/site-packages/transformers/pipelines/base.py:1101: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(
```
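For completeness, this is roughly how I'm setting those offline flags (a sketch; setting them before the `datasets`/`transformers` imports is my understanding of what matters rather than something I've verified):

```python
# Sketch: set the Hugging Face offline flags before importing datasets/transformers
# so the locally cached copies are used instead of attempting network access.
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"    # use the locally cached imdb dataset
os.environ["TRANSFORMERS_OFFLINE"] = "1"   # use locally cached model/tokenizer files

import trlx  # noqa: E402  (imported after the environment is configured)
```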
I'm not sure if doing this experiment and seeing it work helps illuminate what might be going on with my original post. I haven't figured that out yet. The goal of this follow-up was not to ask a host of new questions but only to provide a different data point that might help clarify what is and is not broken in the previous example.
Note that the only other difference is that I ran this as a Python file, while the previous post was an attempt to use a Jupyter notebook on this machine.
However, when I reintroduce a LoRA configuration, I get the same `position_ids` error again:

`peft_config=LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,)`

So I'm trying to investigate what about that is causing the problem and how to fix it.
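For reference, this is roughly how I'm wiring the LoRA config in (a sketch; passing it through a `peft_config` field on trlx's `ModelConfig` is my understanding of how current main accepts it, so treat that wiring as an assumption on my part):

```python
# Sketch of reintroducing LoRA, assuming trlx's ModelConfig on main accepts a
# peft_config (the LoraConfig values are the ones from my run above).
from peft import LoraConfig
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.model.model_path = "facebook/opt-1.3b"
config.model.peft_config = LoraConfig(
    r=8,               # LoRA rank
    lora_alpha=16,     # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
)
```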
Thanks in advance for any help.
Hi @pbarragan! Sorry for the late response. The `position_ids` bug in question has been resolved; however, I can't reproduce your hanging behaviour in my environment (which has access to the Internet). You can still disable checkpointing, if you suspect that's one possible reason for the hang, by setting `checkpoint_interval` to `99999999` and `save_best` to `False`. The warning you observed is expected in this example, but it is also innocuous because of the tiny size of the reward model used.
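In config terms, that suggestion would look roughly like this (a sketch; using the default PPO config helper here is just for brevity):

```python
# Sketch: effectively disable checkpointing by pushing the interval past any
# realistic step count and turning off best-checkpoint saving.
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.train.checkpoint_interval = 99999999
config.train.save_best = False
```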
@maxreciprocate, thank you so much for the reply and the fix in #566! Unfortunately, I need the checkpointing as part of my workflow, so I'm going to keep poking around to see what might be causing it. I'm still guessing it has something to do with file locking being disabled in some of the directories on the server I have to use for this, but I don't have enough details to ask a sensible question yet. Thank you so much! I'll try to give this a shot as soon as possible and reopen if something related to the `position_ids` bug is still not working. Take care!
🐛 Describe the bug
Hi,
I'm very new to TRLX, PEFT, and Hugging Face, so I'm not sure if I just have some simple configuration wrong, but I am trying to recreate the notebook here, originally from this page. The notebook is a bit out of date, so I've slowly been working through various issues. The things I've done so far are:
I am now running this code (modified slightly from the notebook):
With this code, I still get a few warnings about things I can fix up later, but I run into the following error:
I don't think the first line about git is related (it just happens to print out in the same section), although I don't actually know what that problem is either. But the TypeError I get at the bottom looks very similar to #416, so I thought I'd ask here if anyone has any clues. However, there are some additional interactions with PEFT in my backtrace, so I'm not sure if the problem is actually related to PEFT or something else entirely. I also thought that perhaps I needed different versions of some of these libraries. Anyway, I would appreciate any help. Thanks!
Which trlX version are you using?
Main
Additional system and package information
PEFT version 0.5.0, Transformers 4.33.3