Closed LiminalCrew closed 1 month ago
Try upgrading transformers to version 4.38.1. If you don't feel comfortable doing it or don't want to risk it, I can give you the instructions.
Hey, thanks for the reply. Thing is, I've just installed it, so it should already be the most recent version, no?
No, transformers is pinned to a specific version, so the installer doesn't pull the latest release.
As additional info, can you say which version of the trainer you are using (main branch or dev branch) and post the full error?
This is the final part of the log with the full error:
```
use AdamW optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.99)}
override steps. steps for 1 epochs is / 指定エポックまでのステップ数: 80
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 996, in <module>
    trainer.train(args)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 400, in train
    unet = accelerator.prepare(unet)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = tuple(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1285, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1090, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1388, in prepare_model
    autocast_context = get_mixed_precision_context_manager(self.native_amp, self.autocast_handler)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1454, in get_mixed_precision_context_manager
    return torch.autocast(device_type=state.device.type, dtype=torch.float16, **autocast_kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'
Failed to train because of error: Command '['/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/bin/python', 'sd_scripts/train_network.py', '--config_file=runtime_store/config.toml', '--dataset_config=runtime_store/dataset.toml']' returned non-zero exit status 1.
```
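For context, this fails because `torch.autocast` only accepts certain `device_type` values, and the PyTorch build in use here rejects `'mps'`. A hedged, pure-Python sketch of the guard a trainer could apply (the function name and the version threshold are assumptions, not this trainer's actual code — check your installed PyTorch's release notes):

```python
# Hypothetical guard: decide what to pass to torch.autocast as device_type.
# Older PyTorch builds raise "unsupported autocast device_type 'mps'", so on
# those builds the safe fallback on Apple Silicon is to skip mixed precision.
def autocast_device_type(device, torch_version):
    """Return a device_type safe for torch.autocast, or None to skip AMP."""
    supported = {"cuda", "cpu"}
    # Assumption: mps autocast support arrived in a later PyTorch release;
    # verify this threshold against the version you actually have installed.
    if torch_version >= (2, 5):
        supported.add("mps")
    return device if device in supported else None
```

With this shape, the caller would only enter an autocast context when the function returns a device type, and otherwise train in full precision.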
Not sure how to check the trainer version, but I used the main shell script, so that should download and install from the main branch.
Try installing the version on the dev branch; it's more up to date than main. For comparison, the last update to main was 3 months ago, while dev was last updated 2 days ago xD. To install the dev branch version, run git clone -b dev https://github.com/derrian-distro/LoRA_Easy_Training_Scripts and use the installer file like you did before.
Actually, I managed to upgrade transformers to version 4.38.1 as you recommended, and that solved the previous runtime error.
Now I got past that but have a new one:
```
load StableDiffusion checkpoint: /Users/xxxxxxxxxx/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.ckpt
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 996, in <module>
    trainer.train(args)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 234, in train
    model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 103, in load_target_model
    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/train_util.py", line 3960, in load_target_model
    text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/train_util.py", line 3914, in _load_target_model
    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/model_util.py", line 1072, in load_models_from_stable_diffusion_checkpoint
    info = text_model.load_state_dict(converted_text_encoder_checkpoint)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
	Unexpected key(s) in state_dict: "text_model.embeddings.position_ids".
Failed to train because of error: Command '['/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/bin/python', 'sd_scripts/train_network.py', '--config_file=runtime_store/config.toml', '--dataset_config=runtime_store/dataset.toml']' returned non-zero exit status 1.
```
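Background on this particular mismatch: transformers 4.31+ removed the registered buffer `text_model.embeddings.position_ids` from `CLIPTextModel`, so a checkpoint converted under an older version carries a key the newer model no longer expects. The usual workaround is to drop that key before `load_state_dict` (or to load with `strict=False`). A plain-dict sketch of the idea (the helper name is illustrative, not part of the trainer):

```python
# Drop keys that newer transformers versions no longer register as buffers.
# Shown with a plain dict; in practice the values would be torch tensors.
def strip_stale_keys(state_dict):
    stale = {"text_model.embeddings.position_ids"}
    return {k: v for k, v in state_dict.items() if k not in stale}
```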
Do you think it's best I restart with the dev branch, or would it be easier to solve this one?
My suggestion would be to install the dev branch trainer in a separate folder and see if it works; if it's the same issue, we can try to fix it on dev, since it's more up to date. To clone into a different folder: git clone -b dev https://github.com/derrian-distro/LoRA_Easy_Training_Scripts /path/to/destination/folder
Also, that error seems to come from a mismatched transformers version, so you'd have to update the scripts anyway; so yeah, the dev branch should be better.
Thanks I'll install the dev branch and be back with news. Appreciate your help man!
I installed it and tried to run it, but first I had to get rid of a few errors that were preventing the install. Basically, install.py throws errors because it checks for sys.platform == "linux" or "win32" in several places, so I brutally fixed it by replacing "linux" with "darwin" so it could pass on macOS. Maybe this is useful for you as you evolve the script. When I launched a test training, I got a different error:
```
sh run.sh
HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e179540>: Failed to establish a new connection: [Errno 61] Connection refused'))
```
Not sure what went wrong. During the installation there were no errors, apart from two questions, which I answered as follows:
Are you using colab? (y/n): n
Are you using this locally? (y/n): y
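The install.py workaround described above could be sketched like this: rather than rewriting "linux" to "darwin", the platform check could accept macOS explicitly (the function name is illustrative, not the script's actual code):

```python
import sys

# Accept macOS ("darwin") alongside the platforms install.py already checks,
# instead of patching "linux" to "darwin" by hand.
def platform_supported():
    return sys.platform in ("linux", "win32", "darwin")
```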
That's when you use the "Start Training" button right?
Yes exactly. The connection refused happens at that time.
What happens if, in the UI, you set the URL to 0.0.0.0:8000 instead of 127.0.0.1:8000?
Same thing, apparently; this is the error in the shell:
```
HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e178790>: Failed to establish a new connection: [Errno 61] Connection refused'))
```
Maybe you have something running on port 8000? Try, for example, 127.0.0.1:8001.
Even changing the port, it gives the same error:
```
HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e17b6d0>: Failed to establish a new connection: [Errno 61] Connection refused'))
```
I also used the lsof command to see if there's anything running, but that does not seem to be the case.
How can I tell whether the local server is even running? Actually, I also get an error when closing the script, which sounds like it cannot stop the server because it can't find it:
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /stop_server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e1032b0>: Failed to establish a new connection: [Errno 61] Connection refused'))
```
Any thoughts?
Did you have anything open while running the trainer?
Nothing crazy, just Chrome with some tabs, a text editor and the shell.
strange tbh
When I launch with "sh run.sh" it writes nothing to the terminal, as if nothing is happening. Then I load a previous TOML; again, nothing in the terminal. Then, immediately after starting training, I get the connection error. Finally, closing the script gives the error about stopping the server.
This is the full log:

```
(base) xxxxxxxxxx@Emilias-MBP LoRA_Easy_Training_Scripts % sh run.sh
HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1144c5540>: Failed to establish a new connection: [Errno 61] Connection refused'))
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 400, in request
    self.endheaders()
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 238, in connect
    self.sock = self._new_conn()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x1144c4b80>: Failed to establish a new connection: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /stop_server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1144c4b80>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/main.py", line 69, in
```
Yeah, you don't get anything in the terminal when you launch the UI or load a toml.
Also, I see you still use 0.0.0.0:8001; have you tried 127.0.0.1:8001?
You could also try checking your private IP (it should look like 192.168.x.x) and using it instead of 0.0.0.0 or 127.0.0.1.
It looks like it does not matter what IP address I put in the UI, because there's no active server listening; I think it did not start. Is there a way to verify whether it's actually running and listening on some port?
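One quick way to check whether anything is listening on the backend address the UI points at is a plain socket probe (host and port here are whatever you configured, not fixed values):

```python
import socket

def server_listening(host="127.0.0.1", port=8000):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:  # ConnectionRefusedError, timeout, unreachable, ...
        return False
```

Running this while the trainer UI is open would tell you whether the backend process ever bound the port, which is equivalent to what `lsof -i :8000` would show.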
I just noticed that the folder structure is different from the main branch: there is a backend subfolder with separate installer.py and main.py, as well as install scripts. Maybe the install procedure is totally different from the main branch's and I didn't do everything needed. Do you have instructions for it?
It's the same steps
Ok, now I launched the installer for the backend and I am stuck at this error:
```
Looking in indexes: https://download.pytorch.org/whl/cu121
ERROR: Could not find a version that satisfies the requirement torch==2.2.1 (from versions: none)
ERROR: No matching distribution found for torch==2.2.1
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/backend/./installer.py", line 198, in
```
oh, cuz you don't have CUDA
Not sure what to do about that; the previous version's installer ran with no apparent issues. Any instructions to fix it?
wait, do you have a Nvidia GPU?
No, I am on a MacBook Pro with Apple M1 Pro chip which has 10-core CPU and 14-core GPU.
I think training is only possible on Nvidia GPUs, or at least they don't have problems. How much VRAM do you have?
At this point, I think your best option would be to use Google Colab or, if you have money for the pro plan, Paperspace. You can use either hollowstrawberry's colab or mine.
Seems to be solved so I'm gonna close this issue.
I've installed it and tried to run a test training, but I get this runtime error: `RuntimeError: User specified an unsupported autocast device_type 'mps'`
The environment is macOS with an M1 Pro chip. Not sure if it depends on the parameters, but I tried to launch the first test with basic choices. I'd be glad to do a test with a specific set of params if it can help rule out some possible causes. Is it a known problem, or is there a way to fix it?