derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0

Runtime error on MacBook Pro with M1 chip #195

Closed — LiminalCrew closed this issue 1 month ago

LiminalCrew commented 3 months ago

I've installed it and tried to run a test training, but I get this runtime error: `RuntimeError: User specified an unsupported autocast device_type 'mps'`

The environment is macOS on an M1 Pro chip. Not sure if it depends on the parameters, but I tried to launch the first test with basic choices. I'd be glad to do a test with a specific set of params if it can help rule out some possible causes. Is it a known problem, or is there a way to fix this?
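For context, the error happens because `accelerate` passes the detected device type (`'mps'`, Apple's Metal backend) into `torch.autocast`, which the installed torch/accelerate combination rejects. As a hypothetical illustration of the device detection involved (not the trainer's actual code), a fallback check might look like:

```python
def pick_device() -> str:
    """Pick a torch device string, falling back to CPU when torch
    (or a given backend) is unavailable."""
    try:
        import torch
        # Apple-silicon GPU backend (Metal Performance Shaders)
        if torch.backends.mps.is_available():
            return "mps"
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

print(pick_device())
```

Even when `"mps"` is selected, float16 mixed precision via `torch.autocast` is what the traceback below shows failing, so disabling mixed precision is another avenue to try.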

Jelosus2 commented 3 months ago

Try upgrading transformers to version 4.38.1. If you don't feel capable of doing it or don't want to risk it, I can give you the instructions.
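For reference, a minimal way to do that upgrade inside the trainer's virtual environment might be the following (a sketch, assuming the venv lives at `sd_scripts/venv` as the paths in the logs suggest):

```shell
# From the LoRA_Easy_Training_Scripts/sd_scripts directory
source venv/bin/activate

# Upgrade past the pinned version
pip install -U transformers==4.38.1

# Confirm which version is now installed
pip show transformers
```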

LiminalCrew commented 3 months ago

Hey thanks for the reply. Thing is I've just installed it so it should already be the most recent version, no?

Jelosus2 commented 3 months ago

> Hey thanks for the reply. Thing is I've just installed it so it should already be the most recent version, no?

No, transformers is pinned to a specific version in the requirements, so the installer doesn't pull the latest.

Jelosus2 commented 3 months ago

As additional info, can you tell me which version of the trainer you are using (main branch or dev branch) and post the full error?

LiminalCrew commented 3 months ago

This is the final part of the log with the full error:

```
use AdamW optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.99)}
override steps. steps for 1 epochs is / 指定エポックまでのステップ数: 80
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 996, in <module>
    trainer.train(args)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 400, in train
    unet = accelerator.prepare(unet)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = tuple(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1285, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1090, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1388, in prepare_model
    autocast_context = get_mixed_precision_context_manager(self.native_amp, self.autocast_handler)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1454, in get_mixed_precision_context_manager
    return torch.autocast(device_type=state.device.type, dtype=torch.float16, **autocast_kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'
Failed to train because of error:
Command '['/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/bin/python', 'sd_scripts/train_network.py', '--config_file=runtime_store/config.toml', '--dataset_config=runtime_store/dataset.toml']' returned non-zero exit status 1.
```

Not sure how to check the trainer version, but I used the main shell script, so that should download and install from the main branch.

Jelosus2 commented 3 months ago

Try installing the version on the dev branch; it's more up to date than the main branch. For comparison, the main branch was last updated 3 months ago, while the dev branch was last updated 2 days ago xD. To install the dev branch version, run `git clone -b dev https://github.com/derrian-distro/LoRA_Easy_Training_Scripts` and use the installer file like you did before.

LiminalCrew commented 3 months ago

Actually I managed to upgrade transformers to version 4.38.1 as you recommended, and that solved the previous runtime error. Now I got past that but have a new one:

```
load StableDiffusion checkpoint: /Users/xxxxxxxxxx/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.ckpt
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 996, in <module>
    trainer.train(args)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 234, in train
    model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 103, in load_target_model
    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/train_util.py", line 3960, in load_target_model
    text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/train_util.py", line 3914, in _load_target_model
    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/library/model_util.py", line 1072, in load_models_from_stable_diffusion_checkpoint
    info = text_model.load_state_dict(converted_text_encoder_checkpoint)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
	Unexpected key(s) in state_dict: "text_model.embeddings.position_ids".
Failed to train because of error: Command '['/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/sd_scripts/venv/bin/python', 'sd_scripts/train_network.py', '--config_file=runtime_store/config.toml', '--dataset_config=runtime_store/dataset.toml']' returned non-zero exit status 1.
```

Do you think it's best I restart with the dev branch or you think it would be easier solving this one?

Jelosus2 commented 3 months ago

> Actually I managed to upgrade transformers to version 4.38.1 as you recommended, and that solved the previous runtime error. Now I got past that but have a new one: [...] `RuntimeError: Error(s) in loading state_dict for CLIPTextModel: Unexpected key(s) in state_dict: "text_model.embeddings.position_ids".` [...]
>
> Do you think it's best I restart with the dev branch or you think it would be easier solving this one?

My suggestion would be to install the dev branch trainer in a separate folder and see if it works. If it's the same issue, we can try to fix it on the dev branch, since it's more up to date. To clone into a different folder, it would be `git clone -b dev https://github.com/derrian-distro/LoRA_Easy_Training_Scripts /path/to/destination/folder`. Also, that error seems to be caused by a mismatched transformers version, so you'd have to update the scripts anyway; the dev branch should be better.

LiminalCrew commented 3 months ago

Thanks I'll install the dev branch and be back with news. Appreciate your help man!

LiminalCrew commented 3 months ago

I installed it and tried to run it, but first I had to get rid of a few errors that were preventing the install. Basically, install.py throws errors because it checks for `sys.platform == "linux"` or `"win32"` in several places, so I brutally fixed it by replacing `"linux"` with `"darwin"` so it could pass on macOS. Maybe this is useful for you as you evolve the script. When I launched a test training, I got a different error:

```
sh run.sh
HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e179540>: Failed to establish a new connection: [Errno 61] Connection refused'))
```

Not sure what went wrong. During the installation there were no errors, just two questions, which I answered as follows: `Are you using colab? (y/n): n` and `Are you using this locally? (y/n): y`.
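The kind of platform gate described above can be widened rather than patched out. A minimal sketch of the idea (hypothetical helper, not the actual install.py code):

```python
import sys

# Platforms the installer would accept; sys.platform reports
# "linux" on Linux, "win32" on Windows, and "darwin" on macOS.
SUPPORTED_PLATFORMS = {"linux", "win32", "darwin"}

def platform_supported(platform: str = sys.platform) -> bool:
    """Return True when the given platform is in the supported set."""
    return platform in SUPPORTED_PLATFORMS

print(platform_supported())
```

Checking membership in a set avoids scattering `== "linux"` comparisons through the script, so adding a new OS is a one-line change.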

Jelosus2 commented 3 months ago

That's when you use the "Start Training" button, right?

LiminalCrew commented 3 months ago

Yes exactly. The connection refused happens at that time.

Jelosus2 commented 3 months ago

What happens if, in the UI, you set the URL to 0.0.0.0:8000 instead of 127.0.0.1:8000?

LiminalCrew commented 3 months ago

Same thing, apparently; this is the error in the shell:

```
HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e178790>: Failed to establish a new connection: [Errno 61] Connection refused'))
```

Jelosus2 commented 3 months ago

Maybe you have something running on port 8000? Try, for example, 127.0.0.1:8001.

LiminalCrew commented 3 months ago

Even after changing the port, it gives the same error:

```
HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e17b6d0>: Failed to establish a new connection: [Errno 61] Connection refused'))
```

I also used the lsof command to see if there's anything running on that port, but that does not seem to be the case.

LiminalCrew commented 3 months ago

How can I check whether the local server is even running? Actually, I also get an error when closing the script, which sounds like it cannot stop the server because it can't find it:

```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /stop_server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10e1032b0>: Failed to establish a new connection: [Errno 61] Connection refused'))
```

Any thoughts?
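One way to answer that question is to probe the port directly. A small standard-library sketch (the host and port are whatever the UI is configured with; this only tells you whether *something* accepts TCP connections there):

```python
import socket

def port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeout, unreachable, ...
        return False

print(port_listening("127.0.0.1", 8000))
```

If this prints `False`, the backend never started listening, which matches the `[Errno 61] Connection refused` in the logs: nothing is bound to the port, so every request is rejected immediately.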

Jelosus2 commented 3 months ago

Did you have anything open while running the trainer?

LiminalCrew commented 3 months ago

Nothing crazy, just Chrome with some tabs, a text editor and the shell.

Jelosus2 commented 3 months ago

strange tbh

LiminalCrew commented 3 months ago

When I launch with `sh run.sh`, it does not write anything to the terminal, as if nothing is happening. Then I load a previous TOML; again, nothing in the terminal. Immediately after starting training, I get the connection error. Finally, closing the script gives the final error about stopping the server.

This is the full log:

```
(base) xxxxxxxxxx@Emilias-MBP LoRA_Easy_Training_Scripts % sh run.sh
HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /validate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1144c5540>: Failed to establish a new connection: [Errno 61] Connection refused'))
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 400, in request
    self.endheaders()
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 238, in connect
    self.sock = self._new_conn()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x1144c4b80>: Failed to establish a new connection: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /stop_server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1144c4b80>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/main.py", line 69, in <module>
    main()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/main.py", line 65, in main
    requests.get(f"{window.main_widget.backend_url_input.text()}/stop_server")
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/venv/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8001): Max retries exceeded with url: /stop_server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1144c4b80>: Failed to establish a new connection: [Errno 61] Connection refused'))
(base) xxxxxxxxxx@Emilias-MBP LoRA_Easy_Training_Scripts %
```

Jelosus2 commented 3 months ago

Yeah, you don't get anything in the terminal when you launch the UI or load a toml.

Jelosus2 commented 3 months ago

Also, I see you're still using 0.0.0.0:8001; have you tried 127.0.0.1:8001?

Jelosus2 commented 3 months ago

You could also try checking your private IP (it should look like 192.168.x.x) and use it instead of 0.0.0.0 or 127.0.0.1.

LiminalCrew commented 3 months ago

It looks like it does not matter what IP address I put in the UI, because there's no active server listening; I think it did not start. Is there a way to verify whether it's actually running and listening on some port?
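From the shell, one way to check (the `/validate` path comes from the error messages earlier in this thread; substitute whatever port the UI is set to):

```shell
# List any process listening on TCP port 8000
lsof -nP -iTCP:8000 -sTCP:LISTEN

# Or poke the backend directly; "Connection refused" means no server is up
curl -v http://127.0.0.1:8000/validate
```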

LiminalCrew commented 3 months ago

I just noticed that the folder structure is different from the main branch: there is a backend subfolder with its own installer.py and main.py, as well as install scripts. Maybe the install procedure is totally different from the main branch one and I didn't do everything needed. Do you have instructions for it?

Jelosus2 commented 3 months ago

It's the same steps

LiminalCrew commented 3 months ago

Ok, now I launched the installer for the backend and I am stuck at this error:

```
Looking in indexes: https://download.pytorch.org/whl/cu121
ERROR: Could not find a version that satisfies the requirement torch==2.2.1 (from versions: none)
ERROR: No matching distribution found for torch==2.2.1
Traceback (most recent call last):
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/backend/./installer.py", line 198, in <module>
    main()
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/backend/./installer.py", line 189, in main
    setup_venv(pip)
  File "/Users/xxxxxxxxxx/LoRaTraining/LoRA_Easy_Training_Scripts/backend/./installer.py", line 87, in setup_venv
    subprocess.check_call(
  File "/Users/xxxxxxxxxx/miniforge3/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'venv/bin/pip install -U torch==2.2.1 torchvision==0.17.1 --index-url https://download.pytorch.org/whl/cu121' returned non-zero exit status 1.
```
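The failing command pins the CUDA 12.1 wheel index (`--index-url https://download.pytorch.org/whl/cu121`), which publishes no macOS builds, hence "from versions: none". PyTorch ships macOS arm64 (CPU/MPS) wheels for these versions on the default PyPI index, so a manual workaround, untested on this exact setup, could be:

```shell
# Run from the backend folder; macOS arm64 wheels live on the default
# PyPI index, so drop the cu121 --index-url from the failing command
venv/bin/pip install -U torch==2.2.1 torchvision==0.17.1
```

Whether the rest of the trainer then works without CUDA is a separate question, as the discussion below shows.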

Jelosus2 commented 3 months ago

Oh, that's because you don't have CUDA.

LiminalCrew commented 3 months ago

Not sure what to do about that; the previous version's installer ran with no apparent issues. Any instructions to fix it?

Jelosus2 commented 3 months ago

Wait, do you have an Nvidia GPU?

LiminalCrew commented 3 months ago

No, I am on a MacBook Pro with Apple M1 Pro chip which has 10-core CPU and 14-core GPU.

Jelosus2 commented 3 months ago

I think training is only possible on Nvidia GPUs, or at least they don't have problems. How much VRAM do you have?

Jelosus2 commented 3 months ago

At this point, I think your best option would be to use Google Colab or, if you have the money for the pro plan, Paperspace. You can use either the colab of hollowstrawberry or mine.

Jelosus2 commented 1 month ago

Seems to be solved, so I'm gonna close this issue.