-
### Bug description
When training a model with DDP and `pl.callbacks.BackboneFinetuning`, the model weights appear to go out of sync across processes after the backbone is unfrozen. Prio…
-
Hi,
I'm trying to run the training script with Python 3.8.10 and `torch==1.10.2+cu113`, and I obtain the following error:
```shell
>> bash thualign/bin/train.sh -s mask_align -e agree_deen
run…
-
### Discussed in https://github.com/PyTorchLightning/pytorch-lightning/discussions/8363
Originally posted by **MohammedAljahdali** July 10, 2021
Hi, I have a script that does the following log…
-
**Is your feature request related to a problem? Please describe.**
First off, I hope this post doesn't come off as a rant—the reason I'm typing it is because I love the game!
In my personal experien…
-
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /path/to/checkpoint_llava_med \
--data_path /path/to/your_dental_dataset.jso…
-
### Description & Motivation
I would like to change the `Tuner` and `LearningRateFinder` API so that it is possible to use more custom models.
#### Description
Currently, the learning ra…
-
Hello.
I'm trying to figure out (based on the source code of the `train` method of `SetFitTrainer`) if it is possible to perform hyperparameter search on the (first) contrastive learning finetunin…
-
### Bug description
Trying to use TPU in Kaggle and receiving the error "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specif…
-
### Bug description
When using mixed precision with DeepSpeed, the model fails with the error: `RuntimeError: expected scalar type Float but found Half`.
### How to reproduce the bug
```pyth…
-
### Describe the bug
Using the Neptune logger in Lightning, I get multiple instances of the following error:
```
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (s…