-
### Current Behavior
> Originally opened as an issue in the PL repo https://github.com/Lightning-AI/lightning/issues/18251
### Bug description
This morning I woke up to a very weird result.
…
-
### Issue type
Need help
### Summary
It is ambiguous what task some of the functions in `/cellbox/train.py` perform. Understanding them is crucial to reproducing similar results for PyTorch …
-
I checked the Issues but I haven't found anyone else posting this error so I'm not sure if it's related to my environment, something I am doing wrong, or a bug in the actual library/toolset.
I crea…
-
Hi:
I want to get the training log, training loss plot, validation loss plot, and learning rate plot using the WandB callback, but it seems that I can only get PART of the training log.
I just followed the tutorial code:
```
…
```
-
When running `train_net.py` on an Amazon g2.2xlarge instance, I get an out-of-memory error.
Stats:
Limit: 3868721152
InUse: 3824706816
MaxInUse: 3825321984
N…
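
For a rough sense of scale, the byte counts reported above can be converted to MiB. This is only a sketch of the arithmetic: the numbers are copied from the stats, and the variable names are illustrative, not part of any allocator API.

```python
# Byte counts copied from the allocator stats in the report above.
limit_bytes = 3868721152
in_use_bytes = 3824706816
max_in_use_bytes = 3825321984

MIB = 1024 * 1024  # bytes per mebibyte

# Headroom left under the limit, and the share of the limit already in use.
free_bytes = limit_bytes - in_use_bytes
fraction_in_use = in_use_bytes / limit_bytes

print(f"limit:      {limit_bytes / MIB:.1f} MiB")
print(f"in use:     {in_use_bytes / MIB:.1f} MiB ({fraction_in_use:.2%})")
print(f"max in use: {max_in_use_bytes / MIB:.1f} MiB")
print(f"free:       {free_bytes / MIB:.1f} MiB")
```

With these numbers, roughly 42 MiB remain under the limit, so a single further allocation larger than that would not fit.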
-
I am trying to run this code in distributed TensorFlow mode and have modified the code accordingly (i.e. using `MonitoredTrainingSession` and so on). But trying to use the monitored training session doesn't…
-
Hi,
Have you ever tried multi-GPU training? I simply added `DataParallel`, but the AP and AR are lower than when training with a single GPU.
Thanks!
-
Hi,
I'm trying to train a base model, but it doesn't seem to work with the GPU. It is very slow and there is no output (with `verbose=True`).
Any ideas?
Thanks
-
Hello, Jerry Sun. Thank you for sharing your good implementation of DDP training for CrossPoint.
While running the training, I ran into this issue:
work = default_pg.allgather([tensor_li…
-
### 🐛 Describe the bug
I used the Hugging Face training code.
I found that during the backward pass of training with FSDP, the AllGather kernel doesn't overlap with the CatArrayBatchedCopy kernel. I don't know why.
s…