-
Hey Team,
I'm trying to use FSDP1/2 with `Float8InferenceLinear`, but it seems to have some issues (with torch 2.3.1+cu118). Would you suggest bumping to a newer version of torch and trying again, or maybe use …
-
# Environment
```bash
OS: Ubuntu 18.04.6 LTS
```
# Problem description
I am using the Python method from the FunASR documentation for exporting ONNX models to try to export the paraformer-zh-streaming pretrained model to ONNX, but I keep getting errors!
```bash
(funasr_env) lipeng@lipeng:~/share/modules$ vim export_ON…
-
## ❓ Questions and Help
In PyTorch we can use FSDP meta init to shard and restore my big model (e.g., one with 80B parameters). In torch_xla I can only find sharded saving, e.g., using this: https://github.com/pytorch/xla/bl…
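For context, the PyTorch-native pattern being referred to looks roughly like the sketch below: build the model on the meta device (no parameter memory is allocated), then materialize and initialize storage afterwards, which is what FSDP hooks into via its `param_init_fn` argument. This is only a minimal CPU illustration of the standard `torch.distributed.fsdp` API, not a torch_xla solution; the `init_fn` helper name is made up for the example.

```python
import torch
import torch.nn as nn

# Step 1: construct the (potentially huge) model on the meta device.
# No parameter memory is allocated; only shapes and dtypes are recorded,
# so even an 80B-parameter model "fits" at this stage.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))
assert all(p.is_meta for p in model.parameters())

# Step 2: allocate real (uninitialized) storage, then run each layer's
# default init. FSDP does the equivalent per wrapped module through its
# `param_init_fn` argument, so each rank only allocates its own shard.
model.to_empty(device="cpu")
for m in model.modules():
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()
assert not any(p.is_meta for p in model.parameters())

# Sketch of the distributed wrapping (requires an initialized process
# group and a CUDA device, so it is left as comments here):
# from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
# def init_fn(module):  # hypothetical helper name
#     module.to_empty(device=torch.cuda.current_device(), recurse=False)
#     if hasattr(module, "reset_parameters"):
#         module.reset_parameters()
# fsdp_model = FSDP(model, param_init_fn=init_fn)
```

With this pattern, a sharded checkpoint can then be loaded into the materialized shards rank-by-rank instead of ever building the full model in one process's memory.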
-
Traceback (most recent call last):
File "/opt/conda/envs/alpa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/alpa…
-
Using the latest main to train a YoloV9e object detector:
```
[rank0]: train_one_epoch(train_loader, model, args, model_dtype)
[rank0]: File "/mnt/dingus_drive/catid/train_detector/train.py…
-
## 📚 Documentation
In the blog introducing [FSDP API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/)
```python
fsdp_model = FullyShardedDataParallel(
model()…
-
I deleted the line `model.prepare_for_distributed_training()` in dinov2/train/train.py,
and my loss becomes NaN after training for only 1 iteration.
I don't know why; I just changed an o…
-
Great work! But I've noticed that the current implementation seems to only support single-GPU training. Is that correct? If so, do you have any plans to extend support for multi-GPU training in the fu…
-
Hi there,
Thanks for the scripts and posts! I am interested in fine-tuning Mixtral 8x7b on SageMaker. My task requires a context length of around 8k tokens.
I have tried running training following th…
-
On both our V100 (Intel Cascade Lake) and A100 (AMD Milan) systems (both RHEL 8.4 currently), I'm seeing too many test failures for `PyTorch/1.12.0-foss-2022a-CUDA-11.7.0`.
On both systems, I get `…