Closed by leannmlindsey 10 months ago
I am running pretraining on 2 NVIDIA A100 machines with 80 GB of memory.
Hard to tell. Is this the hg38 FASTA file? Which PyTorch / Lightning versions are you using?
I was following the instructions here: https://github.com/HazyResearch/hyena-dna#pretraining-on-human-reference-genome
Yes, it is the hg38 FASTA file.
PyTorch libraries loaded in the conda env:
pytorch 1.13.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_5 pytorch
pytorch-lightning 1.8.6 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
Lightning libraries loaded in the conda env:
lightning-utilities 0.10.0 pypi_0 pypi
Full Conda Env
(p100_hyena-dna) [u1323098@kp360:~]$ conda list
#
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
accelerate 0.24.1 pypi_0 pypi
aiohttp 3.9.0 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
antlr4-python3-runtime 4.9.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
biopython 1.81 pypi_0 pypi
blas 1.0 mkl
brotli-python 1.1.0 py38h17151c0_1 conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
certifi 2023.11.17 pyhd8ed1ab_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
click 8.1.7 pypi_0 pypi
cmake 3.27.7 pypi_0 pypi
contourpy 1.1.1 pypi_0 pypi
cuda-cudart 11.7.99 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cycler 0.12.1 pypi_0 pypi
datasets 2.15.0 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.13.1 pypi_0 pypi
flash-attn 1.0.7 pypi_0 pypi
fonttools 4.45.1 pypi_0 pypi
freetype 2.12.1 h267a509_2 conda-forge
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.10.0 pypi_0 pypi
gdown 4.7.1 pypi_0 pypi
genomic-benchmarks 0.0.9 pypi_0 pypi
git-lfs 1.6 pypi_0 pypi
gitdb 4.0.11 pypi_0 pypi
gitpython 3.1.40 pypi_0 pypi
gmp 6.3.0 h59595ed_0 conda-forge
gnutls 3.6.13 h85f3911_1 conda-forge
huggingface-hub 0.19.4 pypi_0 pypi
hydra-core 1.3.2 pypi_0 pypi
idna 3.5 pyhd8ed1ab_0 conda-forge
importlib-metadata 6.8.0 pypi_0 pypi
importlib-resources 6.1.1 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
joblib 1.3.2 pypi_0 pypi
jpeg 9e h0b41bf4_3 conda-forge
kiwisolver 1.4.5 pypi_0 pypi
lame 3.100 h166bdaf_1003 conda-forge
lcms2 2.15 hfd0df8a_0 conda-forge
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libcublas 11.10.3.66 0 nvidia
libcufft 10.7.2.124 h4fbf590_0 nvidia
libcufile 1.8.1.2 0 nvidia
libcurand 10.3.4.101 0 nvidia
libcusolver 11.4.0.1 0 nvidia
libcusparse 11.7.4.91 0 nvidia
libdeflate 1.17 h0b41bf4_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_3 conda-forge
libgomp 13.2.0 h807b86a_3 conda-forge
libiconv 1.17 h166bdaf_0 conda-forge
libnpp 11.7.4.75 0 nvidia
libnsl 2.0.1 hd590300_0 conda-forge
libnvjpeg 11.8.0.2 0 nvidia
libpng 1.6.39 h753d276_0 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_3 conda-forge
libtiff 4.5.0 h6adf6a1_2 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp-base 1.3.2 hd590300_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
liftover 1.1.17 pypi_0 pypi
lightning-utilities 0.10.0 pypi_0 pypi
loguru 0.7.2 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
matplotlib 3.7.4 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h95df7f1_0 conda-forge
mkl_fft 1.3.1 py38h8666266_1 conda-forge
mkl_random 1.2.2 py38h1abd341_0 conda-forge
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
nettle 3.6 he412f7d_0 conda-forge
ninja 1.11.1.1 pypi_0 pypi
numerize 0.12 pypi_0 pypi
numpy 1.24.3 py38h14f4228_0
numpy-base 1.24.3 py38h31eccc5_0
omegaconf 2.3.0 pypi_0 pypi
openh264 2.1.1 h780b84a_0 conda-forge
openjpeg 2.5.0 hfec8fc6_2 conda-forge
openssl 3.2.0 hd590300_0 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.0.3 pypi_0 pypi
pillow 9.4.0 py38hde6dc18_1 conda-forge
pip 23.3.1 pyhd8ed1ab_0 conda-forge
polars 0.19.15 pypi_0 pypi
prettytable 3.9.0 pypi_0 pypi
protobuf 4.25.1 pypi_0 pypi
psutil 5.9.6 pypi_0 pypi
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
pyarrow 14.0.1 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pyfaidx 0.7.2.2 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.8.18 hd12c33a_0_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.8 4_cp38 conda-forge
pytorch 1.13.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_5 pytorch
pytorch-lightning 1.8.6 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3.post1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
regex 2023.10.3 pypi_0 pypi
requests 2.31.0 pyhd8ed1ab_0 conda-forge
rich 13.7.0 pypi_0 pypi
safetensors 0.4.0 pypi_0 pypi
scikit-learn 1.3.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
sentry-sdk 1.37.1 pypi_0 pypi
setproctitle 1.3.3 pypi_0 pypi
setuptools 68.2.2 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
smmap 5.0.1 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
tensorboardx 2.6.2.2 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
timm 0.9.12 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
tokenizers 0.13.3 pypi_0 pypi
torchaudio 0.13.0 py38_cu117 pytorch
torchmetrics 1.2.0 pypi_0 pypi
torchtext 0.14.0 pypi_0 pypi
torchvision 0.14.0 py38_cu117 pytorch
tqdm 4.66.1 pypi_0 pypi
transformers 4.26.1 pypi_0 pypi
typing_extensions 4.8.0 pyha770c72_0 conda-forge
tzdata 2023.3 pypi_0 pypi
urllib3 2.1.0 pyhd8ed1ab_0 conda-forge
wandb 0.16.0 pypi_0 pypi
wcwidth 0.2.12 pypi_0 pypi
wheel 0.41.3 pyhd8ed1ab_0 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xxhash 3.4.1 pypi_0 pypi
xz 5.2.6 h166bdaf_0 conda-forge
yarl 1.9.3 pypi_0 pypi
zipp 3.17.0 pypi_0 pypi
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hfc55251_0 conda-forge
I am not sure. I would try the Docker image from the README and perhaps reverse engineer the working setup from it.
It turned out that I had originally installed everything to run on a P100 machine and ran all of the fine-tuning there (since it is more available on my CHPC system). I then assumed the same installation could be used on the A100, and that is where I got the error reported in this issue.
When I ran the pretraining on the P100, it ran with no problem.
It seems that something in the installation process is architecture-specific?
I am now installing a new copy on the A100, and hopefully I won't have any trouble getting both to run on that system.
I was able to run both fine-tuning and pretraining on the A100 with a clean install.
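For anyone who wants to check the architecture question before reinstalling, here is a minimal sketch. It assumes the environment's PyTorch build is the thing to inspect; the helper names `arch_supported` and `describe_gpu` are made up for illustration. It compares the GPU's compute capability (6.0 on a P100, 8.0 on an A100) with the architectures the installed PyTorch binary was compiled for. It does not inspect separately compiled extensions such as flash-attn, and an exact `sm_XY` match is a simplification, so treat the result as a hint rather than a diagnosis.

```python
def arch_supported(capability, compiled_archs):
    """Return True if the list contains an sm_XY entry matching the capability tuple."""
    major, minor = capability
    return f"sm_{major}{minor}" in compiled_archs


def describe_gpu():
    """Summarize device 0 and whether this PyTorch build lists its architecture."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "torch is not importable in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available to this PyTorch build"
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 0) on an A100
    archs = torch.cuda.get_arch_list()                # e.g. ['sm_60', 'sm_70', ...]
    verdict = "listed" if arch_supported(capability, archs) else "NOT listed"
    return f"{name}: capability {capability} is {verdict} in {archs}"


if __name__ == "__main__":
    print(describe_gpu())
```

Running this once on the P100 node and once on the A100 node, inside the same conda env, shows whether the two machines see the same build and whether that build lists both architectures.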
Hello, I was trying to follow your directions on pretraining on the human genome (as a test before I try to pretrain on my own data) and I keep getting this error:
RuntimeError: Trying to resize storage that is not resizable
The first time it happened after training epoch 40 and the second time after training epoch 60. Do you know what the error could be?
Thanks for any help. I do not seem to have any problems with fine-tuning.
Thanks, LeAnn
Epoch 60: 95%|▉| 135/142 [00:15<00:00, 8.94it/s, loss=1.17, val/loss=1.170, val/num_tokens=1.37e+8, val/perplexity=3.230, test/loss=1.170, test/num_tokens=1.2e+8, test/perplex
Error executing job with overrides: ['wandb=null', 'experiment=hg38/hg38_hyena', 'model.d_model=128', 'model.n_layer=2', 'dataset.batch_size=256', 'train.global_batch_size=256', 'dataset.max_length=1024', 'optimizer.lr=6e-4', 'trainer.devices=1']
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 691, in main
    train(config)
  File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 672, in train
    trainer.fit(model)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
    out = elem.new(storage).resize_(len(batch), *list(elem.size()))
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
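A note for readers who land here with the same message: the last frame of the traceback is PyTorch's default collate step. Inside a DataLoader worker it preallocates one shared buffer sized from the samples and then resizes it to (batch size, sample shape). As far as I know, that resize can fail with this error when the samples in a batch do not all have the same shape. Nothing in this thread establishes that as the cause here (a clean install on the A100 resolved it), so the following is only a sketch of a sanity check. The helper names `shape_counts` and `is_stackable` are hypothetical, and the stand-in objects merely imitate tensors by exposing a `.shape` attribute.

```python
from collections import Counter
from types import SimpleNamespace


def shape_counts(samples):
    """Count how many samples in one batch have each shape."""
    return Counter(tuple(s.shape) for s in samples)


def is_stackable(samples):
    """True when every sample has the same shape (an empty batch counts as uniform)."""
    return len(shape_counts(samples)) <= 1


# Stand-ins for tensors: any object with a .shape attribute works here.
good = [SimpleNamespace(shape=(1024,)) for _ in range(4)]
bad = good + [SimpleNamespace(shape=(1000,))]

print(is_stackable(good))  # True
print(is_stackable(bad))   # False
print(shape_counts(bad))   # shows which shapes occur and how often
```

One possible use is to call `shape_counts` on the list of samples at the top of a custom `collate_fn`, so that a mismatched batch is reported with its shapes instead of surfacing as a storage error.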