HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Error in Pretraining on Human Genome #34

Closed leannmlindsey closed 10 months ago

leannmlindsey commented 10 months ago

Hello, I was trying to follow your directions on pretraining on the human genome (as a test before I try to pretrain on my own data) and I keep getting this error:

RuntimeError: Trying to resize storage that is not resizable

The first time it happened after training epoch 40 and the second time after training epoch 60. Do you know what the error could be?

Thanks for any help. I do not seem to have any problems with fine-tuning.

Thanks, LeAnn

Epoch 60: 95%|▉| 135/142 [00:15<00:00, 8.94it/s, loss=1.17, val/loss=1.170, val/num_tokens=1.37e+8, val/perplexity=3.230, test/loss=1.170, test/num_tokens=1.2e+8, test/perplex
Error executing job with overrides: ['wandb=null', 'experiment=hg38/hg38_hyena', 'model.d_model=128', 'model.n_layer=2', 'dataset.batch_size=256', 'train.global_batch_size=256', 'dataset.max_length=1024', 'optimizer.lr=6e-4', 'trainer.devices=1']
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 691, in main
    train(config)
  File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 672, in train
    trainer.fit(model)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 162, in collate_tensor_fn
    out = elem.new(storage).resize_(len(batch), *list(elem.size()))

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
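For anyone reproducing this, the sketch below shows one way to rerun the same job with `HYDRA_FULL_ERROR=1` set, as the message above suggests. The override list is copied from the log. The assumptions are that the job is launched as `train.py` from the repository root (the traceback only shows that `train.py` is the entry point) and the helper name `build_launch`, which is hypothetical and not part of hyena-dna.

```python
import os
import sys

# Overrides copied verbatim from the "Error executing job with overrides" line.
OVERRIDES = [
    "wandb=null",
    "experiment=hg38/hg38_hyena",
    "model.d_model=128",
    "model.n_layer=2",
    "dataset.batch_size=256",
    "train.global_batch_size=256",
    "dataset.max_length=1024",
    "optimizer.lr=6e-4",
    "trainer.devices=1",
]


def build_launch(overrides, base_env=None):
    """Return (command, environment) for a rerun with full Hydra stack traces.

    Hypothetical helper: assumes the job is started as `python train.py ...`
    from the repository root.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HYDRA_FULL_ERROR"] = "1"  # ask Hydra for the complete stack trace
    cmd = [sys.executable, "train.py", *overrides]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch(OVERRIDES)
    # Only print the command here; pass cmd and env to subprocess.run to execute it.
    print("HYDRA_FULL_ERROR=1", " ".join(cmd))
```

Passing the returned `cmd` and `env` to `subprocess.run(cmd, env=env)` would start the job with the variable set for that process only.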

leannmlindsey commented 10 months ago

I am running pretraining on 2 Nvidia A100 machines with 80GB memory.

exnx commented 10 months ago

Hard to tell. Is this the hg38 fasta file? Which PyTorch / Lightning versions are you using?
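A quick way to answer the version question without pasting a whole environment is to query the installed distributions directly. This is a minimal sketch using only the standard library. The package list is an assumption about what is relevant here, and note that PyTorch's pip distribution name is `torch`, whereas conda lists it as `pytorch`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages of interest for this issue; adjust as needed.
    names = ["torch", "pytorch-lightning", "lightning-utilities", "hydra-core", "flash-attn"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```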

leannmlindsey commented 10 months ago

I was following the instructions from here https://github.com/HazyResearch/hyena-dna#pretraining-on-human-reference-genome

Yes, it is the hg38 fasta file

pytorch libraries loaded in the conda env

pytorch 1.13.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_5 pytorch
pytorch-lightning 1.8.6 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch

lightning libraries loaded in the conda env

lightning-utilities 0.10.0 pypi_0 pypi

Full Conda Env

(p100_hyena-dna) [u1323098@kp360:~]$ conda list

# packages in environment at /uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
accelerate 0.24.1 pypi_0 pypi
aiohttp 3.9.0 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
antlr4-python3-runtime 4.9.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
biopython 1.81 pypi_0 pypi
blas 1.0 mkl
brotli-python 1.1.0 py38h17151c0_1 conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
certifi 2023.11.17 pyhd8ed1ab_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
click 8.1.7 pypi_0 pypi
cmake 3.27.7 pypi_0 pypi
contourpy 1.1.1 pypi_0 pypi
cuda-cudart 11.7.99 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cycler 0.12.1 pypi_0 pypi
datasets 2.15.0 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.13.1 pypi_0 pypi
flash-attn 1.0.7 pypi_0 pypi
fonttools 4.45.1 pypi_0 pypi
freetype 2.12.1 h267a509_2 conda-forge
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.10.0 pypi_0 pypi
gdown 4.7.1 pypi_0 pypi
genomic-benchmarks 0.0.9 pypi_0 pypi
git-lfs 1.6 pypi_0 pypi
gitdb 4.0.11 pypi_0 pypi
gitpython 3.1.40 pypi_0 pypi
gmp 6.3.0 h59595ed_0 conda-forge
gnutls 3.6.13 h85f3911_1 conda-forge
huggingface-hub 0.19.4 pypi_0 pypi
hydra-core 1.3.2 pypi_0 pypi
idna 3.5 pyhd8ed1ab_0 conda-forge
importlib-metadata 6.8.0 pypi_0 pypi
importlib-resources 6.1.1 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
joblib 1.3.2 pypi_0 pypi
jpeg 9e h0b41bf4_3 conda-forge
kiwisolver 1.4.5 pypi_0 pypi
lame 3.100 h166bdaf_1003 conda-forge
lcms2 2.15 hfd0df8a_0 conda-forge
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libcublas 11.10.3.66 0 nvidia
libcufft 10.7.2.124 h4fbf590_0 nvidia
libcufile 1.8.1.2 0 nvidia
libcurand 10.3.4.101 0 nvidia
libcusolver 11.4.0.1 0 nvidia
libcusparse 11.7.4.91 0 nvidia
libdeflate 1.17 h0b41bf4_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_3 conda-forge
libgomp 13.2.0 h807b86a_3 conda-forge
libiconv 1.17 h166bdaf_0 conda-forge
libnpp 11.7.4.75 0 nvidia
libnsl 2.0.1 hd590300_0 conda-forge
libnvjpeg 11.8.0.2 0 nvidia
libpng 1.6.39 h753d276_0 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_3 conda-forge
libtiff 4.5.0 h6adf6a1_2 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp-base 1.3.2 hd590300_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
liftover 1.1.17 pypi_0 pypi
lightning-utilities 0.10.0 pypi_0 pypi
loguru 0.7.2 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
matplotlib 3.7.4 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h95df7f1_0 conda-forge
mkl_fft 1.3.1 py38h8666266_1 conda-forge
mkl_random 1.2.2 py38h1abd341_0 conda-forge
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
nettle 3.6 he412f7d_0 conda-forge
ninja 1.11.1.1 pypi_0 pypi
numerize 0.12 pypi_0 pypi
numpy 1.24.3 py38h14f4228_0
numpy-base 1.24.3 py38h31eccc5_0
omegaconf 2.3.0 pypi_0 pypi
openh264 2.1.1 h780b84a_0 conda-forge
openjpeg 2.5.0 hfec8fc6_2 conda-forge
openssl 3.2.0 hd590300_0 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.0.3 pypi_0 pypi
pillow 9.4.0 py38hde6dc18_1 conda-forge
pip 23.3.1 pyhd8ed1ab_0 conda-forge
polars 0.19.15 pypi_0 pypi
prettytable 3.9.0 pypi_0 pypi
protobuf 4.25.1 pypi_0 pypi
psutil 5.9.6 pypi_0 pypi
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
pyarrow 14.0.1 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pyfaidx 0.7.2.2 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.8.18 hd12c33a_0_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.8 4_cp38 conda-forge
pytorch 1.13.0 py3.8_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_5 pytorch
pytorch-lightning 1.8.6 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3.post1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
regex 2023.10.3 pypi_0 pypi
requests 2.31.0 pyhd8ed1ab_0 conda-forge
rich 13.7.0 pypi_0 pypi
safetensors 0.4.0 pypi_0 pypi
scikit-learn 1.3.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
sentry-sdk 1.37.1 pypi_0 pypi
setproctitle 1.3.3 pypi_0 pypi
setuptools 68.2.2 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
smmap 5.0.1 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
tensorboardx 2.6.2.2 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
timm 0.9.12 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
tokenizers 0.13.3 pypi_0 pypi
torchaudio 0.13.0 py38_cu117 pytorch
torchmetrics 1.2.0 pypi_0 pypi
torchtext 0.14.0 pypi_0 pypi
torchvision 0.14.0 py38_cu117 pytorch
tqdm 4.66.1 pypi_0 pypi
transformers 4.26.1 pypi_0 pypi
typing_extensions 4.8.0 pyha770c72_0 conda-forge
tzdata 2023.3 pypi_0 pypi
urllib3 2.1.0 pyhd8ed1ab_0 conda-forge
wandb 0.16.0 pypi_0 pypi
wcwidth 0.2.12 pypi_0 pypi
wheel 0.41.3 pyhd8ed1ab_0 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xxhash 3.4.1 pypi_0 pypi
xz 5.2.6 h166bdaf_0 conda-forge
yarl 1.9.3 pypi_0 pypi
zipp 3.17.0 pypi_0 pypi
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hfc55251_0 conda-forge

exnx commented 10 months ago

I am not sure. I would try the Docker image in the README and perhaps reverse-engineer the environment from there.

leannmlindsey commented 10 months ago

It turned out that the problem was my installation. I had originally installed the code to run on a P100 machine and ran all of the fine-tuning on that machine (since it is more available on my CHPC system). I then assumed the same installation could be used on the A100, but that produced the error described above.

When I ran the pretraining on the P100, it ran with no problem.

It seems that something in the installation process is architecture specific?

I am now installing a new copy on the A100 and hopefully I won't have any trouble getting both to run using that system.
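This thread does not establish why the P100 installation failed on the A100, so the following is only a sketch of one check that might help when moving an environment between GPU types: comparing the GPU's compute capability (6.0 for a P100, 8.0 for an A100) against the architectures the installed PyTorch build was compiled for. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are real PyTorch calls; the helper `arch_supported` is hypothetical. The check covers only PyTorch itself, not separately compiled extensions in the environment, which may have been built for specific architectures at install time.

```python
def arch_supported(capability, arch_list):
    """Return True if a build targeting arch_list can run on the given GPU.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_60" or "compute_80".

    Rules applied: an "sm_XY" binary runs on devices with the same major
    version and minor >= Y; "compute_XY" PTX can be JIT-compiled for any
    device with capability >= X.Y.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip entries this sketch does not understand
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(f"device capability: {cap}, build architectures: {archs}")
        print("supported:", arch_supported(cap, archs))
    else:
        print("CUDA-enabled PyTorch not available; nothing to check")
```

Running this on both the P100 and A100 nodes before reusing an environment would at least show whether the PyTorch build itself targets both GPUs. A clean install per GPU type, as done here, sidesteps the question.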

leannmlindsey commented 10 months ago

I was able to run both fine-tuning and pretraining on the A100 with a clean install.