ngupta-slb opened 3 months ago
@gorold @liu-jc Could you please look into this issue?
Hey, it might be due to the data being in the wrong directory. The recommended approach is:
huggingface-cli download Salesforce/lotsa_data --repo-type=dataset --local-dir PATH_TO_SAVE
which would download the data into lotsa_data/CMIP6_1855/data-00001-of-00096.arrow
and so on; you'll need to make sure the data files are arranged in this manner. Also, I'm not sure whether you can partially load these files, so you may want to try it out on a smaller dataset with only a single arrow file.
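As a quick sanity check for the directory layout, the sketch below (my own helper, not part of uni2ts or the datasets library) looks for the metadata files that `Dataset.save_to_disk` / `DatasetDict.save_to_disk` write, which is what `load_from_disk` expects to find:

```python
from pathlib import Path


def looks_like_saved_dataset(path: str) -> bool:
    """Heuristic check: `datasets.load_from_disk` expects a directory
    containing either `state.json` (a saved Dataset) or
    `dataset_dict.json` (a saved DatasetDict); a directory holding only
    bare .arrow shards will raise the FileNotFoundError seen above."""
    p = Path(path)
    return (p / "state.json").exists() or (p / "dataset_dict.json").exists()
```

Running this on each `lotsa_data/<dataset>` directory before launching training should flag any dataset that was only partially downloaded.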
I am trying to run the pretraining scripts and am encountering the following error while loading the datasets from disk.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-17 21:27:11,765][datasets][INFO] - PyTorch version 2.3.1 available.
[2024-06-17 21:27:11,767][datasets][INFO] - JAX version 0.4.29 available.
Error executing job with overrides: ['run_name=first_run', 'model=moirai_small', 'data=lotsa_v1_unweighted']
Traceback (most recent call last):
  File "/naveen/uni2ts/cli/train.py", line 130, in main
    train_dataset: Dataset = instantiate(cfg.data).load_dataset(
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in load_dataset
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in <listcomp>
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 58, in load_dataset
    datasets = [
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 61, in <listcomp>
    load_from_disk(self.storage_path / dataset), uniform=self.uniform
  File "/naveen/uni2ts/venv/lib/python3.10/site-packages/datasets/load.py", line 2663, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /uni2ts/lotsa_data/cmip6_1855 is neither a `Dataset` directory nor a `DatasetDict` directory.

Reproduce the error
huggingface-cli download Salesforce/lotsa_data cmip6_1855/data-00001-of-00096.arrow cmip6_1850/data-00001-of-00096.arrow --repo-type=dataset --local-dir /naveen/uni2ts/lotsa_data
Modified the yaml file at uni2ts/cli/conf/pretrain/data/lotsa_v1_unweighted.yaml to include only this dataset.
python3 -m cli.train -cp conf/pretrain run_name=first_run model=moirai_small data=lotsa_v1_unweighted
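Note that the download command above fetches only a single shard per dataset. The sketch below (a hypothetical helper, not part of uni2ts) lists which shard indices are missing from a dataset directory, based on the `data-XXXXX-of-NNNNN.arrow` naming seen in the traceback and assuming 0-based shard numbering:

```python
import re
from pathlib import Path


def missing_shards(dataset_dir: str) -> set[int]:
    """Return the shard indices absent from a dataset directory.
    Shard files are assumed to follow data-XXXXX-of-NNNNN.arrow,
    with indices starting at 0."""
    pat = re.compile(r"data-(\d+)-of-(\d+)\.arrow$")
    found: set[int] = set()
    total = 0
    for f in Path(dataset_dir).glob("data-*.arrow"):
        m = pat.match(f.name)
        if m:
            found.add(int(m.group(1)))
            total = int(m.group(2))
    return set(range(total)) - found
```

For a directory holding only data-00001-of-00096.arrow, this would report 95 missing shards, which is consistent with the loading failure.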
Python version - 3.10.14
Could you please suggest why it is unable to load the data? Looking at the Hugging Face `load_from_disk` API, it states that the data must first be written with `save_to_disk`; however, I do not see `save_to_disk` being called before `load_from_disk`. Please advise how to fix this issue.
Thank you very much!