SalesforceAIResearch / uni2ts

[ICML2024] Unified Training of Universal Time Series Forecasting Transformers
Apache License 2.0

Unable to use load_from_disk function in pretraining #74

Open ngupta-slb opened 3 months ago

ngupta-slb commented 3 months ago

I am trying to run the pretraining scripts and encountering the following error while loading the datasets from disk.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-17 21:27:11,765][datasets][INFO] - PyTorch version 2.3.1 available.
[2024-06-17 21:27:11,767][datasets][INFO] - JAX version 0.4.29 available.
Error executing job with overrides: ['run_name=first_run', 'model=moirai_small', 'data=lotsa_v1_unweighted']
Traceback (most recent call last):
  File "/naveen/uni2ts/cli/train.py", line 130, in main
    train_dataset: Dataset = instantiate(cfg.data).load_dataset(
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in load_dataset
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in <listcomp>
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 58, in load_dataset
    datasets = [
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 61, in <listcomp>
    load_from_disk(self.storage_path / dataset), uniform=self.uniform
  File "/naveen/uni2ts/venv/lib/python3.10/site-packages/datasets/load.py", line 2663, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /uni2ts/lotsa_data/cmip6_1855 is neither a Dataset directory nor a DatasetDict directory.

Reproduce the error

  1. Downloaded only a small fraction of the data with the following command:

huggingface-cli download Salesforce/lotsa_data cmip6_1855/data-00001-of-00096.arrow cmip6_1850/data-00001-of-00096.arrow --repo-type=dataset --local-dir /naveen/uni2ts/lotsa_data

  2. Modified the yaml file at uni2ts/cli/conf/pretrain/data/lotsa_v1_unweighted.yaml to include only this dataset.

  3. Ran: python3 -m cli.train -cp conf/pretrain run_name=first_run model=moirai_small data=lotsa_v1_unweighted

Python version - 3.10.14

Could you please suggest why it is unable to load the data? The Hugging Face load_from_disk API states that the directory must have been written by save_to_disk, but I do not see save_to_disk being called anywhere before load_from_disk. Please advise how to fix this issue.
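For reference, `load_from_disk` only accepts directories written by `save_to_disk`, which places metadata files next to the arrow shards. A minimal sketch of that expectation, assuming the standard `state.json` / `dataset_info.json` / `dataset_dict.json` layout the `datasets` library writes (the library's own validation is authoritative):

```python
from pathlib import Path


def looks_like_saved_dataset(path: str) -> bool:
    """Heuristic mirror of what datasets.load_from_disk checks for:
    a Dataset directory has state.json and dataset_info.json, while a
    DatasetDict directory has dataset_dict.json."""
    p = Path(path)
    is_dataset = (p / "state.json").is_file() and (p / "dataset_info.json").is_file()
    is_dataset_dict = (p / "dataset_dict.json").is_file()
    return is_dataset or is_dataset_dict
```

Downloading a single shard such as cmip6_1855/data-00001-of-00096.arrow leaves out these metadata files, which would be consistent with the FileNotFoundError above.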

Thank you very much!

ngupta-slb commented 3 months ago

@gorold @liu-jc Could you please look into this issue?

gorold commented 3 months ago

Hey, it might be due to the data being in the wrong directory. The recommended approach is:

huggingface-cli download Salesforce/lotsa_data --repo-type=dataset --local-dir PATH_TO_SAVE

which downloads the data into lotsa_data/cmip6_1855/data-00001-of-00096.arrow and so on. You'll need to make sure the data files are arranged in this manner. Also, I'm not sure whether you can partially load these files; you may want to try it out on a smaller dataset with only a single arrow file.
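As a side note, the traceback looks under /uni2ts/lotsa_data while the download command in the issue wrote to /naveen/uni2ts/lotsa_data. A small sketch of the lookup the builder performs (mirroring the `self.storage_path / dataset` expression in lotsa_v1/_base.py; the paths here are taken from the thread and are illustrative only):

```python
from pathlib import Path


def resolve(storage_path: Path, dataset: str) -> Path:
    # Mirrors the builder's load_from_disk(self.storage_path / dataset) lookup.
    return storage_path / dataset


configured = Path("/uni2ts/lotsa_data")         # path in the traceback
downloaded = Path("/naveen/uni2ts/lotsa_data")  # --local-dir used in the issue

print(resolve(configured, "cmip6_1855"))   # where the builder looked
print(resolve(downloaded, "cmip6_1855"))   # where the shard actually is
```

If the two differ, pointing the configured storage path at the download location (or moving the data) removes the mismatch.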