SalesforceAIResearch / uni2ts

[ICML2024] Unified Training of Universal Time Series Forecasting Transformers
Apache License 2.0

Unable to use load_from_disk function in pretraining #74

Open ngupta-slb opened 3 months ago

ngupta-slb commented 3 months ago

I am trying to run the pretraining scripts and encountering the following error while loading the datasets from disk.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-17 21:27:11,765][datasets][INFO] - PyTorch version 2.3.1 available.
[2024-06-17 21:27:11,767][datasets][INFO] - JAX version 0.4.29 available.
Error executing job with overrides: ['run_name=first_run', 'model=moirai_small', 'data=lotsa_v1_unweighted']
Traceback (most recent call last):
  File "/naveen/uni2ts/cli/train.py", line 130, in main
    train_dataset: Dataset = instantiate(cfg.data).load_dataset(
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in load_dataset
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in <listcomp>
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 58, in load_dataset
    datasets = [
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 61, in <listcomp>
    load_from_disk(self.storage_path / dataset), uniform=self.uniform
  File "/naveen/uni2ts/venv/lib/python3.10/site-packages/datasets/load.py", line 2663, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /uni2ts/lotsa_data/cmip6_1855 is neither a Dataset directory nor a DatasetDict directory.

Reproduce the error

  1. Downloaded only a small fraction of the data with the following command:

huggingface-cli download Salesforce/lotsa_data cmip6_1855/data-00001-of-00096.arrow cmip6_1850/data-00001-of-00096.arrow --repo-type=dataset --local-dir /naveen/uni2ts/lotsa_data

  2. Modified the yaml file at uni2ts/cli/conf/pretrain/data/lotsa_v1_unweighted.yaml to include only this dataset.

  3. Ran: python3 -m cli.train -cp conf/pretrain run_name=first_run model=moirai_small data=lotsa_v1_unweighted

Python version - 3.10.14

Could you please suggest why it is unable to load the data? The Hugging Face load_from_disk API states that the directory must have been written by save_to_disk, but I do not see save_to_disk being called anywhere before load_from_disk. Please advise how to fix this issue.
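For reference, `load_from_disk` only accepts directories written by `save_to_disk`, which places metadata files next to the arrow shards. A minimal sketch of that expectation, assuming the standard `state.json` / `dataset_info.json` / `dataset_dict.json` layout the `datasets` library writes (the library's own validation is authoritative):

```python
from pathlib import Path


def looks_like_saved_dataset(path: str) -> bool:
    """Heuristic mirror of what datasets.load_from_disk checks for:
    a Dataset directory has state.json and dataset_info.json, while a
    DatasetDict directory has dataset_dict.json."""
    p = Path(path)
    is_dataset = (p / "state.json").is_file() and (p / "dataset_info.json").is_file()
    is_dataset_dict = (p / "dataset_dict.json").is_file()
    return is_dataset or is_dataset_dict
```

Downloading a single shard such as cmip6_1855/data-00001-of-00096.arrow leaves out these metadata files, which would be consistent with the FileNotFoundError above.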

Thank you very much!

ngupta-slb commented 3 months ago

@gorold @liu-jc Could you please look into this issue?

gorold commented 3 months ago

Hey, it might be due to the data being in the wrong directory. The recommended approach is:

huggingface-cli download Salesforce/lotsa_data --repo-type=dataset --local-dir PATH_TO_SAVE

which downloads the data into lotsa_data/cmip6_1855/data-00001-of-00096.arrow and so on. You'll need to make sure the data files are arranged in this manner. Also, I'm not sure whether you can partially load these files; you may want to try it out on a smaller dataset with only a single arrow file.
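As a side note, the traceback looks under /uni2ts/lotsa_data while the download command in the issue wrote to /naveen/uni2ts/lotsa_data. A small sketch of the lookup the builder performs (mirroring the `self.storage_path / dataset` expression in lotsa_v1/_base.py; the paths here are taken from the thread and are illustrative only):

```python
from pathlib import Path


def resolve(storage_path: Path, dataset: str) -> Path:
    # Mirrors the builder's load_from_disk(self.storage_path / dataset) lookup.
    return storage_path / dataset


configured = Path("/uni2ts/lotsa_data")         # path in the traceback
downloaded = Path("/naveen/uni2ts/lotsa_data")  # --local-dir used in the issue

print(resolve(configured, "cmip6_1855"))   # where the builder looked
print(resolve(downloaded, "cmip6_1855"))   # where the shard actually is
```

If the two differ, pointing the configured storage path at the download location (or moving the data) removes the mismatch.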