DeepWok / mase

Machine-Learning Accelerator System Exploration Tools

Open file issue caused by incomplete dataset installation #25

Closed Yanzhou-Jin closed 5 months ago

Yanzhou-Jin commented 5 months ago

The terminal was accidentally killed by an external interrupt while running: ./ch train jsc-tiny jsc --max-epochs 10 --batch-size 256

After that, re-running the command fails with the following error:

```
INFO     Initialising model 'jsc-tiny'...
INFO     Initialising dataset 'jsc'...
INFO     Project will be created at /home/super_monkey/mase/mase_output/jsc-tiny_classification_jsc_2024-01-27
INFO     Training model 'jsc-tiny'...
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Traceback (most recent call last):
  File "/home/super_monkey/mase/machop/./ch", line 6, in <module>
    ChopCLI().run()
  File "/home/super_monkey/mase/machop/chop/cli.py", line 245, in run
    self._run_train()
  File "/home/super_monkey/mase/machop/chop/cli.py", line 291, in _run_train
    train(train_params)
  File "/home/super_monkey/mase/machop/chop/actions/train.py", line 109, in train
    trainer.fit(
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in _run
    self._data_connector.prepare_data()
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 94, in prepare_data
    call._call_lightning_datamodule_hook(trainer, "prepare_data")
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/super_monkey/mase/machop/chop/dataset/__init__.py", line 191, in prepare_data
    train_dataset.prepare_data()
  File "/home/super_monkey/mase/machop/chop/dataset/physical/jsc.py", line 167, in prepare_data
    _preprocess_jsc_dataset(self.h5py_file_path, self.config)
  File "/home/super_monkey/mase/machop/chop/dataset/physical/jsc.py", line 82, in _preprocess_jsc_dataset
    with h5py.File(path, "r") as h5py_file:
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/super_monkey/anaconda3/envs/mase/lib/python3.10/site-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
```

This can be fixed manually by removing the cached file inside the hidden folder './.machop_cache/dataset'; the programme itself never reports that the download is incomplete. It might be a good idea to check the integrity of the dataset file before opening it, for example as sketched below.
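A minimal sketch of such a check, assuming a helper called right before the existing h5py.File open in _preprocess_jsc_dataset (the helper name and call site are illustrative, not existing MASE code):

```python
import os

import h5py


def _ensure_readable_h5(path: str) -> bool:
    """Return True if the cached HDF5 file at `path` opens cleanly.

    Hypothetical helper, not existing MASE code: if the file is missing or
    corrupted (e.g. a partial download), it is removed so the caller can
    trigger a fresh download.
    """
    if not os.path.isfile(path):
        return False
    try:
        # h5py raises OSError when the file is truncated or corrupted.
        with h5py.File(path, "r"):
            return True
    except OSError:
        os.remove(path)  # drop the partial file so the next run re-fetches it
        return False
```

If the check fails, prepare_data could simply re-download the file instead of crashing with the opaque h5f.open error shown above.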

Aaron-Zhao123 commented 5 months ago

I think it is fairly apparent that when h5py throws an error like this, the corresponding file is likely corrupted. I don't think we should add integrity checks for all datasets; we now support a great number of them.
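A lightweight middle ground, sketched below on the assumption that the cached file is opened in a single place (the wrapper name is hypothetical, not existing MASE code), would be to catch the OSError where the file is opened and point the user at the cache folder, rather than adding per-dataset integrity checks:

```python
import h5py


def _open_cached_h5(path: str) -> h5py.File:
    """Hypothetical wrapper around the existing h5py.File open in
    _preprocess_jsc_dataset; it only adds a clearer error message."""
    try:
        return h5py.File(path, "r")
    except OSError as err:
        raise RuntimeError(
            f"Could not open cached dataset file '{path}'. It may be a "
            "partial download; delete it from ./.machop_cache/dataset "
            "and re-run to download it again."
        ) from err
```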