Problem running ML Ops training script on Windows

allannof commented 9 months ago

Did:

Followed the instructions for setting up the project in the ML Ops README found in this GitHub repository
Ran `python train.py --experiment-name "$EXPERIMENT_NAME" --dataset-loc "$DATASET_LOC" --train-loop-config "$TRAIN_LOOP_CONFIG" --num-samples 1000 --num-workers 4 --cpu-per-worker 1 --gpu-per-worker 0 --num-epochs 15 --batch-size 16 --results-fp results/training_results.json` from the `driver_fatigue_detection/mlops` directory.

Happened: TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.

To continue this run, you can use: `trainer = TorchTrainer.restore("anonymous_user_directory\ray_results\TorchTrainer_2023-11-22_09-30-02")`.

(RayTrainWorker pid=12820) Reducer buckets have been rebuilt in this iteration. [repeated 3x across cluster]

TuneError: Sync process failed: GetFileInfo() yielded path 'anonymous_user_directory/ray_results/TorchTrainer_2023-11-22_09-30-02/TorchTrainer_5500b_00000_0_2023-11-22_09-30-03', which is outside base dir 'anonymous_user_directory\ray_results\TorchTrainer_2023-11-22_09-30-02'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

anonymous_user_directory\Downloads\driver_fatigue_detection\mlops\train.py:238 in train_model

  235     )
  236
  237     # Train
❱ 238     results = trainer.fit()
  239     d = {
  240         "timestamp": datetime.datetime.now().strftime("%B %d, %Y %I:%M:%S %p"),
  241         "run_id": utils.get_run_id(experiment_name=experiment_name, trial_id=results.met

Expected: The script runs on Windows, the model is trained, and the resulting .json file is written to the path given by --results-fp.

Extra info: The script works when run on WSL (Ubuntu 22.04.2 LTS, GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64). The cause of this failure may be described in this write-up

OS: Windows 11
Python: Python 3.10.12
Ray: 2.6.0
Torch: 2.0.0
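
The two paths in the TuneError differ only in their separators (forward slashes in the yielded path, backslashes in the base dir), so a possible explanation is that the containment check treats them as unrelated. The sketch below only illustrates that mechanism with `pathlib`; it is not Ray's actual sync code, and the helper name `is_inside` is made up for this example.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath

# Paths copied from the TuneError message in this report.
base_dir = r"anonymous_user_directory\ray_results\TorchTrainer_2023-11-22_09-30-02"
trial_dir = (
    "anonymous_user_directory/ray_results/TorchTrainer_2023-11-22_09-30-02/"
    "TorchTrainer_5500b_00000_0_2023-11-22_09-30-03"
)


def is_inside(path: PurePath, base: PurePath) -> bool:
    """Hypothetical helper: True if `path` equals `base` or lies below it."""
    return path == base or base in path.parents


# POSIX semantics: only '/' is a separator, so the backslash-separated
# base dir is parsed as one long component and never matches a parent.
posix_result = is_inside(PurePosixPath(trial_dir), PurePosixPath(base_dir))

# Windows semantics: both '/' and '\' are separators, so the two paths
# are split into the same components and the check succeeds.
windows_result = is_inside(PureWindowsPath(trial_dir), PureWindowsPath(base_dir))

print(posix_result)    # False
print(windows_result)  # True
```

If something in the sync path compares these as POSIX-style paths or raw strings, that would be consistent with the "outside base dir" message, and also with the script working under WSL, where every path uses forward slashes.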

davbuf commented 9 months ago

Hi,

It seems Ray cannot be run on Windows, although some 'patches' may exist in a more recent version (e.g. 2.7.2): https://discuss.ray.io/t/ray-tune-and-ray-train-not-working-with-windows-path-storage-path/12263

I hope this solves your issue. Best

allannof commented 9 months ago

For whoever wishes to take on this issue:

Simply installing a newer version of Ray (for instance 2.7.2 or 2.8.0) will not work, because later versions change how datasets are configured. The code must be updated accordingly. See this piece of documentation
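
Until the code is ported, one option might be to have the training script fail fast with a readable message instead of the sync error. This is a minimal sketch, not code from this repository: the function names are made up, and the 2.7.2 threshold is only an assumption taken from the version mentioned earlier in this thread. It does not address the dataset configuration changes.

```python
from typing import Optional, Tuple

# Assumption: 2.7.2 is the first version reported in this thread as
# possibly handling Windows paths; adjust if that turns out to be wrong.
MIN_RAY_FOR_WINDOWS = (2, 7, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string.

    Stops at the first non-numeric piece, so "2.7.2rc0" becomes (2, 7).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def windows_path_warning(ray_version: str, system: str) -> Optional[str]:
    """Return a warning for native Windows with an older Ray, else None."""
    if system != "Windows":
        return None
    if parse_version(ray_version) >= MIN_RAY_FOR_WINDOWS:
        return None
    return (
        f"Ray {ray_version} on native Windows may fail with "
        "'Sync process failed ... outside base dir'. "
        "Running under WSL is reported to work."
    )


# Values from the environment described in this issue.
print(windows_path_warning("2.6.0", "Windows"))
```

In train.py the arguments would come from `ray.__version__` and `platform.system()`, which returns "Windows" on native Windows and "Linux" under WSL.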
