DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

cannot run PACO train on prod (vGPU server) #1181

Open · homework36 opened 2 weeks ago

homework36 commented 2 weeks ago

We can now access the GPU on the production server, and classification works, but PACO training always fails with the message below, even though the same workflow and input files finished successfully on staging:

Task Training model for Patchwise Analysis of Music Document, Training[eacd36d5-c8dd-4b02-b9cd-38ca31c92959] raised unexpected: RuntimeError("The job did not produce the output file for Background Model.\n\n{'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('c031c7c9-c86d-481f-ae62-59f0b2491828'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('c7a3b8ca-429a-4fcd-be92-2957e00497ba'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/faebcd1f-c384-44b9-9572-25f5ac27b12e'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('eea42920-5100-4a57-9604-7688d582b482'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/a69ba529-0053-4c7f-bc3d-333942300b15'}], 'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('3fd388db-1cf3-4400-9fb7-712bd3f6738e'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/03ac3bb9-3323-42a0-b989-5eadae3a0529'}]}")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 843, in run
    ).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Background Model.

I thought it was an out-of-memory issue, so I didn't think too much of it, since we are still waiting for the larger vGPU instance. However, after testing and looking into it further, it seems to be a different problem. I closed the vGPU driver issue (#1170) and will work on this instead.
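
For reference, a quick sanity check that can be run inside the worker container to rule out memory pressure might look like the following. This is a hypothetical snippet, not part of Rodan; it assumes TensorFlow is the Keras backend in the worker image and that nvidia-smi is on the PATH:

```python
# Hypothetical check (not part of Rodan): confirm the vGPU is visible to the
# worker and see how much memory is free, to rule out an out-of-memory cause.
import subprocess

import tensorflow as tf  # assumed backend, since PACO outputs are keras/model+hdf5

print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
print(subprocess.run(
    ["nvidia-smi",
     "--query-gpu=memory.total,memory.used,memory.free",
     "--format=csv"],
    capture_output=True, text=True).stdout)
```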

Related:

I'm baffled. The directory tmpa8eq61gp was created successfully with all necessary permissions, but no hdf5 files were written during the process. There were also no other logs or error messages to help identify the exact cause. Since the same workflow runs without any issue on staging, I don't think it's a bug in the PACO repo, and /rodan-main/code/rodan/jobs/base.py is only checking whether the output file exists. So it looks like it might be a bug in the Rodan PACO wrapper or something else.
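
For context, the check that raises the error in base.py appears to amount to an existence test over each declared output port, matching the format string in the traceback above. A rough sketch of that logic (illustrative only, with hypothetical names, not the actual Rodan code):

```python
# Illustrative sketch of the kind of post-run check rodan/jobs/base.py performs.
# Function and variable names here are hypothetical, not the real implementation.
import os

def verify_outputs(outputs):
    # `outputs` maps each output port name to a list of resource dicts,
    # as shown in the error message above.
    for opt_name, resources in outputs.items():
        for resource in resources:
            if not os.path.isfile(resource["resource_temp_path"]):
                raise RuntimeError(
                    "The job did not produce the output file for {0}.\n\n{1}".format(
                        opt_name, outputs
                    )
                )
```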

This is also strange because, of course, no hdf5 files will have been written if training has not started yet. In this case, however, it is the missing output file that seems to be what stops the training job.
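
One way to rule out a plain filesystem or permission problem would be to exec into the worker container and confirm that an hdf5 file can actually be created under /tmp. A hypothetical check (h5py is assumed to be available in the image, since the job's outputs are keras/model+hdf5):

```python
# Hypothetical sanity check, run inside the Rodan worker container:
# confirm an hdf5 file can actually be written under a /tmp directory.
import os
import tempfile

import h5py   # assumed present, since PACO saves keras/model+hdf5 outputs
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "write_test.hdf5")
    with h5py.File(path, "w") as f:
        f.create_dataset("test", data=np.zeros((4, 4)))
    print("wrote", path, os.path.getsize(path), "bytes")
```

If this succeeds, the problem is more likely that the training process exits or is killed before saving its models, rather than that the temp directory is unwritable.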

The same error is reproduced on a local machine with an Intel chip (where we don't have the GPU container problem that affects Apple Silicon/ARM machines).