deepforge-dev / deepforge

A modern development environment for deep learning
https://deepforge.org
Apache License 2.0

GetCifar10Data operation (in CIFAR10 example) fails with GME storage #1866

Open brollb opened 4 years ago

brollb commented 4 years ago

There are really two issues here. One is that the following error occurs when using GME storage: [screenshot: DeepinScreenshot_select-area_20200818085310]

The second issue is that this error isn't handled well when running the pipeline: the pipeline just seems to stop. Errors thrown when uploading the resulting data should mark the operation as failed and show the error.
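A rough sketch of the behavior being asked for, in case it helps frame the fix. The names `runOperation`, `execute`, and `uploadResults` are hypothetical and are not DeepForge or webgme APIs. The idea is only that a failure during the upload step ends up in the operation's reported status instead of being dropped.

```javascript
// Hypothetical sketch: runOperation, execute and uploadResults are
// illustrative names, not DeepForge or webgme APIs.
async function runOperation(execute, uploadResults) {
    // Run the operation itself; errors here propagate to the caller as usual.
    const outputs = await execute();
    try {
        // Upload the resulting data to the storage backend.
        await uploadResults(outputs);
        return {status: 'success', error: null};
    } catch (err) {
        // An upload error marks the operation as failed and keeps the message
        // so it can be shown to the user.
        return {status: 'failed', error: err.message};
    }
}
```

With something along these lines, the pipeline could stop on a `failed` status and display `error` rather than appearing to hang.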

umesh-timalsina commented 4 years ago

For me, using GMEStorage with server execution, the file results.json is never written, which causes the LocalExecutor to fail because it cannot find the file.

brollb commented 4 years ago

This appears to be an issue with the BlobClient's putFile method failing silently when uploading a stream. Generally, when the job shows up as a success but results.json is not written, it means that something failed when uploading the results to their corresponding storage backends.
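To illustrate the kind of silent failure meant here, below is a minimal Node.js sketch that does not use the BlobClient at all. `createUploadSink`, `uploadStream`, and `createFailingSource` are hypothetical stand-ins, and the simulated connection reset is an assumption made for the demo. It shows one way to make a stream upload reject when the source fails midway, using `pipeline` from Node's `stream/promises` module.

```javascript
const {Readable, Writable} = require('stream');
const {pipeline} = require('stream/promises');

// Hypothetical stand-in for a storage backend's upload sink (not the BlobClient API).
// It just collects the chunks it receives.
function createUploadSink(chunks) {
    return new Writable({
        write(chunk, _encoding, callback) {
            chunks.push(Buffer.from(chunk));
            callback();
        }
    });
}

// Upload a stream and reject if either side errors, instead of failing silently.
async function uploadStream(source) {
    const chunks = [];
    await pipeline(source, createUploadSink(chunks));
    return Buffer.concat(chunks);
}

// A source that fails partway through, simulating a dropped connection.
function createFailingSource() {
    return new Readable({
        read() {
            this.push('partial data');
            const err = Object.assign(new Error('socket hang up'), {code: 'ECONNRESET'});
            this.destroy(err);
        }
    });
}
```

Calling `uploadStream(createFailingSource())` returns a promise that rejects with the simulated error, so the caller can report the job as failed rather than as a success with no results.json.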

brollb commented 4 years ago

As this is caused by a dependency, I am going to remove this from the milestone so it doesn't block the other bug fixes marked for v2.4.1.

umesh-timalsina commented 4 years ago

This is failing for me even when using the sciserver-files service.

umesh-timalsina commented 3 years ago

Is this resolved?

brollb commented 3 years ago

Unfortunately not. The first portion of it (error handling) was fixed in the referenced PR from webgme-engine, but the ECONNRESET errors still happen for me.

I had deprioritized this since it wasn't an issue for the deployment and was specific to the GME storage backend, but I will make an issue on webgme-engine now.

brollb commented 3 years ago

Just opened an issue: https://github.com/webgme/webgme-engine/issues/240