Closed EmileSonneveld closed 1 month ago
This should also fix the error in integration tests.
openeo.rest.OpenEoApiError: [500] Internal: Server error: PermissionError(13, 'Permission denied') (ref: r-241016413b7f4eae8ed54d00388a426f)
Files.createTempFile
makes an empty file with by default very tight permissions. After a move the permissions stay the same. Now, no default empty file is created, and GDAL writes a file with more sensible permissions
@bossie had a nice idea to avoid the retry logic: Let executors write to unique paths, and send it to the driver. The driver chooses what files get moved to the final location, and then removes files from failed executors.
It's just that doing it twice apparently does not suffice, and why would doing it 5 times fix it?
If the problem is that executors are getting in each other's way (be it because of speculation or otherwise) maybe we should just avoid that.
I thought maybe this could work:
The unique filenames solution is probably more robust, but a file move on S3 is more heavy. It will need to download and upload the file again, giving some delays and maybe will take a lot of memory again. (Not more than during job execution I guess, but stil)
- the driver strips off the winning suffix, moves the file and deletes the losing files.
This also sounds feasible. Only have to be careful with the fact that users want to provide arbitrary asset names. Of course, if the suffix uses a fixed number of characters, then I guess it's fool proof?
I'll start implementing that one. If re-enabling the fusemount is urgent, we can maybe get this one trough already. Or at least the permissions fix
@EmileSonneveld yes perhaps merge this one already. I'd like to avoid too much delay.
There is already a version that only tries the move twice on staging. This will make 5 tries and is a bit more thread-safe