Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.
Apache License 2.0
5 stars 4 forks source link

Move file to results #335

Closed EmileSonneveld closed 1 month ago

EmileSonneveld commented 1 month ago

There is already a version that only tries the move twice on staging. This will make 5 tries and is a bit more thread-safe

EmileSonneveld commented 1 month ago

This should also fix the error in integration tests. openeo.rest.OpenEoApiError: [500] Internal: Server error: PermissionError(13, 'Permission denied') (ref: r-241016413b7f4eae8ed54d00388a426f)

Files.createTempFile makes an empty file with by default very tight permissions. After a move the permissions stay the same. Now, no default empty file is created, and GDAL writes a file with more sensible permissions

EmileSonneveld commented 1 month ago

@bossie had a nice idea to avoid the retry logic: Let executors write to unique paths, and send it to the driver. The driver chooses what files get moved to the final location, and then removes files from failed executors.

bossie commented 1 month ago

It's just that doing it twice apparently does not suffice, and why would doing it 5 times fix it?

If the problem is that executors are getting in each other's way (be it because of speculation or otherwise) maybe we should just avoid that.

I thought maybe this could work:

EmileSonneveld commented 1 month ago

The unique filenames solution is probably more robust, but a file move on S3 is more heavy. It will need to download and upload the file again, giving some delays and maybe will take a lot of memory again. (Not more than during job execution I guess, but stil)

jdries commented 1 month ago
  • the driver strips off the winning suffix, moves the file and deletes the losing files.

This also sounds feasible. Only have to be careful with the fact that users want to provide arbitrary asset names. Of course, if the suffix uses a fixed number of characters, then I guess it's fool proof?

EmileSonneveld commented 1 month ago

I'll start implementing that one. If re-enabling the fusemount is urgent, we can maybe get this one trough already. Or at least the permissions fix

jdries commented 1 month ago

@EmileSonneveld yes perhaps merge this one already. I'd like to avoid too much delay.