Closed by EmileSonneveld 1 week ago
Status: Disabling the fuse mount makes all files get written completely. CDSE staging still uses the fuse mount, so it could serve as a fallback. It can be quickly enabled/disabled by changing these lines: https://git.vito.be/projects/TPT/repos/os_creodias_openeo_k8s/browse/kube_resources/applications/openeo/values_cdse-prod.yaml#4-7 and running the promote job again.
Disabling the fuse mount and using S3 directly might cause issues with export_workspace.
FileChannel.open(Path.of(path)).force(...)
is an example taken from this library: https://github.com/eclipse-rdf4j/rdf4j/blob/main/core/common/io/src/main/java/org/eclipse/rdf4j/common/io/NioFile.java#L164C5-L164C7
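Spelled out as a self-contained helper, that call could look like the following one-shot sketch (the helper name forceToDisk is ours, not from rdf4j):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}

// One-shot fsync of an already-written file; sketch only.
// force(true) also flushes file metadata, not just the contents.
def forceToDisk(path: String): Unit = {
  val channel = FileChannel.open(Path.of(path), StandardOpenOption.WRITE)
  try channel.force(true)
  finally channel.close()
}
```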
The job got through with the file-move approach. Executors hit OOM a few times; this might be the initial reason for the incomplete output files.
With a file-move, the files are first written to the pod's /tmp directory though. A minimal sketch of that pattern is below (helper name and fallback logic are assumptions, not the actual implementation):
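```scala
import java.nio.file.{AtomicMoveNotSupportedException, Files, Paths, StandardCopyOption}

// Illustrative sketch of the file-move approach: the asset is written
// to local /tmp first and only then moved onto the shared target dir.
def moveToTarget(tmpPath: String, targetPath: String): Unit = {
  try {
    Files.move(Paths.get(tmpPath), Paths.get(targetPath), StandardCopyOption.ATOMIC_MOVE)
  } catch {
    // /tmp and the fuse mount are different filesystems, so the atomic
    // rename fails there and we fall back to an overwriting, copy-based move.
    case _: AtomicMoveNotSupportedException =>
      Files.move(Paths.get(tmpPath), Paths.get(targetPath), StandardCopyOption.REPLACE_EXISTING)
  }
}
```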
Oct 15, 2024 @ 16:05:23.716 INFO stitchAndWriteToTiff writeGeoTiff done. filePath: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VH_on_VV_P75.tif package.scala
Oct 15, 2024 @ 16:05:21.903 INFO FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VH_on_VV_P90.tif package.scala
Oct 15, 2024 @ 16:05:18.156 INFO FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VV_P75.tif package.scala
Oct 15, 2024 @ 16:05:17.448 INFO FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VV_P50.tif package.scala
Curious observation: the file permissions in the fuse mount changed over time:
kubectl *** -- /bin/bash
bash-4.4$ cd /batch_jobs/j-241015d4f747427882375efb47c311db/
drwxr-xr-x. 2 spark spark 4096 Oct 15 14:04 .
drwxr-xr-x. 1 spark spark 48 Oct 15 14:02 ..
-rw-rw-r--. 1 spark spark 13973 Oct 15 14:01 job_metadata.json
-rw-rw-r--. 1 spark spark 2234 Oct 15 14:00 job_specification.json
-rw-------. 1 spark spark 114825818 Oct 15 14:04 openEO_VH_P10.tif
-rw-r--r--. 1 spark spark 114858844 Oct 15 14:04 openEO_VH_P25.tif
-rw-r--r--. 1 spark spark 113986724 Oct 15 14:04 openEO_VH_P90.tif
-rw-r--r--. 1 spark spark 111953331 Oct 15 14:04 openEO_VH_on_VV_P10.tif
-rw-r--r--. 1 spark spark 111785039 Oct 15 14:04 openEO_VH_on_VV_P25.tif
-rw-r--r--. 1 spark spark 111524860 Oct 15 14:04 openEO_VH_on_VV_P50.tif
-rw-r--r--. 1 spark spark 114393944 Oct 15 14:04 openEO_VV_P10.tif
-rw-r--r--. 1 spark spark 114498596 Oct 15 14:04 openEO_VV_P25.tif
-rw-r--r--. 1 spark spark 114298964 Oct 15 14:04 openEO_VV_P90.tif
bash-4.4$ ls -al
total 1110032
drwxr-xr-x. 2 spark spark 4096 Oct 15 14:05 .
drwxr-xr-x. 1 spark spark 48 Oct 15 14:02 ..
-rw-r--r--. 1 spark spark 1628 Oct 15 14:05 collection.json
-rw-rw-r--. 1 spark spark 23811 Oct 15 14:05 job_metadata.json
-rw-rw-r--. 1 spark spark 2234 Oct 15 14:00 job_specification.json
-rw--w----. 1 spark spark 114825818 Oct 15 14:04 openEO_VH_P10.tif
-rw-r--r--. 1 spark spark 395 Oct 15 14:05 openEO_VH_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VH_P10.tif.json
-rw--w----. 1 spark spark 114858844 Oct 15 14:04 openEO_VH_P25.tif
-rw-r--r--. 1 spark spark 395 Oct 15 14:05 openEO_VH_P25.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VH_P25.tif.json
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VH_P50.tif.json
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VH_P75.tif.json
-rw--w----. 1 spark spark 113986724 Oct 15 14:04 openEO_VH_P90.tif
-rw-r--r--. 1 spark spark 395 Oct 15 14:05 openEO_VH_P90.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VH_P90.tif.json
-rw--w----. 1 spark spark 111953331 Oct 15 14:04 openEO_VH_on_VV_P10.tif
-rw-r--r--. 1 spark spark 394 Oct 15 14:05 openEO_VH_on_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 492 Oct 15 14:05 openEO_VH_on_VV_P10.tif.json
-rw--w----. 1 spark spark 111785039 Oct 15 14:04 openEO_VH_on_VV_P25.tif
-rw-r--r--. 1 spark spark 394 Oct 15 14:05 openEO_VH_on_VV_P25.tif.aux.xml
-rw-r--r--. 1 spark spark 492 Oct 15 14:05 openEO_VH_on_VV_P25.tif.json
-rw--w----. 1 spark spark 111524860 Oct 15 14:04 openEO_VH_on_VV_P50.tif
-rw-r--r--. 1 spark spark 394 Oct 15 14:05 openEO_VH_on_VV_P50.tif.aux.xml
-rw-r--r--. 1 spark spark 492 Oct 15 14:05 openEO_VH_on_VV_P50.tif.json
-rw-r--r--. 1 spark spark 492 Oct 15 14:05 openEO_VH_on_VV_P75.tif.json
-rw-r--r--. 1 spark spark 492 Oct 15 14:05 openEO_VH_on_VV_P90.tif.json
-rw--w----. 1 spark spark 114393944 Oct 15 14:04 openEO_VV_P10.tif
-rw-r--r--. 1 spark spark 392 Oct 15 14:05 openEO_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VV_P10.tif.json
-rw--w----. 1 spark spark 114498596 Oct 15 14:04 openEO_VV_P25.tif
-rw-r--r--. 1 spark spark 390 Oct 15 14:05 openEO_VV_P25.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VV_P25.tif.json
-rw--w----. 1 spark spark 114498158 Oct 15 14:05 openEO_VV_P50.tif
-rw-r--r--. 1 spark spark 391 Oct 15 14:05 openEO_VV_P50.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VV_P50.tif.json
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VV_P75.tif.json
-rw--w----. 1 spark spark 114298964 Oct 15 14:04 openEO_VV_P90.tif
-rw-r--r--. 1 spark spark 392 Oct 15 14:05 openEO_VV_P90.tif.aux.xml
-rw-r--r--. 1 spark spark 468 Oct 15 14:05 openEO_VV_P90.tif.json
bash-4.4$ command terminated with exit code 137
emile@emile-Precision-7680:~$ kubectl --kubeconfig ~/.kube/cdse_dev.yml -n spark-jobs-dev exec -it a-8dec3512508e410e878e3742c07ef718-driver -c spark-kubernetes-driver -- /bin/bash
bash-4.4$ cd /batch_jobs/j-241022e02b9745ce8438c66bbffbb5f9
bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:50+00:00
total 667033
drwxr-xr-x. 2 spark spark 4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark 48 Oct 22 14:59 ..
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-rw-r--. 1 spark spark 40783 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark 2487 Oct 22 14:59 job_specification.json
bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:53+00:00
total 667055
drwxr-xr-x. 2 spark spark 4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark 48 Oct 22 14:59 ..
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P10.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P25.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P50.tif.json
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 834 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P90.tif.json
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-r--r--. 1 spark spark 391 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 796 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P25.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P50.tif.json
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-r--r--. 1 spark spark 392 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 797 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P90.tif.json
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-r--r--. 1 spark spark 390 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 795 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P25.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P50.tif.json
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.json
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.aux.xml
-rw-r--r--. 1 spark spark 794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.json
-rw-r--r--. 1 spark spark 2313 Oct 22 15:05 collection.json
-rw-rw-r--. 1 spark spark 42573 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark 2487 Oct 22 14:59 job_specification.json
bash-4.4$ command terminated with exit code 137
But Kibana shows that the driver did not find the path even after the fuse mount in the same pod showed it existed. It is only 1 second apart, so maybe the timestamps are slightly offset.
Oct 22, 2024 @ 17:05:54.290 ERROR OpenEO batch job failed: "[Errno 2] No such file or directory: '/batch_jobs/j-241022e02b9745ce8438c66bbffbb5f9/LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P10.tif'"
job_id j-241022e02b9745ce8438c66bbffbb5f9
kubernetes.pod_name a-8dec3512508e410e878e3742c07ef718-driver
Will try with a wait loop now
Retrying did work on CDSE dev:
Oct 22, 2024 @ 18:44:28.795 INFO Waiting for path to be available. Try 2/5: /batch_jobs/j-2410227f4b0d4af6876341faba4c976d/openEO_2023-06-04Z_B02.tif
Oct 22, 2024 @ 18:44:18.795 INFO Waiting for path to be available. Try 1/5: /batch_jobs/j-2410227f4b0d4af6876341faba4c976d/openEO_2023-06-04Z_B02.tif
TODO: dedup time_machine test code https://github.com/Open-EO/openeo-geopyspark-driver/pull/916#discussion_r1813028019
FYI a test in this integration tests run failed (Kibana&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',key:job_id,negate:!f,params:(query:j-241028ce05c64facb4a37ce5b4241fdc),type:phrase),query:(match:(job_id:(query:j-241028ce05c64facb4a37ce5b4241fdc,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',key:levelname,negate:!f,params:(query:ERROR),type:phrase),query:(match:(levelname:(query:ERROR,type:phrase))))),index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc))))) with:
java.nio.file.FileSystemException: /data/projects/OpenEO/j-241028ce05c64facb4a37ce5b4241fdc/openEO_2018-01-01Z.tif: Stale file handle
There's also this one where apparently two executors attempted to write the same output asset (Kibana&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',key:job_id,negate:!f,params:(query:j-2410284b467d4ac68e183a8144480bc8),type:phrase),query:(match:(job_id:(query:j-2410284b467d4ac68e183a8144480bc8,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',key:levelname,negate:!f,params:!(ERROR,WARNING,INFO),type:phrases,value:'ERROR,%20WARNING,%20INFO'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(levelname:ERROR)),(match_phrase:(levelname:WARNING)),(match_phrase:(levelname:INFO))))))),index:'907d8590-4f1b-11ed-8cc4-3747d5233c59',interval:auto,query:(language:kuery,query:''),sort:!(!('@timestamp',desc))))) but it ultimately went missing.
OK, then the per-executor write followed by a move/copy is really needed.
The driver side in Scala will need a wait_till_path_available function too, then.
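A sketch of what that Scala-side function could look like; the 5 tries and 10 s interval are read off the "Try n/5" log lines above, the rest is assumption:

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

// Hypothetical Scala counterpart of the wait_till_path_available loop,
// mirroring the "Waiting for path to be available. Try n/5" logs above.
def waitTillPathAvailable(path: String, tries: Int = 5, sleepMillis: Long = 10000L): Unit = {
  var attempt = 1
  while (!Files.exists(Paths.get(path))) {
    if (attempt > tries)
      throw new FileNotFoundException(s"Path did not appear after $tries tries: $path")
    println(s"Waiting for path to be available. Try $attempt/$tries: $path")
    Thread.sleep(sleepMillis)
    attempt += 1
  }
}
```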
Observed errors with the moveOverwriteWithRetries implementation:
Found by Peter:
Stage error: Job aborted due to stage failure: Task 4 in stage 38.0 failed 4 times, most recent failure: Lost task 4.3 in stage 38.0 (TID 1186) (10.42.7.154 executor 2): java.io.IOException: Resource temporarily unavailable
at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:82)
at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:461)
at org.openeo.geotrellis.geotiff.package$.writeGeoTiff(package.scala:862)
at org.openeo.geotrellis.geotiff.package$.writeTiff(package.scala:602)
at org.openeo.geotrellis.geotiff.package$.$anonfun$saveRDDTemporalAllowAssetPerBand$4(package.scala:191)
This might be due to a flaky S3 connection or a race condition between executors. It might be good to put the Scala fsync under a retry and check whether the error occurs again.
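One way that retry wrapper could look (a sketch; the retry count and backoff are assumptions, not the actual implementation):

```scala
import java.io.IOException
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}
import scala.annotation.tailrec

// Sketch: retry the fsync when it fails with a transient IOException
// such as "Resource temporarily unavailable" from force0.
@tailrec
def forceWithRetries(path: String, triesLeft: Int = 3): Unit = {
  val channel = FileChannel.open(Path.of(path), StandardOpenOption.WRITE)
  val transientFailure =
    try { channel.force(true); false }
    catch {
      case _: IOException if triesLeft > 1 => true // retry after closing
    } finally channel.close()
  if (transientFailure) {
    Thread.sleep(1000L)
    forceWithRetries(path, triesLeft - 1)
  }
}
```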
In test_load_collection_references_correct_batch_process_id:
sun.nio.fs.UnixException: No such file or directory
Task error: ExceptionFailure(java.nio.file.FileSystemException,/data/projects/OpenEO/j-2411054cbc7d4d0c868e3698acac18d3/openEO_2018-01-01Z.tif: Stale file handle,[Ljava.lang.StackTraceElement;@5292477a,java.nio.file.FileSystemException: /data/projects/OpenEO/j-2411054cbc7d4d0c868e3698acac18d3/openEO_2018-01-01Z.tif: Stale file handle
Example graph that uses separate_asset_per_band and has empty TIFF files: j-241009a45a764383a3a3db1453b9881f
Making the batch job write to S3 directly instead of to the fuse mount avoids this issue. Need to check whether fsync also avoids the issue: https://github.com/yandex-cloud/geesefs/blob/master/README.md?plain=1#L279-L299 and https://teams.microsoft.com/l/message/19:2941a270bf8e48a2a8e8a23975051c11@thread.skype/1728034200593?tenantId=9e2777ed-8237-4ab9-9278-2c144d6f6da3&groupId=8c9c739d-2544-4def-8cd4-b65970551b70&parentMessageId=1728034200593&teamName=Unit%20TAP&channelName=openEO-users&createdTime=1728034200593