dlr-eoc / prosEO

prosEO – A Processing System for Earth Observation Data
GNU General Public License v3.0

Storage Manager: Downloaded files not fully accessible by processing jobs #150

Closed: tangobravo62 closed this issue 2 years ago

tangobravo62 commented 3 years ago

The following was observed:

Step 1: The wrapper downloads a file from the Storage Manager:

2021-05-29T08:56:18.245 s5p-proseo-vdafop-wrapper 01.12.00: [I] GET http://storage-mgr-service.default.svc.cluster.local:3000/proseo/storage-mgr/v0.1/productfiles?pathInfo=s3%3A%2F%2Fproseo-data-001%2F138853%2FS5P_TEST_L2__AER_AI_20190509T220707_20190509T234837_08137_70_020200_20210529T084430.nc
2021-05-29T08:56:18.841 s5p-proseo-vdafop-wrapper 01.12.00: [I] Fetched 1 input files and prepared directories for 1 outputs -- Ready for processing using Container-JOF /opt/s5p/sw/JobOrder.176762365.xml

When the REST request returns after 0.6 s, the file should be fully written to the persistent volume that is accessed by both the Storage Manager and the processing job. However, the processor CFI throws an error when accessing the file:

2021-05-29T08:56:19.104343 proseo-val-k8s-node-nf-1 VDAFOP 01.13.00 [000000000028]: [I] input product is S5P_TEST_L2__AER_AI_20190509T220707_20190509T234837_08137_70_020200_20210529T084430.nc
[...]
2021-05-29T08:56:19.255667 proseo-val-k8s-node-nf-1 VDAFOP 01.13.00 [000000000028]: [E] [HDF5] H5F_sblock_load(): truncated file: eof = 93732864, sblock->base_addr = 0, stored_eoa = 131002532 (major="File accessibilty", minor="File has been truncated") (H5Fsuper_cache.c:474)

The error message shows that the file is truncated. Checking the modification timestamp on the file system shows that the file was last modified roughly 500 ms after the processing job's access attempt:

[linux@proseo-val-bastion-1 proseodata]$ ll --time-style full-iso 138853/S5P_TEST_L2__AER_AI_20190509T220707_20190509T234837_08137_70_020200_20210529T084430.nc
-rw-rw-rw-. 1 nfsnobody nfsnobody 131002532 2021-05-29 08:56:19.716979037 +0000 138853/S5P_TEST_L2__AER_AI_20190509T220707_20190509T234837_08137_70_020200_20210529T084430.nc

As a first attempt to solve this issue, the Storage Manager implementation was changed from delegating the download entirely to the S3 API to retrieving an input stream and writing the file under the control of the Storage Manager itself (from commit 7e1e563):

-           s3.getObject(GetObjectRequest.builder().bucket(s3uri.getBucket()).key(s3uri.getKey()).build(),
-                   ResponseTransformer.toFile(Paths.get(ContainerPath)));
+           
+           ResponseInputStream<GetObjectResponse> is = s3.getObject(GetObjectRequest.builder().bucket(s3uri.getBucket()).key(s3uri.getKey()).build());
+           if (is != null) {
+               Boolean hasError = false;
+               try {
+                   Files.copy(is, Paths.get(ContainerPath));
+               } catch (IOException e) {
+                   hasError = true;
+                   logger.error("Failed to copy " + s3Object + " to " + "file://" + ContainerPath + " | " + e.getMessage());                
+               }
+               try {
+                   is.close();
+               } catch (IOException e) {
+                   hasError = true;
+                   logger.error("Failed to close input stream of " + s3Object + " | " + e.getMessage());
+               }
+               if (hasError) {
+                   return hasError;
+               }
+           }

Apparently, from the perspective of the Storage Manager and its operating system the file has been fully written to disk, but NFS performs some kind of client-side caching that prevents the data from becoming visible to the other NFS clients in time.
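
One way to make completeness verifiable in this situation is to compare the size of the file on the shared volume with the object size reported by S3. A minimal sketch, assuming the AWS SDK v2 used in the diff above; class and parameter names are illustrative, not the actual prosEO code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class SizeCheck {

    /** Return true if the local copy has the same length as the S3 object. */
    public static boolean isCopyComplete(S3Client s3, String bucket, String key, String containerPath)
            throws IOException {
        // Expected size from the object metadata (Content-Length), no download required
        long expectedSize = s3.headObject(
                HeadObjectRequest.builder().bucket(bucket).key(key).build()).contentLength();

        // Actual size of the file on the shared persistent volume
        long actualSize = Files.size(Paths.get(containerPath));

        return actualSize == expectedSize;
    }
}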

emelchinger commented 3 years ago

Check the size of files copied between the Storage Manager and the Wrapper via the NFS share. This is necessary to avoid problems due to NFS client caching.

The maximum cycle count and the cycle wait time are defined for storageManager and productionPlanner in application.yml by:

filecheckwaittime: 1000
filecheckmaxcycles: 50
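
A minimal sketch of such a size check on the Wrapper side, using the two configuration values above; class and method names are illustrative, not the actual prosEO code:

import java.io.File;

public class DownloadSizeCheck {

    /**
     * Poll the downloaded file until it reaches the expected size.
     *
     * @param path         local path of the downloaded product file
     * @param expectedSize file size reported by the Storage Manager
     * @param waitTime     wait interval in milliseconds (filecheckwaittime, e.g. 1000)
     * @param maxCycles    maximum number of cycles (filecheckmaxcycles, e.g. 50)
     * @return true if the file reached the expected size within the allowed cycles
     */
    public static boolean waitForFileSize(String path, long expectedSize, long waitTime, int maxCycles)
            throws InterruptedException {
        File file = new File(path);
        for (int cycle = 0; cycle < maxCycles; cycle++) {
            if (file.exists() && file.length() >= expectedSize) {
                return true;
            }
            Thread.sleep(waitTime);
        }
        return false;
    }
}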

The base wrapper needs:

<dependency>
    <groupId>org.jboss.resteasy</groupId>
    <artifactId>resteasy-client</artifactId>
    <version>3.0.2.Final</version>
</dependency>
tangobravo62 commented 2 years ago

Alas, the implementation above did not resolve the problem; it has now resurfaced on a different infrastructure. Some more debugging yielded the following results:

(1) Wrapper log downloading file:

2021-09-16T18:03:39.594 s5p-proseo-vdafop-wrapper 01.12.00: [I] GET http://10.8.212.13:8080/proseo/storage-mgr/v0.1/productfiles?pathInfo=s3%3A%2F%2Fproseo-val-backend%2F36785%2FS5P_TEST_L2__CLOUD__20200930T223846_20201001T002015_15373_02_020201_20210916T174112.nc
2021-09-16T18:03:44.090 s5p-proseo-vdafop-wrapper 01.12.00: [D] response = org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine$1@2462cb01
2021-09-16T18:03:44.228 s5p-proseo-vdafop-wrapper 01.12.00: [D] ... Testing wait for download of /proseo/data/36785/S5P_TEST_L2__CLOUD__20200930T223846_20201001T002015_15373_02_020201_20210916T174112.nc, wait interval 1000, max cycles 50; size is 294084608, expected 271268369

So the Wrapper starts the download at 2021-09-16 18:03:39.594 UTC and gets the completion response from the Storage Manager roughly 4 s later, at 18:03:44.090. The following debug message shows that the HTTP response reported a file size of 271,268,369 bytes, but at that moment the actual size was already 294,084,608 bytes, so additional data had obviously been written in the meantime, and the size returned by the Storage Manager was incorrect.

(2) Error from processor when accessing the file:

2021-09-16T18:03:44.667737 worker1 VDAFOP 01.13.00 [000000000028]: [E] [HDF5] H5F_sblock_load(): truncated file: eof = 358365713, sblock->base_addr = 0, stored_eoa = 476209312 (major="File accessibilty", minor="File has been truncated") (H5Fsuper_cache.c:474)

The processor apparently expects a file size of 476,209,312 bytes, but at the time of accessing the file, it only finds 358,365,713 bytes and aborts with an error message.

(3) Actual file size and time stamp on file system:

-rw-rw-rw-.    1 root root 476209312 2021-09-16 18:03:45.458296683 +0000 S5P_TEST_L2__CLOUD__20200930T223846_20201001T002015_15373_02_020201_20210916T174112.nc

It is only at 18:03:45.458 that the file is completely written. The file has a size of 476,209,312 bytes as expected by the processor.

The resolution approach should be to ensure that the Storage Manager does not return before the file has been fully written to disk. The Wrapper may still check whether the file has the correct size, to cover cases in which the Storage Manager itself does not write to a local disk but to an NFS file share, where its own NFS client may buffer the data.

tangobravo62 commented 2 years ago

Implemented a wait routine in the Storage Manager after the transfer of a file from S3 storage, comparing the actual file size with the file size given in the Content-Length header of the S3 HTTP response. The response to the client is delayed in multiples of 500 ms until the file is completely written.
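
A minimal sketch of such a wait routine, assuming the AWS SDK v2 shown in the diff above; class and variable names as well as the maximum cycle safety limit are illustrative rather than the actual prosEO implementation:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class S3FetchWithWait {

    private static final long WAIT_MILLIS = 500L;    // delay step as described above
    private static final int MAX_WAIT_CYCLES = 600;  // safety limit (assumption, not from the issue)

    /** Copy an S3 object to a local file and wait until the file size matches the Content-Length. */
    public static void fetchAndWait(S3Client s3, String bucket, String key, String containerPath)
            throws IOException, InterruptedException {
        Path target = Paths.get(containerPath);
        long expectedSize;

        try (ResponseInputStream<GetObjectResponse> is = s3.getObject(
                GetObjectRequest.builder().bucket(bucket).key(key).build())) {
            // Expected size from the Content-Length header of the S3 HTTP response
            expectedSize = is.response().contentLength();
            Files.copy(is, target);
        }

        // Delay the response to the client in steps of 500 ms until the file is completely written
        int cycle = 0;
        while (Files.size(target) < expectedSize && cycle++ < MAX_WAIT_CYCLES) {
            Thread.sleep(WAIT_MILLIS);
        }
    }
}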

tangobravo62 commented 2 years ago

Wait routine apparently works well:

2021-09-17 16:11:25.544 TRACE 1 --- [nio-8080-exec-6] de.dlr.proseo.storagemgr.fs.s3.S3Ops     : >>> v2FetchFile(s3, s3://proseo-val-backend/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc, /proseo/data/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc)
2021-09-17 16:11:30.043  INFO 1 --- [nio-8080-exec-6] de.dlr.proseo.storagemgr.fs.s3.S3Ops     : ... waiting to complete writing of /proseo/data/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc
2021-09-17 16:11:30.543  INFO 1 --- [nio-8080-exec-6] de.dlr.proseo.storagemgr.fs.s3.S3Ops     : ... waiting to complete writing of /proseo/data/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc
2021-09-17 16:11:31.043  INFO 1 --- [nio-8080-exec-6] de.dlr.proseo.storagemgr.fs.s3.S3Ops     : ... waiting to complete writing of /proseo/data/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc
2021-09-17 16:11:31.544  INFO 1 --- [nio-8080-exec-6] de.dlr.proseo.storagemgr.fs.s3.S3Ops     : Copied S3 object s3://proseo-val-backend/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc to file /proseo/data/37188/S5P_TEST_L2__CLOUD__20201001T002015_20201001T020145_15374_01_020201_20210917T154634.nc