COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, IVV bugs, epics, ...)

[BUG][L1] PSC_PreprocMain segfault error #815

Closed Woljtek closed 1 year ago

Woljtek commented 1 year ago

Environment:

Related user stories:

Current Behavior: All jobs received by the S1_L1 execution worker (EW) end with exit code 139:

2023-02-06T09:08:07.597 | INFO  | e.s.c.i.e.w.j.p.TaskCallable [pool-18937-thread-1]: Ending task /usr/local/components/S1IPF/bin/PSC_PreprocMain with exit code 139
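For reference, exit code 139 reported by the wrapper is 128 + 11, i.e. the process was killed by signal 11 (SIGSEGV); a quick decode, assuming a bash shell:

$ kill -l $((139 - 128))
SEGV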

On the node, a kernel log indicates a segmentation fault of PSC_PreprocMain at the same moment:

Feb  6 09:08:07 cluster-ops-node-103 kernel: [342689.653353] PSC_PreprocMain[3006605]: segfault at 7ffc243c5ff8 ip 0000000000409250 sp 00007ffc243c6000 error 6 in PSC_PreprocMain[400000+2cbf000]
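Decoding that trap line (standard x86 page-fault semantics, nothing IPF-specific):

segfault at 7ffc243c5ff8    # faulting address, 8 bytes below sp
ip 0000000000409250         # instruction pointer, inside the PSC_PreprocMain mapping
sp 00007ffc243c6000         # stack pointer at the time of the fault
error 6                     # 0b110: user-mode write to a not-present page

A write just below the stack pointer on a not-present page is the typical signature of stack exhaustion, which is consistent with the stack-size workaround found later in this thread.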

Expected Behavior: The job shall produce S1_L1 products or generate a functional error (127 or 128, according to the S1-L1 ICD MPC-265 v1.12).

Steps To Reproduce: Systematic

Test execution artefacts (i.e. logs, screenshots…)

Syslog extract of node-103:

Download: https://app.zenhub.com/files/398313496/ad12c5d5-3eef-46f3-a316-8cb70769c1d0/download

Feb  6 08:57:45 cluster-ops-node-103 kernel: [342067.781381] PSC_PreprocMain[3001139]: segfault at 7ffdc24285f8 ip 0000000000409250 sp 00007ffdc2428600 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 08:57:45 cluster-ops-node-103 kernel: [342067.781387] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 08:59:38 cluster-ops-node-103 kernel: [342180.337895] PSC_PreprocMain[3002163]: segfault at 7ffd100013f8 ip 0000000000409250 sp 00007ffd10001400 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 08:59:38 cluster-ops-node-103 kernel: [342180.337902] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:01:34 cluster-ops-node-103 kernel: [342296.809293] PSC_PreprocMain[3003144]: segfault at 7ffd829fe8f8 ip 0000000000409250 sp 00007ffd829fe900 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:01:34 cluster-ops-node-103 kernel: [342296.809300] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:03:19 cluster-ops-node-103 kernel: [342401.516340] PSC_PreprocMain[3004087]: segfault at 7ffde55c0778 ip 0000000000409250 sp 00007ffde55c0780 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:03:19 cluster-ops-node-103 kernel: [342401.516346] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:05:03 cluster-ops-node-103 kernel: [342505.730457] PSC_PreprocMain[3004975]: segfault at 7ffec0ff8678 ip 0000000000409250 sp 00007ffec0ff8680 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:05:03 cluster-ops-node-103 kernel: [342505.730462] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:05:23 cluster-ops-node-103 systemd[1]: run-containerd-runc-k8s.io-7fcced5c8b66dacfd768fbb00a8ae1a468117e2380a297c4ccabe5a7485d5797-runc.5pvgRO.mount: Succeeded.
Feb  6 09:06:34 cluster-ops-node-103 kernel: [342596.982401] PSC_PreprocMain[3005817]: segfault at 7fff825e8e78 ip 0000000000409250 sp 00007fff825e8e80 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:06:34 cluster-ops-node-103 kernel: [342596.982407] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:08:07 cluster-ops-node-103 kernel: [342689.653353] PSC_PreprocMain[3006605]: segfault at 7ffc243c5ff8 ip 0000000000409250 sp 00007ffc243c6000 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:08:07 cluster-ops-node-103 kernel: [342689.653359] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:09:15 cluster-ops-node-103 systemd[1]: run-containerd-runc-k8s.io-7fcced5c8b66dacfd768fbb00a8ae1a468117e2380a297c4ccabe5a7485d5797-runc.Lnkdyc.mount: Succeeded.
Feb  6 09:09:52 cluster-ops-node-103 kubelet[23940]: I0206 09:09:52.067883   23940 prober.go:116] "Probe failed" probeType="Readiness" pod="logging/fluent-bit-z4rmj" podUID=176a5b0d-a25d-479f-bcb5-0125441908fe containerName="fluent-bit" probeResult=failure output="Get \"http://10.244.118.2:2020/api/v1/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb  6 09:09:52 cluster-ops-node-103 kubelet[23940]: I0206 09:09:52.067883   23940 prober.go:116] "Probe failed" probeType="Liveness" pod="logging/fluent-bit-z4rmj" podUID=176a5b0d-a25d-479f-bcb5-0125441908fe containerName="fluent-bit" probeResult=failure output="Get \"http://10.244.118.2:2020/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb  6 09:09:58 cluster-ops-node-103 kernel: [342800.433918] PSC_PreprocMain[3007556]: segfault at 7fffde30f2f8 ip 0000000000409250 sp 00007fffde30f300 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:09:58 cluster-ops-node-103 kernel: [342800.433925] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d
Feb  6 09:12:01 cluster-ops-node-103 CRON[3008666]: (root) CMD (   test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond)
Feb  6 09:12:03 cluster-ops-node-103 kernel: [342925.567788] PSC_PreprocMain[3008693]: segfault at 7fff09ff43f8 ip 0000000000409250 sp 00007fff09ff4400 error 6 in PSC_PreprocMain[400000+2cbf000]
Feb  6 09:12:03 cluster-ops-node-103 kernel: [342925.567794] Code: 1f 00 e9 73 ff ff ff 0f 1f 00 55 48 89 e5 48 83 e4 80 41 54 41 56 48 81 ec 70 1b a9 1f 49 89 f6 41 89 fc 33 f6 bf 03 00 00 00 <e8> eb db 2a 02 0f ae 1c 24 81 0c 24 40 80 00 00 0f ae 14 24 80 3d

Log EW low S1-L1: https://app.zenhub.com/files/398313496/3ac5b298-bb6b-4f72-a8fe-0362fb4dab78/download

Whenever possible, first analysis of the root cause: these errors do not seem to be caused by a lack of resources (RAM or CPU) (monitoring screenshots attached).

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

w-fsi commented 1 year ago

This error is likely related to the image itself. When executing the binary directly, the segfault occurs as well. It seems, however, that all referenced dynamic libraries can be resolved by the system. The L2 images are affected by this issue as well.

I find it suspicious that there is a link from lib64 to lib, but the target does not seem to exist. The RPMs do not seem to provide that folder. Whether these IPFs ship without any libraries is beyond my knowledge; the ASP, at least, seems to come with a wide set of libraries.

We likely need to extend the Docker image with some tools to investigate further. As it was running before and the simulator seems to work fine, it is unlikely that the job order contains something causing this error; it is more likely the installation of the IPF within the image that is at fault.
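A minimal set of shell checks that could support this investigation once the tools are available in the image (only the binary path is taken from the logs above; the lib64 path is an assumption):

ldd /usr/local/components/S1IPF/bin/PSC_PreprocMain   # do all referenced dynamic libs resolve?
ls -lL /usr/local/components/S1IPF/lib64              # does the lib64 link target actually exist?
rpm -qa | grep -i 'S1PD-IPF'                          # which IPF RPMs were actually installed?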

Woljtek commented 1 year ago

The segfault error is caused by the missing lib directory. The lib directory should have been deployed by the L2 RPM during the build of the image, but the installation of that RPM was skipped due to an RPM dependency. So the image s1_l12:3.5.2-demless cannot work.

We are looking for another way to work around the size of the S1-L12 SDP (removing DEM & LUT after RPM installation).
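A hypothetical sketch of that build-time workaround, assuming the DEM and LUT payloads land under the S1IPF installation directory (the RPM file names appear in the README quoted later in this thread):

# install all RPMs so dependencies are satisfied, then reclaim the image size
rpm -ivh S1PD-IPF-*.rpm
rm -rf /usr/local/components/S1IPF/DEM /usr/local/components/S1IPF/LUT   # assumed paths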

LAQU156 commented 1 year ago

IVV_CCB_2023_w06 : Under Analysis, waiting for an efficient workaround, Priority blocking

SYTHIER-ADS commented 1 year ago

Hi @w-fsi ,

We created a new Docker image without DEM and other external data (see the PR on rs-config here). We got exactly the same behavior.

Can you confirm that the following points are taken into account when building the RS add-on? This is extracted from the README of the Docker image (README_dockerfile.rst):

How to build docker image

Requirements

Procedure

copy the Dockerfile_ipf00352 near the rpm files

for 3.5.2:

S1PD-IPF-DEM-AUX-2.6.1-00.01.x86_64.rpm
S1PD-IPF-LUT_3.4.0-3.4.0-00.01.x86_64.rpm
S1PD-IPF-L1_3.5.1-3.5.1-00.01.x86_64.rpm
S1PD-IPF-L2_3.5.2-3.5.2-00.01.x86_64.rpm
S1PD-IPF-TT_3.5.2-3.5.2-00.01.x86_64.rpm
S1PD-IPF-MAIN-CFG_3.5.2-3.5.2-00.01.x86_64.rpm

then run the build command

docker build -f Dockerfile_ipf00352 --force-rm -t ipf_v00352 .

Specific configuration matching the machine hosting the IPF

IMPORTANT:

After the build, the image will have the configuration for the build machine. This configuration must be tuned to the machine actually running the IPF. The point is to update the number of cores to be used by the IPF.

For the ADS / PDGS hosting environment, a configuration script / helper is provided. This uses the name of the machine to derive the number of cores to be used.

For other hosting environments, similar tuning must be done. The configuration script for the ADS / PDGS environment can be used as an example.

# this should be done on the PDGS machine once

# run the configuration script
$ docker run --name ipf_cfg --network host -ti ipf_v00352 bash

[piccontrol@xxxxxxxxxxx ~]$ LANG=C
[piccontrol@xxxxxxxxxxx ~]$ h=$(hostname)
[piccontrol@xxxxxxxxxxx ~]$ center=${h:3:4}
[piccontrol@xxxxxxxxxxx ~]$ # Get the number of sockets
[piccontrol@xxxxxxxxxxx ~]$ SOCKETS=$(lscpu | grep -E "^(CPU )?[Ss]ocket" | tr " " "\n" | tail -n 1)
[piccontrol@xxxxxxxxxxx ~]$ # Get the number of real cores per socket
[piccontrol@xxxxxxxxxxx ~]$ CORES_PER_SOCKET=$(lscpu | grep "Core(s) per socket" | awk '{print $4}')
[piccontrol@xxxxxxxxxxx ~]$ # Compute the total number of cores
[piccontrol@xxxxxxxxxxx ~]$ CORES=`expr $SOCKETS \* $CORES_PER_SOCKET`
[piccontrol@xxxxxxxxxxx ~]$ sudo /usr/local/conf/IPF_CFG/configure_pdgs.sh $center $CORES
[piccontrol@xxxxxxxxxxx ~]$ exit
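For reference, the same physical-core count can be derived in one step with the same util-linux lscpu:

# count unique (core, socket) pairs instead of multiplying the two lscpu fields
CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)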

Please note that, to change the processing center name and location written in the products, the 
processing_configuration.xml file should be modified accordingly.

# commit the configuration in the final docker image
# 
$ docker commit ipf_cfg ipf_v00352

Usage

docker run --rm -v /path:/path --network host ipf_v00352 /usr/local/components/S1IPF/bin/LOP_ProcMain /path/JobOrder.XXXXXXXXXXX.xml
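Given the exit-code conventions discussed in this ticket (127/128 functional errors per the ICD, 139 for a segfault), the outcome of such a run can be checked directly; a minimal sketch:

docker run --rm -v /path:/path --network host ipf_v00352 \
    /usr/local/components/S1IPF/bin/LOP_ProcMain /path/JobOrder.XXXXXXXXXXX.xml
echo $?   # 0 = nominal, 127/128 = functional error, 139 = 128 + SIGSEGV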

Thanks for your feedback

w-fsi commented 1 year ago

Werum did not perform an integration of the IPF; we take the images as-is, just adding the execution worker into the image and, if required, applying workarounds. The Dockerfile is everything that is executed during the build.

SYTHIER-ADS commented 1 year ago

Dear @w-fsi, understood, but it seems some steps need to be performed as part of the RS add-on wrapping, since they depend on the machine where the image will be deployed.

w-fsi commented 1 year ago

Yes, as far as I am aware, in the S1Pro context there was an entry-point script from Airbus that was doing this kind of check. I raised this point with Fabien in the meeting two days ago, as I assume the kernel settings from the README are missing there. This script, or the settings, needs to be contained in the provided base image; all integrative activities need to be performed within this layer.

Woljtek commented 1 year ago

We have continued our investigation and finally ran the S1-L1 IPF without the segfault error. We performed 3 actions on the container:

  1. Run the post-deployment configuration as the root user:
    
    [root@s1-l1-part1-execution-worker-high ~]# LANG=C
    [root@s1-l1-part1-execution-worker-high ~]# center=rs
    [root@s1-l1-part1-execution-worker-high ~]# SOCKETS=$(lscpu | grep -E "^(CPU )?[Ss]ocket" | tr " " "\n" | tail -n 1)
    [root@s1-l1-part1-execution-worker-high ~]# CORES_PER_SOCKET=$(lscpu | grep "Core(s) per socket" | awk '{print $4}')
    [root@s1-l1-part1-execution-worker-high ~]# CORES=`expr $SOCKETS \* $CORES_PER_SOCKET`
    [root@s1-l1-part1-execution-worker-high ~]# sudo /usr/local/conf/IPF_CFG/configure_pdgs.sh $center $CORES
2. As piccontrol, unlimit the stack size (`unlimit stacksize` is csh syntax; the bash equivalent is `ulimit -s unlimited`): `[piccontrol@s1-l1-part1-execution-worker-high app]$ unlimit stacksize`
3. As piccontrol, start PSC_PreprocMain:

[piccontrol@s1-l1-part1-execution-worker-high-v9-595f678b6d-ng88z /app]$ PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
[piccontrol@s1-l1-part1-execution-worker-high-v9-595f678b6d-ng88z /app]$ echo $?
128


Error 128 is a nominal error (according to the S1-L1 ICD):
![image.png](https://images.zenhubusercontent.com/618e932533b15808a281c31c/efd34955-dfbd-4696-823d-40ad546923dd)

While trying step 3 as the root user, we reproduced the behavior of the issue:

[piccontrol@s1-l1-part1-execution-worker-high app]$ sudo su
[root@s1-l1-part1-execution-worker-high app]# PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
bash: PSC_PreprocMain: command not found
[root@s1-l1-part1-execution-worker-high app]# /usr/local/components/S1IPF/bin/PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
Segmentation fault (core dumped)
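Put together, a minimal bash sketch of the successful sequence above (paths and center name as used in this test; `ulimit -s unlimited` replaces the csh-style `unlimit stacksize`, and limit inheritance across su may depend on PAM configuration):

#!/bin/bash
# sketch of the manual workaround, not the delivered fix
set -e

# 1. post-deployment configuration, as root
center=rs
SOCKETS=$(lscpu | grep -E "^(CPU )?[Ss]ocket" | tr " " "\n" | tail -n 1)
CORES_PER_SOCKET=$(lscpu | grep "Core(s) per socket" | awk '{print $4}')
CORES=$((SOCKETS * CORES_PER_SOCKET))
/usr/local/conf/IPF_CFG/configure_pdgs.sh "$center" "$CORES"

# 2. lift the stack size limit; child processes inherit it
ulimit -s unlimited

# 3. run the pre-processor as piccontrol (exit code 128 is nominal per the ICD)
su piccontrol -c "/usr/local/components/S1IPF/bin/PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml"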

Woljtek commented 1 year ago

I successfully tested executing PSC_PreprocMain with only step 3. I therefore proposed as a workaround to start the IPF with the user piccontrol, i.e. to replace rsuser with piccontrol in the S1-L1 and S1-L2 Dockerfiles (screenshot attached).

Woljtek commented 1 year ago

The patch from PR #25 did not work => the worker is still started by root in /app:

2023-02-13T14:48:59.815 | INFO  | e.s.c.i.e.w.Application [main]: Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k with PID 7 (/app/rs-execution-worker.jar started by root in /app)

We need to force the execution of /app/start.sh with the user piccontrol.
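A hypothetical Dockerfile excerpt of that change (the surrounding directives are assumptions; only the user switch is what this thread proposes):

# run the execution worker as piccontrol instead of root
USER piccontrol
WORKDIR /app
ENTRYPOINT ["/app/start.sh"]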

Test on the execution worker:

[piccontrol@s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k /app]$ ls -l
total 88152
-rw-r--r-- 1 piccontrol pic_run       68 Feb 13 14:10 VERSION
-rw-r--r-- 1 root       root      324839 Feb 13 14:52 logfile.log
-rw-r--r-- 1 root       root           0 Feb 13 14:48 report.json
-rw-r--r-- 1 root       root    89928435 Feb  8 09:19 rs-execution-worker.jar
-rwxr-xr-x 1 root       root         343 Feb  8 09:17 start.sh
[piccontrol@s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k /app]$ ./start.sh 

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v2.6.6)

2023-02-13 14:53:42.398  INFO 227 --- [           main] e.s.c.i.e.w.Application                  : Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k with PID 227 (/app/rs-execution-worker.jar started by piccontrol in /app)
2023-02-13 14:53:42.407  INFO 227 --- [           main] e.s.c.i.e.w.Application                  : No active profile set, falling back to 1 default profile: "default"
2023-02-13 14:53:43.440  INFO 227 --- [           main] faultConfiguringBeanFactoryPostProcessor : No bean named 'errorChannel' has been explicitly defined. Therefore, a default PublishSubscribeChannel will be created.
2023-02-13 14:53:43.455  INFO 227 --- [           main] faultConfiguringBeanFactoryPostProcessor : No bean named 'integrationHeaderChannelRegistry' has been explicitly defined. Therefore, a default DefaultHeaderChannelRegistry will be created.
2023-02-13 14:53:43.686  INFO 227 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.integration.config.IntegrationManagementConfiguration' of type [org.springframework.integration.config.IntegrationManagementConfiguration] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2023-02-13 14:53:43.694  INFO 227 --- [           main] trationDelegate$BeanPostProcessorChecker : Bean 'integrationChannelResolver' of type [org.springframework.integration.support.channel.BeanFactoryChannelResolver] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2023-02-13 14:53:43.821  INFO 227 --- [           main] o.e.j.u.log                              : Logging initialized @2778ms to org.eclipse.jetty.util.log.Slf4jLog
2023-02-13 14:53:43.981  INFO 227 --- [           main] o.s.b.w.e.j.JettyServletWebServerFactory : Server initialized with port: 8080
2023-02-13 14:53:43.983  INFO 227 --- [           main] o.e.j.s.Server                           : jetty-9.4.45.v20220203; built: 2022-02-03T09:14:34.105Z; git: 4a0c91c0be53805e3fcffdcdcc9587d5301863db; jvm 11.0.18+10-LTS
2023-02-13 14:53:44.013  INFO 227 --- [           main] o.e.j.s.h.C.application                  : Initializing Spring embedded WebApplicationContext
2023-02-13 14:53:44.013  INFO 227 --- [           main] w.s.c.ServletWebServerApplicationContext : Root WebApplicationContext: initialization completed in 1555 ms
2023-02-13 14:53:44.253  INFO 227 --- [           main] o.e.j.s.session                          : DefaultSessionIdManager workerName=node0
2023-02-13 14:53:44.254  INFO 227 --- [           main] o.e.j.s.session                          : No SessionScavenger set, using defaults
2023-02-13 14:53:44.255  INFO 227 --- [           main] o.e.j.s.session                          : node0 Scavenging every 600000ms
2023-02-13 14:53:44.264  INFO 227 --- [           main] o.e.j.s.h.ContextHandler                 : Started o.s.b.w.e.j.JettyEmbeddedWebAppContext@7eb01b12{application,/,[file:///tmp/jetty-docbase.8080.500386532811423305/],AVAILABLE}
2023-02-13 14:53:44.265  INFO 227 --- [           main] o.e.j.s.Server                           : Started @3222ms
2023-02-13 14:53:44.304  WARN 227 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'executionWorkerServiceConfiguration': Unsatisfied dependency expressed through field 'obsClient'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'newObsClient' defined in class path resource [esa/s1pdgs/cpoc/obs_sdk/ObsConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [esa.s1pdgs.cpoc.obs_sdk.ObsClient]: Factory method 'newObsClient' threw exception; nested exception is java.lang.IllegalArgumentException: Access key cannot be null.
2023-02-13 14:53:44.309  INFO 227 --- [           main] o.e.j.s.session                          : node0 Stopped scavenging
2023-02-13 14:53:44.312  INFO 227 --- [           main] o.e.j.s.h.ContextHandler                 : Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@7eb01b12{application,/,[file:///tmp/jetty-docbase.8080.500386532811423305/],STOPPED}
2023-02-13 14:53:44.325  INFO 227 --- [           main] ConditionEvaluationReportLoggingListener : 

Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2023-02-13 14:53:44.350 ERROR 227 --- [           main] o.s.b.SpringApplication                  : Application run failed

org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'executionWorkerServiceConfiguration': Unsatisfied dependency expressed through field 'obsClient'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'newObsClient' defined in class path resource [esa/s1pdgs/cpoc/obs_sdk/ObsConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [esa.s1pdgs.cpoc.obs_sdk.ObsClient]: Factory method 'newObsClient' threw exception; nested exception is java.lang.IllegalArgumentException: Access key cannot be null.
    at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor$AutowiredFieldElement.resolveFieldValue(AutowiredAnnotationBeanPostProcessor.java:659) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor$AutowiredFieldElement.inject(AutowiredAnnotationBeanPostProcessor.java:639) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.annotation.InjectionMetadata.inject(InjectionMetadata.java:119) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor.postProcessProperties(AutowiredAnnotationBeanPostProcessor.java:399) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1431) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:619) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335) ~[spring-beans-5.3.18.jar!/:5.3.18]
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234) ~[spring-beans-5.3.18.jar!/:5.3.18]
Woljtek commented 1 year ago

The last PR successfully works around the segfault error.

2023-02-14T09:45:00.898 | INFO  | e.s.c.i.e.w.Application [main]: Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v9-5c6456bbd-45j54 with PID 8 (/app/rs-execution-worker.jar started by piccontrol in /app)
2023-02-14T09:45:00.907 | INFO  | e.s.c.i.e.w.Application [main]: No active profile set, falling back to 1 default profile: "default"
2023-02-14T09:45:02.886 | INFO  | e.s.c.o.s.S.Factory [main]: Disable chunked encoding: false
2023-02-14T09:45:03.176 | INFO  | e.s.c.o.s.S.Factory [main]: created transferManager with minimumUploadPartSize: 104857600 multipartUploadThreshold: 3221225472
2023-02-14T09:45:03.179 | INFO  | e.s.c.o.s.S.Factory [main]: created s3ObsServices with maxRetries: 10 retriesDelay: 500 uploadCacheLocation: /tmp

But there are Java ObsEmptyFileException errors in the logs; it might be another issue:

2023-02-14T09:46:27.478 | ERROR | o.s.i.h.LoggingHandler [KafkaConsumerDestination{consumerDestinationName='s1-l1-part3.priority-filter-low', partitions=4, dlqName='error-warning'}.container-0-C-1]: org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.cloud.stream.function.FunctionConfiguration$FunctionToDestinationBinder$1@63ad1452]; nested exception is java.lang.RuntimeException: esa.s1pdgs.cpoc.obs_sdk.ObsEmptyFileException: Empty file detected: 52172, failedMessage=GenericMessage [payload=byte[29572], headers={deliveryAttempt=3, kafka_timestampType=CREATE_TIME, kafka_receivedTopic=s1-l1-part3.priority-filter-low, target-protocol=kafka, b3=2ba8ec2b925034b2-4359e3144ab15bf8-0, nativeHeaders={b3=[2ba8ec2b925034b2-4359e3144ab15bf8-0]}, kafka_offset=576, scst_nativeHeadersPresent=true, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer@4e665df, kafka_receivedPartitionId=2, contentType=application/json, kafka_receivedTimestamp=1676300304658, kafka_groupId=s1-l1-part3}]
        at org.springframework.integration.support.utils.IntegrationUtils.wrapInHandlingExceptionIfNecessary(IntegrationUtils.java:191)
        at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:65)
        at org.springframework.integration.dispatcher.AbstractDispatcher.tryOptimizedDispatch(AbstractDispatcher.java:115)
        at org.springframework.integration.dispatcher.UnicastingDispatcher.doDispatch(UnicastingDispatcher.java:133)
        ...
Caused by: java.lang.RuntimeException: esa.s1pdgs.cpoc.obs_sdk.ObsEmptyFileException: Empty file detected: 52172
        at esa.s1pdgs.cpoc.ipf.execution.worker.service.ExecutionWorkerService.apply(ExecutionWorkerService.java:253)
        at esa.s1pdgs.cpoc.ipf.execution.worker.service.ExecutionWorkerService.apply(ExecutionWorkerService.java:87)
    ...

I am going to open another issue.

LAQU156 commented 1 year ago

IVV_CCB_2023_w07 : Accepted Werum, Priority minor (workaround), needs to be fixed by automating the workaround

LAQU156 commented 1 year ago

Werum_CCB_2023_w07 : Product Backlog, Deeper analysis needed for next CCB

Woljtek commented 1 year ago

@w-fsi This issue is lost!

Indeed, it was moved to Done but never delivered.

I think the root cause is the PR "S1-L1: Change podSecurityContext to use piccontrol", reviewed but not merged. Because the workaround is deployed on OPS, we do not need this fix quickly.

w-fsi commented 1 year ago

PR was merged into delivery for 1.12.1

pcuq-ads commented 1 year ago

SYS_CCB_2023_w13 : the ticket can be closed.