Closed: Woljtek closed this issue 1 year ago
This error is likely related to the image itself. When executing the binary directly, the segfault occurs as well. However, all dynamic libraries referenced by the binary seem to be resolvable by the system. The L2 images are affected by this issue as well.
I find it suspicious that there is a symlink from lib64 to libs whose target does not seem to exist. The RPMs do not seem to provide such a folder. Whether these IPFs ship without any libraries is beyond my knowledge; at least the ASP seems to ship a wide set of libraries.
We likely need to extend the Docker image with some tools to investigate further. As it was running before and the simulator seems to work fine, it is unlikely that the JobOrder contains something that causes this error. It is more likely the installation of the IPF within the image that is causing the issue.
The segfault error is caused by the missing lib directory. The directory lib should have been deployed by the L2 rpm during the build of the image. However, the installation of the rpm was skipped due to an rpm dependency issue. So the image s1_l12:3.5.2-demless cannot work.
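To confirm a skipped rpm installation like this inside a running container, a few sanity checks could be run. This is only a sketch; the package name prefix and the installation paths are assumptions based on the rpm file names and commands quoted elsewhere in this thread:

```shell
# Hypothetical sanity checks inside the container (paths/names assumed, not verified):
# 1. Was the L2 rpm actually installed?
rpm -qa 2>/dev/null | grep -i "S1PD-IPF-L2" || echo "L2 rpm not installed"
# 2. Does the lib directory exist?
ls -ld /usr/local/components/S1IPF/lib 2>/dev/null || echo "lib directory missing"
# 3. Are all shared libraries of the binary resolvable?
ldd /usr/local/components/S1IPF/bin/PSC_PreprocMain 2>/dev/null | grep "not found" \
  || echo "no unresolved libs reported (or binary absent)"
```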
We are looking for another way to work around the size of the S1-L12 SDP (removing DEM & LUT after rpm installation).
IVV_CCB_2023_w06 : Under Analysis, waiting for an efficient workaround, Priority blocking
Hi @w-fsi ,
We created a new docker image without DEM and other external data (See PR on rs-config here) We have got exactly the same behavior.
Can you confirm that the following points are considered when building the RS add-on? This is extracted from the README of the Docker image (README_dockerfile.rst):
docker info | grep "Docker Root Dir"
Copy the Dockerfile_ipf00352 next to the rpm files.
for 3.5.2:
S1PD-IPF-DEM-AUX-2.6.1-00.01.x86_64.rpm
S1PD-IPF-LUT_3.4.0-3.4.0-00.01.x86_64.rpm
S1PD-IPF-L1_3.5.1-3.5.1-00.01.x86_64.rpm
S1PD-IPF-L2_3.5.2-3.5.2-00.01.x86_64.rpm
S1PD-IPF-TT_3.5.2-3.5.2-00.01.x86_64.rpm
S1PD-IPF-MAIN-CFG_3.5.2-3.5.2-00.01.x86_64.rpm
then run the build command
docker build -f Dockerfile_ipf00352 --force-rm -t ipf_v00352 .
IMPORTANT:
After the build, the image will have the configuration for the build machine. This configuration must be tuned to the machine actually running the IPF. The point is to update the number of cores to be used by the IPF.
For the ADS / PDGS hosting environment, a configuration script / helper is provided. It uses the name of the machine to derive the number of cores to be used.
For other hosting environments, similar tuning must be done. The configuration script for the ADS / PDGS environment can be used as an example.
# this should be done on the PDGS machine once
# run the configuration script
$ docker run --name ipf_cfg --network host -ti ipf_v00352 bash
[piccontrol@xxxxxxxxxxx ~]$ LANG=C
[piccontrol@xxxxxxxxxxx ~]$ h=$(hostname)
[piccontrol@xxxxxxxxxxx ~]$ center=${h:3:4}
[piccontrol@xxxxxxxxxxx ~]$ # Get the number of sockets
[piccontrol@xxxxxxxxxxx ~]$ SOCKETS=$(lscpu | grep -E "^(CPU )?[Ss]ocket" | tr " " "\n" | tail -n 1)
[piccontrol@xxxxxxxxxxx ~]$ # Get the number of real cores per socket
[piccontrol@xxxxxxxxxxx ~]$ CORES_PER_SOCKET=$(lscpu | grep "Core(s) per socket" | awk '{print $4}')
[piccontrol@xxxxxxxxxxx ~]$ # Compute the total number of cores
[piccontrol@xxxxxxxxxxx ~]$ CORES=`expr $SOCKETS \* $CORES_PER_SOCKET`
[piccontrol@xxxxxxxxxxx ~]$ sudo /usr/local/conf/IPF_CFG/configure_pdgs.sh $center $CORES
[piccontrol@xxxxxxxxxxx ~]$ exit
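As a side note outside the quoted README: on hosts where the distinction between physical and logical cores does not matter, the same core count could be approximated with `nproc`. This is only a sketch, not part of the official procedure:

```shell
# Sketch: approximate the core count with nproc (counts logical CPUs,
# whereas the README's script derives physical cores from lscpu)
CORES=$(nproc)
echo "using ${CORES} cores"
```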
Please note that, to change the processing center name and location written in the products, the
processing_configuration.xml file should be modified accordingly.
# commit the configuration in the final docker image
#
$ docker commit ipf_cfg ipf_v00352
docker run --rm -v /path:/path --network host ipf_v00352 /usr/local/components/S1IPF/bin/LOP_ProcMain /path/JobOrder.XXXXXXXXXXX.xml
Thanks for your feedback
Werum did not integrate the IPF; we take the images as-is, add the execution worker into the image, and apply workarounds if required. The Dockerfile is everything that is executed during the build.
Dear @w-fsi, understood, but it seems that some steps need to be done as part of the RS add-on wrapping, as they depend on the machine where it will be deployed.
Yes, as far as I am aware, in the S1Pro context there was an entry point script from Airbus that was doing this kind of check. I raised this point with Fabien in the meeting two days ago, as I assume that the kernel settings from the README are missing there. This script or these settings need to be contained in the provided base image. All integrative activities need to be performed within this layer.
We have continued our investigation. We finally ran the S1-L1 without the segfault error. We performed three actions on the container:
1. As root, run the PDGS configuration script:
[root@s1-l1-part1-execution-worker-high ~]# LANG=C
[root@s1-l1-part1-execution-worker-high ~]# center=rs
[root@s1-l1-part1-execution-worker-high ~]# SOCKETS=$(lscpu | grep -E "^(CPU )?[Ss]ocket" | tr " " "\n" | tail -n 1)
[root@s1-l1-part1-execution-worker-high ~]# CORES_PER_SOCKET=$(lscpu | grep "Core(s) per socket" | awk '{print $4}')
[root@s1-l1-part1-execution-worker-high ~]# CORES=`expr $SOCKETS \* $CORES_PER_SOCKET`
[root@s1-l1-part1-execution-worker-high ~]# sudo /usr/local/conf/IPF_CFG/configure_pdgs.sh $center $CORES
2. As piccontrol, unlimit the stack size: `[piccontrol@s1-l1-part1-execution-worker-high app]$ unlimit stacksize`
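Note that `unlimit stacksize` is csh/tcsh syntax. If the container shell is bash, a sketch of the equivalent would be:

```shell
# bash equivalent of the csh command `unlimit stacksize`:
# raise the soft stack limit to "unlimited", or failing that, to the hard limit
ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -Hs)"
ulimit -s   # print the resulting soft limit
```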
3. As piccontrol, start the preproc_main:
[piccontrol@s1-l1-part1-execution-worker-high-v9-595f678b6d-ng88z /app]$ PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
[piccontrol@s1-l1-part1-execution-worker-high-v9-595f678b6d-ng88z /app]$ echo $?
128
Error 128 is a nominal error (according to the S1-L1 ICD):
![image.png](https://images.zenhubusercontent.com/618e932533b15808a281c31c/efd34955-dfbd-4696-823d-40ad546923dd)
While trying step 3 as the root user, we reproduced the behavior of the issue:
[piccontrol@s1-l1-part1-execution-worker-high app]$ sudo su
[root@s1-l1-part1-execution-worker-high app]# PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
bash: PSC_PreprocMain: command not found
[root@s1-l1-part1-execution-worker-high app]# /usr/local/components/S1IPF/bin/PSC_PreprocMain /data/localWD/51678/JobOrder.51678.xml
Segmentation fault (core dumped)
The patch from PR #25 did not work: the worker is still started by root in /app.
2023-02-13T14:48:59.815 | INFO | e.s.c.i.e.w.Application [main]: Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k with PID 7 (/app/rs-execution-worker.jar started by root in /app)
We need to force the execution of /app/start.sh with the user piccontrol.
Test on the execution worker:
[piccontrol@s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k /app]$ ls -l
total 88152
-rw-r--r-- 1 piccontrol pic_run 68 Feb 13 14:10 VERSION
-rw-r--r-- 1 root root 324839 Feb 13 14:52 logfile.log
-rw-r--r-- 1 root root 0 Feb 13 14:48 report.json
-rw-r--r-- 1 root root 89928435 Feb 8 09:19 rs-execution-worker.jar
-rwxr-xr-x 1 root root 343 Feb 8 09:17 start.sh
[piccontrol@s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k /app]$ ./start.sh
. ____ _ __ _ _
/\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
\\/ ___)| |_)| | | | | || (_| | ) ) ) )
' |____| .__|_| |_|_| |_\__, | / / / /
=========|_|==============|___/=/_/_/_/
:: Spring Boot :: (v2.6.6)
2023-02-13 14:53:42.398 INFO 227 --- [ main] e.s.c.i.e.w.Application : Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v8-df445b8f8-kkl7k with PID 227 (/app/rs-execution-worker.jar started by piccontrol in /app)
2023-02-13 14:53:42.407 INFO 227 --- [ main] e.s.c.i.e.w.Application : No active profile set, falling back to 1 default profile: "default"
2023-02-13 14:53:43.440 INFO 227 --- [ main] faultConfiguringBeanFactoryPostProcessor : No bean named 'errorChannel' has been explicitly defined. Therefore, a default PublishSubscribeChannel will be created.
2023-02-13 14:53:43.455 INFO 227 --- [ main] faultConfiguringBeanFactoryPostProcessor : No bean named 'integrationHeaderChannelRegistry' has been explicitly defined. Therefore, a default DefaultHeaderChannelRegistry will be created.
2023-02-13 14:53:43.686 INFO 227 --- [ main] trationDelegate$BeanPostProcessorChecker : Bean 'org.springframework.integration.config.IntegrationManagementConfiguration' of type [org.springframework.integration.config.IntegrationManagementConfiguration] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2023-02-13 14:53:43.694 INFO 227 --- [ main] trationDelegate$BeanPostProcessorChecker : Bean 'integrationChannelResolver' of type [org.springframework.integration.support.channel.BeanFactoryChannelResolver] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2023-02-13 14:53:43.821 INFO 227 --- [ main] o.e.j.u.log : Logging initialized @2778ms to org.eclipse.jetty.util.log.Slf4jLog
2023-02-13 14:53:43.981 INFO 227 --- [ main] o.s.b.w.e.j.JettyServletWebServerFactory : Server initialized with port: 8080
2023-02-13 14:53:43.983 INFO 227 --- [ main] o.e.j.s.Server : jetty-9.4.45.v20220203; built: 2022-02-03T09:14:34.105Z; git: 4a0c91c0be53805e3fcffdcdcc9587d5301863db; jvm 11.0.18+10-LTS
2023-02-13 14:53:44.013 INFO 227 --- [ main] o.e.j.s.h.C.application : Initializing Spring embedded WebApplicationContext
2023-02-13 14:53:44.013 INFO 227 --- [ main] w.s.c.ServletWebServerApplicationContext : Root WebApplicationContext: initialization completed in 1555 ms
2023-02-13 14:53:44.253 INFO 227 --- [ main] o.e.j.s.session : DefaultSessionIdManager workerName=node0
2023-02-13 14:53:44.254 INFO 227 --- [ main] o.e.j.s.session : No SessionScavenger set, using defaults
2023-02-13 14:53:44.255 INFO 227 --- [ main] o.e.j.s.session : node0 Scavenging every 600000ms
2023-02-13 14:53:44.264 INFO 227 --- [ main] o.e.j.s.h.ContextHandler : Started o.s.b.w.e.j.JettyEmbeddedWebAppContext@7eb01b12{application,/,[file:///tmp/jetty-docbase.8080.500386532811423305/],AVAILABLE}
2023-02-13 14:53:44.265 INFO 227 --- [ main] o.e.j.s.Server : Started @3222ms
2023-02-13 14:53:44.304 WARN 227 --- [ main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'executionWorkerServiceConfiguration': Unsatisfied dependency expressed through field 'obsClient'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'newObsClient' defined in class path resource [esa/s1pdgs/cpoc/obs_sdk/ObsConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [esa.s1pdgs.cpoc.obs_sdk.ObsClient]: Factory method 'newObsClient' threw exception; nested exception is java.lang.IllegalArgumentException: Access key cannot be null.
2023-02-13 14:53:44.309 INFO 227 --- [ main] o.e.j.s.session : node0 Stopped scavenging
2023-02-13 14:53:44.312 INFO 227 --- [ main] o.e.j.s.h.ContextHandler : Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@7eb01b12{application,/,[file:///tmp/jetty-docbase.8080.500386532811423305/],STOPPED}
2023-02-13 14:53:44.325 INFO 227 --- [ main] ConditionEvaluationReportLoggingListener :
Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2023-02-13 14:53:44.350 ERROR 227 --- [ main] o.s.b.SpringApplication : Application run failed
org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'executionWorkerServiceConfiguration': Unsatisfied dependency expressed through field 'obsClient'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'newObsClient' defined in class path resource [esa/s1pdgs/cpoc/obs_sdk/ObsConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [esa.s1pdgs.cpoc.obs_sdk.ObsClient]: Factory method 'newObsClient' threw exception; nested exception is java.lang.IllegalArgumentException: Access key cannot be null.
at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor$AutowiredFieldElement.resolveFieldValue(AutowiredAnnotationBeanPostProcessor.java:659) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor$AutowiredFieldElement.inject(AutowiredAnnotationBeanPostProcessor.java:639) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.annotation.InjectionMetadata.inject(InjectionMetadata.java:119) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor.postProcessProperties(AutowiredAnnotationBeanPostProcessor.java:399) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1431) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:619) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335) ~[spring-beans-5.3.18.jar!/:5.3.18]
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234) ~[spring-beans-5.3.18.jar!/:5.3.18]
The last PR successfully works around the segfault error.
2023-02-14T09:45:00.898 | INFO | e.s.c.i.e.w.Application [main]: Starting Application using Java 11.0.18 on s1-l1-part3-execution-worker-low-v9-5c6456bbd-45j54 with PID 8 (/app/rs-execution-worker.jar started by piccontrol in /app)
2023-02-14T09:45:00.907 | INFO | e.s.c.i.e.w.Application [main]: No active profile set, falling back to 1 default profile: "default"
2023-02-14T09:45:02.886 | INFO | e.s.c.o.s.S.Factory [main]: Disable chunked encoding: false
2023-02-14T09:45:03.176 | INFO | e.s.c.o.s.S.Factory [main]: created transferManager with minimumUploadPartSize: 104857600 multipartUploadThreshold: 3221225472
2023-02-14T09:45:03.179 | INFO | e.s.c.o.s.S.Factory [main]: created s3ObsServices with maxRetries: 10 retriesDelay: 500 uploadCacheLocation: /tmp
But there are Java ObsEmptyFileException errors in the logs; it might be another issue:
2023-02-14T09:46:27.478 | ERROR | o.s.i.h.LoggingHandler [KafkaConsumerDestination{consumerDestinationName='s1-l1-part3.priority-filter-low', partitions=4, dlqName='error-warning'}.container-0-C-1]: org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.cloud.stream.function.FunctionConfiguration$FunctionToDestinationBinder$1@63ad1452]; nested exception is java.lang.RuntimeException: esa.s1pdgs.cpoc.obs_sdk.ObsEmptyFileException: Empty file detected: 52172, failedMessage=GenericMessage [payload=byte[29572], headers={deliveryAttempt=3, kafka_timestampType=CREATE_TIME, kafka_receivedTopic=s1-l1-part3.priority-filter-low, target-protocol=kafka, b3=2ba8ec2b925034b2-4359e3144ab15bf8-0, nativeHeaders={b3=[2ba8ec2b925034b2-4359e3144ab15bf8-0]}, kafka_offset=576, scst_nativeHeadersPresent=true, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer@4e665df, kafka_receivedPartitionId=2, contentType=application/json, kafka_receivedTimestamp=1676300304658, kafka_groupId=s1-l1-part3}]
at org.springframework.integration.support.utils.IntegrationUtils.wrapInHandlingExceptionIfNecessary(IntegrationUtils.java:191)
at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:65)
at org.springframework.integration.dispatcher.AbstractDispatcher.tryOptimizedDispatch(AbstractDispatcher.java:115)
at org.springframework.integration.dispatcher.UnicastingDispatcher.doDispatch(UnicastingDispatcher.java:133)
...
Caused by: java.lang.RuntimeException: esa.s1pdgs.cpoc.obs_sdk.ObsEmptyFileException: Empty file detected: 52172
at esa.s1pdgs.cpoc.ipf.execution.worker.service.ExecutionWorkerService.apply(ExecutionWorkerService.java:253)
at esa.s1pdgs.cpoc.ipf.execution.worker.service.ExecutionWorkerService.apply(ExecutionWorkerService.java:87)
...
I am going to open another issue.
IVV_CCB_2023_w07 : Accepted Werum, Priority minor (workaround); needs to be fixed by automating the workaround
Werum_CCB_2023_w07 : Product Backlog, Deeper analysis needed for next CCB
@w-fsi This issue is lost!
Indeed, it was moved to Done but never delivered.
I think the root cause is that the PR "S1-L1: Change podSecurityContext to use piccontrol" was reviewed but not merged. Because the WA is deployed on OPS, we don't need this fix quickly.
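For reference, a podSecurityContext change like the one named in that PR would presumably look similar to the sketch below; the uid/gid values are assumptions for illustration, not taken from the actual PR:

```yaml
# Hypothetical excerpt of the execution worker pod spec:
# run the container as piccontrol instead of root
spec:
  securityContext:
    runAsUser: 1000    # assumed uid of piccontrol
    runAsGroup: 1000   # assumed gid of pic_run
    runAsNonRoot: true
```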
PR was merged into delivery for 1.12.1
SYS_CCB_2023_w13 : the ticket can be closed.
Environment:
Related user stories:
431 Create S3_L1 processor as RS add-on
432 Create S3_L2 processor as RS add-on
Current Behavior: All jobs received by the S1_L1 execution worker end with error 139:
On the node, a log indicates a segmentation fault of PSC_PreprocMain at the same moment:
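Exit code 139 is consistent with a segfault: the shell reports a process killed by a fatal signal as 128 plus the signal number, and SIGSEGV is signal 11. A quick demonstration:

```shell
# A child killed by SIGSEGV is reported by the shell as 128 + 11 = 139
sh -c 'kill -SEGV $$'
echo $?   # prints 139
```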
Expected Behavior: The job shall produce S1_L1 products or generate a functional error (127 & 128 according to S1-L1 ICD MPC-265 v1.12).
Steps To Reproduce: Systematic
Test execution artefacts (i.e. logs, screenshots…)
Extract of the syslog of node-104:
Log EW low S1-L1: https://app.zenhub.com/files/398313496/3ac5b298-bb6b-4f72-a8fe-0362fb4dab78/download
Whenever possible, first analysis of the root cause: These errors do not seem to be due to a lack of resources (RAM or CPU).
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)