COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] Metadata Catalog Extraction 'Error on downloading from OBS' for S3 AUX #610

Closed: suberti-ads closed this issue 1 year ago

suberti-ads commented 1 year ago

Environment:

Current Behavior: For each S3 OPER_AUX_GNSSRD_POD AUX product, we observed the following in the metadata extraction-worker:

To conclude, we observed multiple download retries and errors even though the product had been successfully retrieved to the pod.

Expected Behavior: Metadata shall be extracted from AUX_GNSSRD_POD products. A clear error should be raised in this failure case.

Steps To Reproduce: Ingest S3 AUX_GNSSRD_POD auxiliary files using the RS Core Ingestion (AGS-AUXIP configuration) on the S3 workflow.

Whenever possible, first analysis of the root cause: It seems that DBL files are considered as folders.

Note: This is not a "no space left on device" issue on the pod.

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

Woljtek commented 1 year ago

Hi @w-fsi, could you have a look at this issue?

w-fsi commented 1 year ago

Can you please provide us the full log of the metadata extraction? From the snippets, the previous decisions made by the extraction are not visible. We normally need the full log to have a chance of seeing what the incoming event looks like.

The product type is currently not in use and not known to us. A workaround would be to modify the regexp to exclude it.

The trace message references the bucket "rs-aux", which does not exist in the configuration. This is strange and suggests that the config and the logs do not match.

w-jka commented 1 year ago

The provided configuration for the AUXIP contains an inbox (inbox4) that uses a regexp which does not match Sentinel-3 auxiliary products. The configured inbox5 is enough to catch all Sentinel-3 auxiliary products; we therefore propose to delete inbox4 from the configuration (lines 45-51 in the link provided on the issue).
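
For illustration, a minimal sketch of what this change could look like in the ingestion-trigger deployment properties. The property keys and values below are assumptions following the usual inboxN pattern, not the actual content of the referenced configuration file:

# Illustrative only: key names and values are placeholders, not the real configuration.
# inbox4 (lines 45-51 of the referenced file) would simply be removed:
#app.ingestion-trigger.polling.inbox4.directory=<AUXIP query for S3 auxiliary products>
#app.ingestion-trigger.polling.inbox4.matchRegex=<regexp that does not match S3 auxiliary products>
# inbox5 is kept as-is and already catches all Sentinel-3 auxiliary products:
app.ingestion-trigger.polling.inbox5.directory=<AUXIP query for S3 auxiliary products>
app.ingestion-trigger.polling.inbox5.matchRegex=<regexp matching all S3 auxiliary products>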

wruffine-csgroup commented 1 year ago

@w-jka Thank you for this information. The configuration shall be modified.

@w-fsi To my understanding, the current exclude regex for the AUXIP is set to $a as a workaround for another issue. Can it be changed to exclude these files and still work as a workaround for issue #549?

I may be completely wrong, but could the error "Directory creation fails [...]" be due to the fact that the files are already present?

w-fsi commented 1 year ago

Yes "$a" is a workaround as hopefully there is no string that contains after its end an "a". Basically this could be used for it, but as long as its not understood why this is not working correctly, I would recommend to edit the match pattern instead.

The "Directory creation fails" error is likely caused by something else. We have not analysed it yet, but I think the product is not detected as one directory product (containing two files) but as two separate products, and thus runs into an issue. The recommended workaround is to exclude these products, as they are not used (and maybe not even valid) S3 products. So the recommendation of @w-jka should remove the blocking aspect of this issue.

pcuq-ads commented 1 year ago

Hello, this morning (10/10 at 10:14) the problem remains. Here is one of the 2897 errors from the last 2 days:

Metadata extraction failed: java.lang.RuntimeException: java.lang.RuntimeException: Error: Number of retries has exceeded while performing Download of metadata file S3A_OPER_AUX_GNSSRD_POD__20220912T033337_V20220904T235942_202209

The lag continues to increase (see attached screenshot image.png).

Woljtek commented 1 year ago

Dear @wruffine-csgroup & @vvernet-csgroup,

In the SCDF configuration, it seems that inbox4 is still enabled (see attached screenshot image.png). Source: https://processing.platform.ops-csc.com/dashboard/#/streams/list/ingestion-aux-part1

As proposed by @w-jka, could you redeploy the stream without inbox4 in order to check whether the root cause of the bug is the ingestion of OPER_AUX_GNSSRD_POD aux files?

w-fsi commented 1 year ago

Hi @pcuq-ads, if you apply the proposed workaround, the invalid product should no longer be ingested into the system. However, did you also clean the topics on Friday? If the requests are still in there, the MDC will keep consuming them, as they are never processed successfully, and this might explain your observation.

Woljtek commented 1 year ago

The workaround proposed by @w-jka is not sufficient. Indeed, there are still blocking messages in the lag.

Below, we can see that the extraction is blocked at a current offset of 117 (see attached screenshot image.png).

I restarted (undeployed/redeployed) the stream yesterday.

I observed that the metadata-catalog-part1.metadata-filter topic is not connected to its expected consumer, the extraction-worker. During the restart, this error occurred:

{"header":{"type":"LOG","timestamp":"2022-10-10T16:24:30.279075Z","level":"ERROR","line":250,"file":"LogAccessor.java","thread":"KafkaConsumerDestination{consumerDestinationName='metadata-catalog-part1.metadata-filter', partitions=1, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.cloud.stream.function.FunctionConfiguration$FunctionToDestinationBinder$1@1cdcda59]; nested exception is java.lang.NullPointerException: Name is null, failedMessage=GenericMessage [payload=byte[951], headers={deliveryAttempt=3, kafka_timestampType=CREATE_TIME, kafka_receivedTopic=metadata-catalog-part1.metadata-filter, target-protocol=kafka, b3=616d82dc121dabf3-3b38edc97a2ad8a4-0, nativeHeaders={b3=[616d82dc121dabf3-3b38edc97a2ad8a4-0]}, kafka_offset=117, scst_nativeHeadersPresent=true, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer@398080d8, kafka_receivedPartitionId=0, contentType=application/json, kafka_receivedTimestamp=1664961748834, kafka_groupId=metadata-catalog-part1}]\n\tat org.springframework.integration.support.utils.IntegrationUtils.wrapInHandlingExceptionIfNecessary(IntegrationUtils.java:191)\n\tat org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:65)\n\tat org.springframework.integration.dispatcher.AbstractDispatcher.tryOptimizedDispatch(AbstractDispatcher.java:115)\n\tat org.springframework.integration.dispatcher.UnicastingDispatcher.doDispatch(UnicastingDispatcher.java:133)\n\tat org.springframework.integration.dispatcher.UnicastingDispatcher.dispatch(UnicastingDispatcher.java:106)\n\tat org.springframework.integration.channel.AbstractSubscribableChannel.doSend(AbstractSubscribableChannel.java:72)\n\tat org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:317)\n\tat org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:272)
[...]

@praynaud-ads:

w-fsi commented 1 year ago

Can you please provide us the names of the products that are causing the ongoing issues? The original GPOD product should no longer be ingested.

praynaud-ads commented 1 year ago

Logs of the metadata extraction: logs_metadata_extraction.txt

w-fsi commented 1 year ago

@praynaud-ads: please verify whether the error-warning and DLQ logs contain any occurrences of affected products like "S3A_OPER_AUX_GNSSRD_POD__20220914T033340_V20220906T235942_20220907T235941". We assume that the DLQ is restarting them, as it is configured to restart this kind of issue:

app.dlq-manager.dlq-manager.routing.obs4.errorTitle=Generic OBS issue
app.dlq-manager.dlq-manager.routing.obs4.errorID=.*ObsServiceException.*
app.dlq-manager.dlq-manager.routing.obs4.actionType=Restart
#app.dlq-manager.dlq-manager.routing.obs4.targetTopic=
app.dlq-manager.dlq-manager.routing.obs4.maxRetry=2
app.dlq-manager.dlq-manager.routing.obs4.priority=100

So you will likely also find this product multiple times on the Kafka topic?

praynaud-ads commented 1 year ago

Indeed, the DLQ Manager logs contain some products like "S3A_OPER_AUX_GNSSRD_POD...".

You can see the logs below: logs_dlq_manager.txt

pcuq-ads commented 1 year ago

Following a discussion with CS & WERUML, we need to update the configuration with the following changes:

w-jka commented 1 year ago

Please note that the Kafka timeout is set to 5 minutes by default. The more important configuration parameter is max.poll.records. The batch size of 500 is too big and results in the consumer getting kicked out of the consumer group, especially when an error happens during the download of the manifest file (which is the case for the erroneous messages of the S3A_OPER... product).

As it was observed that a failing download takes roughly 1 minute per message, the batch size should be at most 4. This should ensure that the new Kafka offset is communicated to the Kafka broker in time and that the consumer is not kicked out of the consumer group.
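
A minimal sketch of how such a limit could be expressed as Spring Cloud Stream Kafka consumer properties for the affected application; the app name and binding name below are assumptions, while max.poll.records (default 500) and max.poll.interval.ms (default 300000 ms, i.e. the 5 minutes mentioned above) are standard Kafka consumer properties:

# Illustrative only: "extraction-worker" and the "input" binding name are placeholders.
# Limit each poll so the batch is processed and the offset committed well within max.poll.interval.ms.
app.extraction-worker.spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=4
# The poll interval itself can stay at its default of 5 minutes:
#app.extraction-worker.spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.interval.ms=300000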

pcuq-ads commented 1 year ago

OK, so the first configuration change to be applied is:

We will handle the DLQ in a second step.

LAQU156 commented 1 year ago

IVV_CCB_2022_w41: Closed; the remaining work is covered by #557.