COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, IVV bugs, epics, ...).

[BUG] [TRACE] [CATALOG] The Metadata Extraction catalog does not generate END trace for product S2B_OPER_MSI_L0__GR* #830

Closed pcuq-ads closed 1 year ago

pcuq-ads commented 1 year ago

Environment:

Traceability:

Current Behavior: The Metadata Extraction catalog does not generate an END trace for product S2B_OPER_MSI_L0__GR

Expected Behavior: The Metadata Extraction catalog shall generate BEGIN and END traces for every activity. The impact is that the products do not appear in the Product table and we cannot compute the performance.

Steps To Reproduce: Check the BEGIN and END trace counts over time for task.name=MetadataExtraction.
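For reference, a minimal sketch of such a count against an Elasticsearch trace index, assuming the field names visible in the REPORT messages later in this thread (task.name, task.event, header.mission); the endpoint URL and index pattern are placeholders:

```python
import requests

ES_URL = "https://elasticsearch.example.local"  # placeholder endpoint
TRACE_INDEX = "trace-*"                         # assumed trace index pattern

# Count BEGIN vs END events of the MetadataExtraction task, per mission.
query = {
    "size": 0,
    "query": {"term": {"task.name.keyword": "MetadataExtraction"}},
    "aggs": {
        "per_mission": {
            "terms": {"field": "header.mission.keyword"},
            "aggs": {"per_event": {"terms": {"field": "task.event.keyword"}}},
        }
    },
}

resp = requests.post(f"{ES_URL}/{TRACE_INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for mission in resp.json()["aggregations"]["per_mission"]["buckets"]:
    counts = {b["key"]: b["doc_count"] for b in mission["per_event"]["buckets"]}
    print(mission["key"], "BEGIN:", counts.get("BEGIN", 0), "END:", counts.get("END", 0))
```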

Test execution artefacts (e.g. logs, screenshots…): If we count traces over time, we see that the BEGIN/END counts diverge during S2 production. image.png

We do not see END traces over these periods for S2 (even with status NOK). image.png

If we filter on missions S1 and S3, everything is OK (phew!).

Whenever possible, first analysis of the root cause

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

w-fsi commented 1 year ago

@pcuq-ads : Just to be sure: are you referring here to the message "END metadata extraction"? I just checked the code and there is no specific handling for S2, so the behaviour shall be the same across missions. Did you check whether an error is raised for the affected products? The only option I see for this scenario is that some unexpected exception occurred during the metadata extraction itself.

In that case, however, I need a log file from the extraction itself to check why it occurred. Is this affecting just this product type, or other S2 products as well?

Be aware that today's delivery contains a version that changes the way S2 metadata is extracted. It might be that the issue resolves itself with it, but I would still like to understand what exactly went wrong during the extraction.

pcuq-ads commented 1 year ago

@w-fsi: I checked the issue again and I confirm it. I do not filter on status OK or NOK.

If we count BEGIN and END events for the service, we see a huge discrepancy for Sentinel-2 products stored in the catalog. image.png

The same graph on Monday (dedicated to S1 production). image.png

The same graph on Wednesday (dedicated to Sentinel-3). image.png There is a gap at 00:00; I can check, but it is most likely due to remaining S2 activities from the day before.

pcuq-ads commented 1 year ago

The impact is that we have no information in the database about S2 production. This is blocking for PI computation.

We only see errors about SADATA, around 60 over the same period. This issue deals with 15K errors over the same period. image.png

pcuq-ads commented 1 year ago

I checked the logs with LOKI: {pod="metadata-catalog-part1-metadata-extraction-v7-547885959b-bpmzq"} |~ "ERROR|WARN|WARNING" != "SADATA". If I remove the SADATA issues (60), there are no ERRORs remaining on the Sentinel-2 days, only warnings. Here: https://app.zenhub.com/files/398313496/1b8554b9-d5cd-4ad3-8543-2185fb28b846/download
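For reference, a hedged sketch of pulling the same filtered lines through Loki's HTTP query_range API (the Loki URL is a placeholder; the label selector is the one used above):

```python
import requests

LOKI_URL = "https://loki.example.local"  # placeholder; use the cluster's Loki endpoint

params = {
    # Same filter as above: keep ERROR/WARN/WARNING lines, drop the known SADATA errors.
    "query": '{pod="metadata-catalog-part1-metadata-extraction-v7-547885959b-bpmzq"}'
             ' |~ "ERROR|WARN|WARNING" != "SADATA"',
    "start": "2023-02-14T00:00:00Z",
    "end": "2023-02-15T00:00:00Z",
    "limit": 5000,
}

resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=60)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(ts, line)
```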

Again, the catalog service starts with a BEGIN but does not provide any END trace for 15K messages on the day with Level-0 S2 production.

w-fsi commented 1 year ago

Hm, I am surprised that there are no error traces.

When checking the log, I see a warning that looks like what I was searching for:

2023-02-14T19:09:38Z    {"header":{"type":"LOG","timestamp":"2023-02-14T19:09:38.244165Z","level":"WARN","line":36,"file":"Retries.java","thread":"KafkaConsumerDestination{consumerDestinationName='metadata-catalog-part1.metadata-filter', partitions=5, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Error on performing Create metadata {\"productFamily\":\"S2_L0_GR\",\"creationTime\":\"2023-02-14T18:38:33.000000Z\",\"insertionTime\":\"2023-02-14T19:09:28.227500Z\",\"instrumentShortName\":\"L0_\",\"productGroupId\":\"S2A_OPER_MSI_L0__DS_REFS_20230214T183833_S20230213T233437_N02.08\",\"productName\":\"S2A_OPER_MSI_L0__GR_REFS_20230214T183833_S20230213T233451_D09_N02.08\",\"orbitNumber\":\"\",\"qualityInfo\":0,\"platformShortName\":\"SENTINEL-2\",\"qualityStatus\":\"NOMINAL\",\"startTime\":\"2023-02-13T23:34:51.894000Z\",\"stopTime\":\"2023-02-13T23:34:51.894000Z\",\"productType\":\"MSI_L0__GR\",\"productClass\":\"OPER\",\"missionId\":\"S2\",\"satelliteId\":\"A\",\"relativeOrbitNumber\":\"\",\"coordinates\":{\"orientation\":\"counterclockwise\",\"coordinates\":[[[159.357627802118,-29.9518075271036],[159.260732838059,-30.3161089778449],[159.521680254522,-30.373028330575],[159.616816999806,-30.0078386285303],[159.357627802118,-29.9518075271036]]],\"type\":\"polygon\"},\"processorVersion\":\"\",\"t0PdgsDate\":\"2023-02-14T00:17:00.000000Z\",\"url\":\"S2A_OPER_MSI_L0__GR_REFS_20230214T183833_S20230213T233451_D09_N02.08\",\"platformSerialIdentifier\":\"S2A\",\"operationalMode\":\"\",\"processingCenter\":\"RS\",\"timeliness\":\"S2_L0\",\"processMode\":\"NOMINAL\"} (1/2), retrying in 1000ms"},"custom":{"logger_string":"esa.s1pdgs.cpoc.common.utils.Retries"}}

The reason for the exception is not clear to me at the moment, but if an exception is thrown, I would expect it to result in an ERROR trace.

w-jka commented 1 year ago

@pcuq-ads Could you please provide an unfiltered extract from the log around 2023-02-14T19:09:38Z? We are trying to make sense of what the service is running into, as the problem occurs while saving the metadata for the S2_L0_GR in the ES. As the logs are already a week old, we have no other way to take a look at them.

w-fsi commented 1 year ago

@pcuq-ads Never claimed that. But the log is the only hint we have of what actually happened. It seems like some communication issue with the ES server occurred (best guess: missing indices?). However, when looking at the current errors in the extraction (missing SAD index), a report error is generated.

pcuq-ads commented 1 year ago

Here are the logs requested: https://app.zenhub.com/files/398313496/8ece293c-7af7-483c-9588-2a6251fa8e08/download

w-jka commented 1 year ago

@pcuq-ads Thank you for the quick response. I checked the logs, and the WARN we found earlier does not seem to be an issue, as the metadata could be generated as expected. In fact, the log contains 102 BEGIN and 103 END traces, so everything seems to be in line here.

From your earlier graph of the production on Tuesday, the time frame around 04:00 seems interesting, as there is a big discrepancy between BEGIN and END traces. Would it be possible to provide an unfiltered log from that time frame, so we can take a look and see if we find a clue there?

Regarding the ERROR messages for the S2_SAD products, the root cause is a missing ES index for the product family. To fix this, you need to create the index s2_sad.
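If it helps, a minimal sketch of creating that index over the ES REST API (the endpoint is a placeholder, and the real deployment probably needs the project's standard settings/mapping rather than defaults):

```python
import requests

ES_URL = "https://elasticsearch.example.local"  # placeholder endpoint

# Create the missing index for the S2_SAD product family; adapt settings/mappings
# to whatever the other product-family indices use in the deployment.
resp = requests.put(f"{ES_URL}/s2_sad", json={"settings": {"number_of_shards": 1}}, timeout=30)
print(resp.status_code, resp.json())
```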

LAQU156 commented 1 year ago

IVV_CCB_2023_w08 : Moved into "Accepted Werum", First analysis done by Werum team, Priority blocking

pcuq-ads commented 1 year ago

@w-jka : The bug about SADATA is #829. I am extracting pieces of log from LOKI and I could not find the discrepancies, which is strange. I will try either:
1) to find discrepancies in the LOG => issue on the Metadata Catalog
2) to find a UID with an END trace in LOKI but not in the TRACEs => issue on the INFRA

I will do the check and come back to you.
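A rough sketch of check 2), assuming placeholder Loki/Elasticsearch endpoints, an assumed Loki label (container_name) and an assumed trace index pattern; it lists the END task UIDs seen in the pod logs but absent from the trace index:

```python
import json
import requests

LOKI_URL = "https://loki.example.local"         # placeholder
ES_URL = "https://elasticsearch.example.local"  # placeholder
TRACE_INDEX = "trace-*"                         # assumed trace index pattern

def loki_end_uids(start, end):
    """UIDs of 'End metadata extraction' reports found in the pod logs via Loki."""
    params = {
        # label name/value are assumptions; adapt to the cluster's Loki labels
        "query": '{container_name="metadata-catalog-part1-metadata-extraction-v7"}'
                 ' |= "End metadata extraction"',
        "start": start, "end": end, "limit": 5000,
    }
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=60)
    resp.raise_for_status()
    uids = set()
    for stream in resp.json()["data"]["result"]:
        for _, line in stream["values"]:
            idx = line.find("{")  # the collector may prepend a timestamp
            if idx < 0:
                continue
            record = json.loads(line[idx:])
            if record.get("task", {}).get("event") == "END":
                uids.add(record["task"]["uid"])
    return uids

def es_end_uids(start, end):
    """UIDs of END MetadataExtraction traces actually stored in the trace index."""
    query = {
        "size": 10000,
        "_source": ["task.uid"],
        "query": {"bool": {"filter": [
            {"term": {"task.name.keyword": "MetadataExtraction"}},
            {"term": {"task.event.keyword": "END"}},
            {"range": {"header.timestamp": {"gte": start, "lte": end}}},
        ]}},
    }
    resp = requests.post(f"{ES_URL}/{TRACE_INDEX}/_search", json=query, timeout=60)
    resp.raise_for_status()
    return {hit["_source"]["task"]["uid"] for hit in resp.json()["hits"]["hits"]}

start, end = "2023-02-14T20:00:00Z", "2023-02-14T22:30:00Z"
print(sorted(loki_end_uids(start, end) - es_end_uids(start, end)))
```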

pcuq-ads commented 1 year ago

I checked the period from 14/02/2023 @20:00:00 to 14/02/2023 @22:30:00. The following "END" tasks are present in the LOG but not in the TRACEs:
b056cda0-d023-4b03-9c18-593b84257505
d09d7bb3-4166-472a-a1e8-88042fef02c2
0e90c72d-433f-4d21-8623-15d7e537b07d
4d25b259-e8fe-43b0-8107-b570f5b80954
9598a121-6ec7-4d36-98b0-959531291218
e9e047e7-5f13-4c3a-81bb-4802d505e841
3937dcef-1fd4-4cbf-967c-9fa7f98b77ef
86d9acba-7955-4759-83e4-470d953aacfa
bfb06e30-c6da-4bb3-8fc8-9022c97607eb
df17708b-585b-4d81-8167-7cb13975fbbf
5e4cfd93-1866-4761-a801-3010f4d3023e
ca0585ec-83b7-4ac1-a121-fdfd3ea793a1
d9cb705e-d8e3-4933-a433-dea6a3c08278

etc...

Here are the logs (REPORT) on the LOKI side:

{"header":{"type":"REPORT","timestamp":"2023-02-14T21:17:04.471000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"b056cda0-d023-4b03-9c18-593b84257505","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:23:14.165000Z","product_sensing_stop_date":"2023-02-14T00:23:14.165000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:23:14.165000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:23:14.165000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002314_D06_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.01,"missing_output":[]}}

{"header":{"type":"REPORT","timestamp":"2023-02-14T21:16:51.303000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"d09d7bb3-4166-472a-a1e8-88042fef02c2","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:23:10.557000Z","product_sensing_stop_date":"2023-02-14T00:23:10.557000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:23:10.557000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:23:10.557000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002310_D05_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.012,"missing_output":[]}}

{"header":{"type":"REPORT","timestamp":"2023-02-14T21:16:39.102000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"e66c7204-8974-4a8b-8b3e-e581b6bc96ef","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:21:54.794000Z","product_sensing_stop_date":"2023-02-14T00:21:54.794000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:21:54.794000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:21:54.794000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002154_D06_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.022,"missing_output":[]}}

So this is an infrastructure issue.

LAQU156 commented 1 year ago

IVV_CCB_2023_w08 [EDIT] : Moved into "Accepted CS"

LAQU156 commented 1 year ago

CS_CCB_2023_w08 : Moved into "Product Backlog", highest priority

pcuq-ads commented 1 year ago

Some context:

    "host": "cluster-ops-node-97",
    "container_name": "metadata-catalog-part1-metadata-extraction-v7",
    "docker_id": "e676a70b3f1b4da6ed728146618a72972f312278e498cfdbcdf30556cb2f087c",
    "container_hash": "artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction@sha256:25fc33d66fb4ae597ba76ee8b618fb17dae7968f05ab9d6dc54a46a41bd39bc2",
    "container_image": "artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction:1.8.0-rc

nleconte-csgroup commented 1 year ago

The END trace is missing in the Elasticsearch because of wrong mapping for an integer field. The field task.output.product_metadata_custom_object.quality_status_integer contains "NOMINAL" : 2023-02-22 16:53:36 +0000 [warn]: #0 fluent/log.rb:351:warn: dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch [error type]: mapper_parsing_exception [reason]: 'failed to parse field [task.output.product_metadata_custom_object.quality_status_integer] of type [integer] in document with id 't40LeoYBrZeZNgw8b8kR'. Preview of field's value: 'NOMINAL''" location=nil tag="fluentbit.trace" time=2023-02-22 16:52:44.314186590 +0000 record={"header"=>{"type"=>"REPORT", "timestamp"=>"2023-02-15T16:31:29.467000Z", "level"=>"INFO", "mission"=>"S2", "workflow"=>"NOMINAL", "rs_chain_name"=>"METADATA", "rs_chain_version"=>"1.10.0-rc1"}, "message"=>{"content"=>"End metadata extraction"}, "task"=>{"uid"=>"7c140b82-0dc7-4b0a-bdd6-0f095149c04c", "name"=>"MetadataExtraction", "event"=>"END", "status"=>"OK", "output"=>{"mission_identifier_string"=>"S2", "type_string"=>"S2_L0_GR", "product_sensing_start_date"=>"2022-10-11T06:25:44.474000Z", "product_sensing_stop_date"=>"2022-10-11T06:25:44.474000Z", "timeliness_name_string"=>"S2_L0", "timeliness_value_seconds_integer"=>4500, "product_metadata_custom_object"=>{"product_group_id"=>"S2B_OPER_MSI_L0__DS_REFS_20230215T152221_S20221011T062334_N04.00", "processing_level_integer"=>0, "quality_status_integer"=>"NOMINAL", "orbit_number_integer"=>"", "processor_version_string"=>"", "product_type_string"=>"MSI_L0__GR", "platform_serial_identifier_string"=>"B", "ending_date_time_date"=>"2022-10-11T06:25:44.474000Z", "instrument_short_name_string"=>"MSI", "platform_short_name_string"=>"SENTINEL-2", "beginning_date_time_date"=>"2022-10-11T06:25:44.474000Z"}, "end_to_end_product_boolean"=>true}, "input"=>{"filename_string"=>"S2B_OPER_MSI_L0__GR_REFS_20230215T152221_S20221011T062544_D11_N04.00"}, "quality"=>{}, "error_code"=>0, "duration_in_seconds"=>0.046, "missing_output"=>[]}, "kubernetes"=>{"pod_name"=>"metadata-part1-metadata-extraction-v4-7d8b9d65ff-524t6", "namespace_name"=>"processing", "pod_id"=>"e6a09afa-67d8-41ae-ae75-767f0e3c466f", "labels"=>{"linkerd_io/control-plane-ns"=>"networking", "linkerd_io/proxy-deployment"=>"metadata-part1-metadata-extraction-v4", "linkerd_io/workload-ns"=>"processing", "pod-template-hash"=>"7d8b9d65ff", "role"=>"spring-app", "spring-app-id"=>"metadata-part1-metadata-extraction-v4", "spring-application-name"=>"metadata-extraction", "spring-deployment-id"=>"metadata-part1-metadata-extraction-v4", "spring-group-id"=>"metadata-part1"}, "host"=>"ivv-cluster-node-43", "container_name"=>"metadata-part1-metadata-extraction-v4", "docker_id"=>"b20d687b49e93b82bef597d08e2d2f12bb472b3d2dd629007c335ddbacb186ee", "container_hash"=>"artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction@sha256:556dfceac666d28b8b4d3c8f95077fd7681a5cb0e8f07acb583f04ae14ce7b54", "container_image"=>"artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction:1.10.0-rc1"}}
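The fluent-bit rejection can be reproduced in isolation; a minimal sketch against a throwaway index (placeholder endpoint, hypothetical index name trace-repro), assuming the trace index maps the *_integer field as an Elasticsearch integer:

```python
import requests

ES_URL = "https://elasticsearch.example.local"  # placeholder endpoint

# Throwaway index where the suffix-typed field is mapped as integer,
# as the *_integer naming convention implies for the real trace index.
mapping = {
    "mappings": {"properties": {"task": {"properties": {"output": {"properties": {
        "product_metadata_custom_object": {"properties": {
            "quality_status_integer": {"type": "integer"},
        }},
    }}}}}},
}
requests.put(f"{ES_URL}/trace-repro", json=mapping, timeout=30)

# Indexing a document with a string value in that field is rejected with the
# same mapper_parsing_exception shown in the fluent-bit output above.
doc = {"task": {"output": {"product_metadata_custom_object": {"quality_status_integer": "NOMINAL"}}}}
resp = requests.post(f"{ES_URL}/trace-repro/_doc", json=doc, timeout=30)
print(resp.status_code, resp.json().get("error", {}).get("type"))  # 400, mapper_parsing_exception
```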

w-fsi commented 1 year ago

I can confirm the observation from @nleconte-csgroup that the field is set to NOMINAL, and this is also true for the recent version delivered today. I'll try to clarify the reason behind it, as there is explicit logic to set the status to either NOMINAL or DEGENERATED. Additionally, there is a duplicate field typed as STRING; however, I assume this is supposed to be the "quality info" that is originally used to determine the "quality status".

These fields only exist for S2, which also explains why this could only be observed for that mission.

LAQU156 commented 1 year ago

IVV_CCB_2023_w09 : Need to find a global solution, Action on Eng side to decide a solution to be implemented, moved into "Accepted ENG", action from @Woljtek

Woljtek commented 1 year ago

According to the Sentinel-2 Product Unit Definition and Metadata ICD, the qualityStatus & qualityInfo attributes shall exist for all levels, i.e.: image.png

To extract these two metadata attributes, the Quality_Info field shall be used with the following rules:

qualityStatus

qualityInfo

About the traces, I propose to apply the same rules, i.e.: quality_status_integer is filled with the metadata qualityInfo, and quality_status_string is filled with the metadata qualityStatus (a sketch of this mapping follows below).

Do you agree with my proposal?

Note: quality_status_string does not exist in "COPRS-ICD-ADST-001363642 - RS Trace tailored format - 4.0" for S2-L0: image.png
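To make the proposal concrete, a small hypothetical sketch (the function name and defaults are illustrative, not from the code) of how the two trace fields would be filled under this rule:

```python
def trace_quality_fields(metadata: dict) -> dict:
    """Hypothetical sketch of the proposed rule: the numeric qualityInfo feeds
    quality_status_integer and the textual qualityStatus feeds
    quality_status_string, so each trace field matches its declared type."""
    return {
        "quality_status_integer": int(metadata.get("qualityInfo", 0)),
        "quality_status_string": metadata.get("qualityStatus", "NOMINAL"),
    }

# With the values seen in the rejected trace above:
print(trace_quality_fields({"qualityInfo": 0, "qualityStatus": "NOMINAL"}))
# {'quality_status_integer': 0, 'quality_status_string': 'NOMINAL'}
```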

SYTHIER-ADS commented 1 year ago

Dear all,

Following the definition provided by @Woljtek, I think this applies to S2 products with an inventory_metadata.xml (i.e. currently L1 and L2), as explained in the applicable ICD: image

However for L0, we already agreed on how this field is to be calculated and converted:

| qualityStatus | "NOMINAL" or "DEGRADED" | "NOMINAL" or "DEGRADED" | "" | SAD_ID | based on quality info value: if 0 = 'NOMINAL' otherwise 'DEGRADED'. Extracted from the PDI_ID "Instance_ID" Degradation (see section 3.23.1) for SAD |
| -- | -- | -- | -- | -- | -- |
| qualityInfo | Quality_Indicators_Info/Image_Content_QI/Common_IMG_QI/DEGRADED_MSI_DATA_PERCENTAGE | Image_Data_Info/Sensor_Configuration/Source_Packet_Description/Degradation_Summary/NUMBER_OF_LOST_PACKETS | "" | "" | TBC in V2 |

Am I missing something?
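For the L0 case specifically, the agreed conversion quoted in the table reduces to a one-liner; a hypothetical sketch:

```python
def l0_quality_status(quality_info: int) -> str:
    """L0 rule from the table above: NOMINAL when the quality info value is 0,
    DEGRADED otherwise."""
    return "NOMINAL" if quality_info == 0 else "DEGRADED"

assert l0_quality_status(0) == "NOMINAL"
assert l0_quality_status(3) == "DEGRADED"
```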

Woljtek commented 1 year ago

So, the mapping to apply differs between the levels.

The most important thing is to set the right type, as described in my previous comment.

LAQU156 commented 1 year ago

IVV_CCB_2023_w11 : Moved into "Accepted Werum", Priority blocking
Werum_CCB_2023_w11 : Moved into "Sprint Backlog"

Woljtek commented 1 year ago

@pcuq-ads The fix is now deployed on OPS (v1.12.1); could you check if it works?

pcuq-ads commented 1 year ago

Test to be planned next week. MDC Extraction release 1.12.1 has been deployed on 22/03.

Test link here : https://processing.platform.ops-csc.com/kibana/app/discover#/view/22359150-c715-11ed-ba7e-3f30b67e0cc8?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-3d,to:now-1d))&_a=(columns:!(header.rs_chain_name,task.event,task.name,task.input.filename_string),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'807173f0-43f9-11ed-bb12-8378e4235da6',key:task.name.keyword,negate:!f,params:(query:MetadataExtraction),type:phrase),query:(match_phrase:(task.name.keyword:MetadataExtraction)))),index:'807173f0-43f9-11ed-bb12-8378e4235da6',interval:auto,query:(language:lucene,query:'task.input.filename_string:*MSI_L0__GR*%20AND%20task.event:BEGIN'),sort:!(!(header.timestamp,desc)))

pcuq-ads commented 1 year ago

SYS_CCB_2023_w13 : Finally! image.png

The issue is fixed.