Closed by pcuq-ads 1 year ago
@pcuq-ads : Just to be sure, are you referring here to the message "END metadata extraction"? I just checked the code and there is no specific handling for S2, so the behaviour shall be the same across the mission. Did you check whether an error is raised for the affected products? The only option I see for this scenario is that some unexpected exception occurred during the metadata extraction itself.
In that case I would need a log file from the extraction itself to check why it occurred. Is this affecting only this product type, or other S2 products as well?
Be aware that today's delivery contains a version that changes the way S2 metadata is extracted. It might be that this solves the issue on its own, but I would still like to understand what exactly went wrong during the extraction.
@w-fsi: I checked the issue again and I confirm it. I do not filter on status OK or NOK.
If we count BEGIN and END events for the service, we see a huge discrepancy for the Sentinel-2 products to be stored in the catalog.
The same graph on Monday (dedicated to S1 production).
The same graph on Wednesday (dedicated to Sentinel-3). There is a gap at 00:00; I can check, but it is most likely due to remaining S2 activities from the day before.
The impact is that we have no information in the database about the S2 production, which is blocking for PI computation.
We only see errors about SADATA, around 60 over the same period, whereas this issue concerns about 15K messages over the same period.
I checked the logs with LOKI:
{pod="metadata-catalog-part1-metadata-extraction-v7-547885959b-bpmzq"} |~ ERROR|WARN|WARNING
!= SADATA
If I remove the SADATA issues (60), there are no ERRORs remaining on the Sentinel-2 days, only warnings.
Here :
https://app.zenhub.com/files/398313496/1b8554b9-d5cd-4ad3-8543-2185fb28b846/download
Again, the catalog service starts with a BEGIN but does not provide any END trace for 15K messages on the day with Level-0 S2 production.
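For cross-checking the discrepancy directly in LOKI, a metric query that counts BEGIN vs END REPORT lines can help. Below is a minimal sketch against the Loki HTTP API; the Loki URL, lookback window and pod label pattern are assumptions to adapt to the platform.

```python
import requests

LOKI_URL = "http://loki:3100"  # assumed Loki endpoint, adjust to the platform
SELECTOR = '{pod=~"metadata-catalog-part1-metadata-extraction-v7.*"}'  # label pattern from the query above

def count_report_events(event: str, lookback: str = "24h") -> float:
    """Count REPORT lines of the MetadataExtraction task with the given event (BEGIN/END)."""
    logql = (
        f'sum(count_over_time({SELECTOR} '
        f'|= `"name":"MetadataExtraction"` '
        f'|= `"event":"{event}"` [{lookback}]))'
    )
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query", params={"query": logql}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

begin, end = count_report_events("BEGIN"), count_report_events("END")
print(f"BEGIN={begin:.0f} END={end:.0f} missing={begin - end:.0f}")
```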
Hm, I am surprised that no error traces exist.
When checking the log I see a warning that looks like what I was searching for:
2023-02-14T19:09:38Z {"header":{"type":"LOG","timestamp":"2023-02-14T19:09:38.244165Z","level":"WARN","line":36,"file":"Retries.java","thread":"KafkaConsumerDestination{consumerDestinationName='metadata-catalog-part1.metadata-filter', partitions=5, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Error on performing Create metadata {\"productFamily\":\"S2_L0_GR\",\"creationTime\":\"2023-02-14T18:38:33.000000Z\",\"insertionTime\":\"2023-02-14T19:09:28.227500Z\",\"instrumentShortName\":\"L0_\",\"productGroupId\":\"S2A_OPER_MSI_L0__DS_REFS_20230214T183833_S20230213T233437_N02.08\",\"productName\":\"S2A_OPER_MSI_L0__GR_REFS_20230214T183833_S20230213T233451_D09_N02.08\",\"orbitNumber\":\"\",\"qualityInfo\":0,\"platformShortName\":\"SENTINEL-2\",\"qualityStatus\":\"NOMINAL\",\"startTime\":\"2023-02-13T23:34:51.894000Z\",\"stopTime\":\"2023-02-13T23:34:51.894000Z\",\"productType\":\"MSI_L0__GR\",\"productClass\":\"OPER\",\"missionId\":\"S2\",\"satelliteId\":\"A\",\"relativeOrbitNumber\":\"\",\"coordinates\":{\"orientation\":\"counterclockwise\",\"coordinates\":[[[159.357627802118,-29.9518075271036],[159.260732838059,-30.3161089778449],[159.521680254522,-30.373028330575],[159.616816999806,-30.0078386285303],[159.357627802118,-29.9518075271036]]],\"type\":\"polygon\"},\"processorVersion\":\"\",\"t0PdgsDate\":\"2023-02-14T00:17:00.000000Z\",\"url\":\"S2A_OPER_MSI_L0__GR_REFS_20230214T183833_S20230213T233451_D09_N02.08\",\"platformSerialIdentifier\":\"S2A\",\"operationalMode\":\"\",\"processingCenter\":\"RS\",\"timeliness\":\"S2_L0\",\"processMode\":\"NOMINAL\"} (1/2), retrying in 1000ms"},"custom":{"logger_string":"esa.s1pdgs.cpoc.common.utils.Retries"}}
The reason for the exception is not clear to me at the moment, but if an exception is thrown, I would expect it to result in an ERROR trace.
@pcuq-ads Could you please provide an unfiltered extract of the log from around 2023-02-14T19:09:38Z? We are trying to make sense of what the service is running into, as the problem occurs while saving the metadata for the S2_L0_GR in the ES. As the logs are already a week old, we have no other way to take a look at them.
@pcuq-ads I never claimed that. But the log is the only hint we have of what actually happened. It seems like some communication issue with the ES server occurred (best guess: missing indices?). When looking at the current errors in the extraction (missing sad index), a report error is generated, however.
Here are the requested logs: https://app.zenhub.com/files/398313496/8ece293c-7af7-483c-9588-2a6251fa8e08/download
@pcuq-ads Thank you for the quick response. I checked the logs, and the WARN that we found earlier does not seem to be an issue, as the metadata could be generated as expected. In fact, the log contains 102 BEGIN and 103 END traces, so everything seems to be in line here.
From your earlier graph of Tuesday's production, the time frame around 04:00 looks interesting, as there is a big discrepancy between BEGIN and END traces. Would it be possible to provide an unfiltered log from that time frame, so we can check whether we find a clue there?
Regarding the ERROR messages for the S2_SAD products, the root cause is a missing ES index for the product family. To fix this issue, you need to create the index s2_sad.
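For reference, a minimal sketch of that index creation with the Elasticsearch Python client (8.x); the host and the idea of reusing the mapping of an existing s1_sad index are assumptions, not the documented procedure:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-master:9200")  # assumed ES endpoint

if not es.indices.exists(index="s2_sad"):
    # Reuse the mapping of an already provisioned SAD index (assumed here to be s1_sad)
    # so that the S2 AUX_SADATA products are indexed with the same field types.
    mappings = es.indices.get_mapping(index="s1_sad")["s1_sad"]["mappings"]
    es.indices.create(index="s2_sad", mappings=mappings)
```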
IVV_CCB_2023_w08 : Moved into "Accepted Werum", First analysis done by Werum team, Priority blocking
@w-jka : The bug about SADATA is the following one: #829. I am extracting pieces of log from LOKI and I could not find the discrepancies. This is strange. I will try either: 1) to find discrepancies in the LOG => issue on Metadata Catalog, or 2) to find a UID with an END trace in LOKI but not in the TRACEs => issue on INFRA.
I will do the check and come back to you.
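A minimal sketch of check 2) (END traces present in LOKI but absent from the TRACEs), assuming two JSON-lines exports with hypothetical file names; only the task.uid of MetadataExtraction END events is compared:

```python
import json

def end_uids(path: str) -> set[str]:
    """Collect task.uid of MetadataExtraction END traces from a JSON-lines export."""
    uids = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            start = line.find("{")  # LOKI lines may carry a timestamp prefix before the JSON
            if start < 0:
                continue
            try:
                record = json.loads(line[start:])
            except json.JSONDecodeError:
                continue
            task = record.get("task", {})
            if task.get("name") == "MetadataExtraction" and task.get("event") == "END":
                uids.add(task["uid"])
    return uids

# Hypothetical export file names: one from LOKI, one from the trace storage.
missing = end_uids("loki_export.jsonl") - end_uids("trace_export.jsonl")
print(f"{len(missing)} END traces found in LOKI but not in the TRACEs:")
print("\n".join(sorted(missing)))
```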
I checked the period from 14/02/2023 20:00:00 to 14/02/2023 22:30:00. The following END tasks are present in the LOG but not in the TRACEs:
b056cda0-d023-4b03-9c18-593b84257505
d09d7bb3-4166-472a-a1e8-88042fef02c2
0e90c72d-433f-4d21-8623-15d7e537b07d
4d25b259-e8fe-43b0-8107-b570f5b80954
9598a121-6ec7-4d36-98b0-959531291218
e9e047e7-5f13-4c3a-81bb-4802d505e841
3937dcef-1fd4-4cbf-967c-9fa7f98b77ef
86d9acba-7955-4759-83e4-470d953aacfa
bfb06e30-c6da-4bb3-8fc8-9022c97607eb
df17708b-585b-4d81-8167-7cb13975fbbf
5e4cfd93-1866-4761-a801-3010f4d3023e
ca0585ec-83b7-4ac1-a121-fdfd3ea793a1
d9cb705e-d8e3-4933-a433-dea6a3c08278
etc.
Here are the logs (REPORT) on the LOKI side:
{"header":{"type":"REPORT","timestamp":"2023-02-14T21:17:04.471000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"b056cda0-d023-4b03-9c18-593b84257505","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:23:14.165000Z","product_sensing_stop_date":"2023-02-14T00:23:14.165000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:23:14.165000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:23:14.165000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002314_D06_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.01,"missing_output":[]}}
{"header":{"type":"REPORT","timestamp":"2023-02-14T21:16:51.303000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"d09d7bb3-4166-472a-a1e8-88042fef02c2","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:23:10.557000Z","product_sensing_stop_date":"2023-02-14T00:23:10.557000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:23:10.557000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:23:10.557000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002310_D05_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.012,"missing_output":[]}}
{"header":{"type":"REPORT","timestamp":"2023-02-14T21:16:39.102000Z","level":"INFO","mission":"S2","workflow":"NOMINAL","rs_chain_name":"METADATA","rs_chain_version":"1.8.0-rc1"},"message":{"content":"End metadata extraction"},"task":{"uid":"e66c7204-8974-4a8b-8b3e-e581b6bc96ef","name":"MetadataExtraction","event":"END","status":"OK","output":{"mission_identifier_string":"S2","type_string":"S2_L0_GR","product_sensing_start_date":"2023-02-14T00:21:54.794000Z","product_sensing_stop_date":"2023-02-14T00:21:54.794000Z","timeliness_name_string":"S2_L0","timeliness_value_seconds_integer":4500,"product_metadata_custom_object":{"processing_level_integer":0,"product_group_id":"S2B_OPER_MSI_L0__DS_REFS_20230214T194239_S20230214T002053_N02.08","quality_status_integer":"NOMINAL","orbit_number_integer":"","processor_version_string":"","product_type_string":"MSI_L0__GR","platform_serial_identifier_string":"B","ending_date_time_date":"2023-02-14T00:21:54.794000Z","instrument_short_name_string":"L0_","platform_short_name_string":"SENTINEL-2","beginning_date_time_date":"2023-02-14T00:21:54.794000Z"},"end_to_end_product_boolean":true},"input":{"filename_string":"S2B_OPER_MSI_L0__GR_REFS_20230214T194239_S20230214T002154_D06_N02.08"},"quality":{},"error_code":0,"duration_in_seconds":1.022,"missing_output":[]}}
So this is an infrastructure issue.
IVV_CCB_2023_w08 [EDIT] : Moved into "Accepted CS"
CS_CCB_2023_w08 : Moved into "Product Backlog", highest priority
Some context :
"host": "cluster-ops-node-97",
"container_name": "metadata-catalog-part1-metadata-extraction-v7",
"docker_id": "e676a70b3f1b4da6ed728146618a72972f312278e498cfdbcdf30556cb2f087c",
"container_hash": "artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog- extraction@sha256:25fc33d66fb4ae597ba76ee8b618fb17dae7968f05ab9d6dc54a46a41bd39bc2",
"container_image": "artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction:1.8.0-rc
The END trace is missing in Elasticsearch because of a wrong mapping for an integer field. The field task.output.product_metadata_custom_object.quality_status_integer contains "NOMINAL":
2023-02-22 16:53:36 +0000 [warn]: #0 fluent/log.rb:351:warn: dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch [error type]: mapper_parsing_exception [reason]: 'failed to parse field [task.output.product_metadata_custom_object.quality_status_integer] of type [integer] in document with id 't40LeoYBrZeZNgw8b8kR'. Preview of field's value: 'NOMINAL''" location=nil tag="fluentbit.trace" time=2023-02-22 16:52:44.314186590 +0000 record={"header"=>{"type"=>"REPORT", "timestamp"=>"2023-02-15T16:31:29.467000Z", "level"=>"INFO", "mission"=>"S2", "workflow"=>"NOMINAL", "rs_chain_name"=>"METADATA", "rs_chain_version"=>"1.10.0-rc1"}, "message"=>{"content"=>"End metadata extraction"}, "task"=>{"uid"=>"7c140b82-0dc7-4b0a-bdd6-0f095149c04c", "name"=>"MetadataExtraction", "event"=>"END", "status"=>"OK", "output"=>{"mission_identifier_string"=>"S2", "type_string"=>"S2_L0_GR", "product_sensing_start_date"=>"2022-10-11T06:25:44.474000Z", "product_sensing_stop_date"=>"2022-10-11T06:25:44.474000Z", "timeliness_name_string"=>"S2_L0", "timeliness_value_seconds_integer"=>4500, "product_metadata_custom_object"=>{"product_group_id"=>"S2B_OPER_MSI_L0__DS_REFS_20230215T152221_S20221011T062334_N04.00", "processing_level_integer"=>0, "quality_status_integer"=>"NOMINAL", "orbit_number_integer"=>"", "processor_version_string"=>"", "product_type_string"=>"MSI_L0__GR", "platform_serial_identifier_string"=>"B", "ending_date_time_date"=>"2022-10-11T06:25:44.474000Z", "instrument_short_name_string"=>"MSI", "platform_short_name_string"=>"SENTINEL-2", "beginning_date_time_date"=>"2022-10-11T06:25:44.474000Z"}, "end_to_end_product_boolean"=>true}, "input"=>{"filename_string"=>"S2B_OPER_MSI_L0__GR_REFS_20230215T152221_S20221011T062544_D11_N04.00"}, "quality"=>{}, "error_code"=>0, "duration_in_seconds"=>0.046, "missing_output"=>[]}, "kubernetes"=>{"pod_name"=>"metadata-part1-metadata-extraction-v4-7d8b9d65ff-524t6", "namespace_name"=>"processing", "pod_id"=>"e6a09afa-67d8-41ae-ae75-767f0e3c466f", "labels"=>{"linkerd_io/control-plane-ns"=>"networking", "linkerd_io/proxy-deployment"=>"metadata-part1-metadata-extraction-v4", "linkerd_io/workload-ns"=>"processing", "pod-template-hash"=>"7d8b9d65ff", "role"=>"spring-app", "spring-app-id"=>"metadata-part1-metadata-extraction-v4", "spring-application-name"=>"metadata-extraction", "spring-deployment-id"=>"metadata-part1-metadata-extraction-v4", "spring-group-id"=>"metadata-part1"}, "host"=>"ivv-cluster-node-43", "container_name"=>"metadata-part1-metadata-extraction-v4", "docker_id"=>"b20d687b49e93b82bef597d08e2d2f12bb472b3d2dd629007c335ddbacb186ee", "container_hash"=>"artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction@sha256:556dfceac666d28b8b4d3c8f95077fd7681a5cb0e8f07acb583f04ae14ce7b54", "container_image"=>"artifactory.coprs.esa-copernicus.eu/rs-docker/rs-core-metadata-catalog-extraction:1.10.0-rc1"}}
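The rejection can be reproduced against a throw-away index carrying the same integer mapping; a minimal sketch with the Elasticsearch Python client (8.x), where the host and index name are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-master:9200")  # assumed ES endpoint

# Throw-away index whose only mapped field mirrors the integer typing of the trace index.
field_path = ["task", "output", "product_metadata_custom_object"]
props = {"quality_status_integer": {"type": "integer"}}
for name in reversed(field_path):
    props = {name: {"properties": props}}
es.indices.create(index="mapping-check", mappings={"properties": props})

# Indexing the value seen in the S2 L0 traces triggers the same 400
# mapper_parsing_exception that fluentd reports above.
doc = {"task": {"output": {"product_metadata_custom_object": {"quality_status_integer": "NOMINAL"}}}}
try:
    es.index(index="mapping-check", document=doc)
except Exception as exc:  # BadRequestError on the 8.x client
    print("rejected:", exc)
```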
I can confirm the observation from @nleconte-csgroup that the field is set to NOMINAL, and this is also true for the recent version delivered today. I'll try to clarify the reason behind it, as there is explicit logic to set the status to either NOMINAL or DEGENERATED. Additionally, there is a duplicated field name typed as STRING; however, I assume that this is supposed to be the "quality info" that is originally used to determine the "quality status".
These fields only exist for S2, which also explains why this could only be observed for that mission.
IVV_CCB_2023_w09 : Need to find a global solution, Action on Eng side to decide a solution to be implemented, moved into "Accepted ENG", action from @Woljtek
According to the Sentinel-2 Product Unit Definition and Metadata ICD, the qualityStatus and qualityInfo attributes shall exist for all levels.
To extract these 2 metadata attributes, the Quality_Info field shall be used with the following rules:
- qualityStatus (string):
  if Quality_Info is "100.0": qualityStatus = "NOMINAL"
  else: qualityStatus = "DEGRADED"
- qualityInfo (int64): int(Quality_Info)
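As an illustration of the rules above (a sketch only, not the actual extraction code), the derivation could look like this:

```python
def derive_quality_attributes(quality_info_field: str) -> dict:
    """Derive qualityStatus / qualityInfo from the Quality_Info field,
    following the rules quoted above from the Metadata ICD."""
    quality_status = "NOMINAL" if quality_info_field == "100.0" else "DEGRADED"
    # int(Quality_Info): go through float() so that values like "100.0" parse cleanly
    quality_info = int(float(quality_info_field))
    return {"qualityStatus": quality_status, "qualityInfo": quality_info}

print(derive_quality_attributes("100.0"))  # {'qualityStatus': 'NOMINAL', 'qualityInfo': 100}
print(derive_quality_attributes("98.7"))   # {'qualityStatus': 'DEGRADED', 'qualityInfo': 98}
```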
About the traces, I propose to apply the same rules, i.e.:
- quality_status_integer: filled with the metadata qualityInfo
- quality_status_string: filled with the metadata qualityStatus
Do you agree with my proposal?
Note: quality_status_string does not exist in "COPRS-ICD-ADST-001363642 - RS Trace tailored format - 4.0" for S2-L0:
Dear all,
Following the definition provided by @Woljtek, I think that this applies to S2 products with inventory_metadata.xml (i.e. currently L1 and L2), as explained in the applicable ICD:
However for L0, we already agreed on how this field is to be calculated and converted:
Am I missing something?
So, the mapping to apply differs between the levels.
The most important thing is to set the right type, as described in my previous comment.
IVV_CCB_2023_w11 : Moved into "Accepted Werum", Priority blocking Werum_CCB_2023_w11 : Moved into "Sprint Backlog"
@pcuq-ads The fix is now deployed on OPS (v1.12.1), could you check if it works ?
Test to be planned next week. MDC Extraction release 1.12.1 was deployed on 22/03.
SYS_CCB_2023_w13 : Finally !
The issue is fixed.
Environment:
Traceability:
Current Behavior: The Metadata Extraction catalog does not generate an END trace for product S2B_OPER_MSI_L0__GR.
Expected Behavior: The Metadata Extraction catalog shall generate BEGIN and END traces for every activity. The impact is that the product does not appear in the Product table and we cannot compute the performance.
Steps To Reproduce: Check the BEGIN and END trace counts over time for task.name=MetadataExtraction.
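A minimal sketch of such a check as an Elasticsearch aggregation (Python client 8.x); the endpoint, the trace index pattern and keyword mappings of the trace fields are assumptions to adapt to the platform:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-master:9200")  # assumed trace ES endpoint

resp = es.search(
    index="fluentbit-trace-*",  # assumed index pattern fed by fluentbit
    size=0,
    query={"bool": {"filter": [
        {"term": {"task.name": "MetadataExtraction"}},
        {"range": {"header.timestamp": {"gte": "now-24h"}}},
    ]}},
    aggs={"per_mission": {
        "terms": {"field": "header.mission"},
        "aggs": {"per_event": {"terms": {"field": "task.event"}}},
    }},
)

# A healthy period shows BEGIN and END counts of the same order per mission.
for mission in resp["aggregations"]["per_mission"]["buckets"]:
    counts = {b["key"]: b["doc_count"] for b in mission["per_event"]["buckets"]}
    print(mission["key"], counts)
```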
Test execution artefacts (i.e. logs, screenshots…): If we count traces over time, we see that the BEGIN and END counts diverge during S2 production.
We do not see END traces for S2 in these periods (even with status NOK).
If we filter on missions S1 and S3, everything is OK (phew!).
Whenever possible, first analysis of the root cause
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)