GeoscienceAustralia / dea-orchestration


Automate dataset processing information from the pbs logs #96

Closed santoshamohan closed 5 years ago

santoshamohan commented 5 years ago

Reason for this pull request

Analysing the datasets processed during orchestration would enable us to use NCI resources more efficiently. Monitoring dataset processing information (such as the number of datasets found, datasets indexed, datasets failed, and service units used) from the automated datacube ingest for the current Landsat NBAR/NBART/PQ/WOfS/Fractional Cover products would add value to the dea-orchestration process. Dataset information is read from the PBS logs and then pushed to AWS Elasticsearch for further analysis.

Proposed solution

  1. Upon new S3 object creation, an event notification is sent to AWS SQS to delay the Lambda function execution. The delay gives the PBS logs time to become available in the NCI directory before processing. The SQS message delivery delay in this configuration is set to 10 minutes.
  2. The read_nci_email.py handler is updated to extract dataset information from the logs using appropriate regular expression searches and to push the updated metadata document into Amazon ES.
  3. A new raijin_scripts/execute_fetch_dataset_info/run script is created to fetch dataset processing info from the PBS logs in the NCI directory.
  4. serverless.yml is updated to provide the AWS SQS IAM role and allow AWS SQS to trigger the AWS Lambda function.
  5. Minor documentation updates are made to lambda_functions/nci_monitoring/package.json and lambda_functions/nci_monitoring/raijin_ssh.py.
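Steps 1 and 4 above could be sketched in serverless.yml along these lines. This is a minimal sketch under stated assumptions: the queue name, handler path, and resource layout are placeholders for illustration, not the repository's actual configuration.

```yaml
# Sketch only: names and structure are illustrative, not the real serverless.yml.
resources:
  Resources:
    PbsLogQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: pbs-log-events
        DelaySeconds: 600        # 10-minute message delivery delay

functions:
  processPbsLogs:
    handler: read_nci_email.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt: [PbsLogQueue, Arn]
```

Setting DelaySeconds on the queue (rather than sleeping inside the Lambda) keeps the function's billed execution time short while still giving the PBS logs time to land on the NCI filesystem.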
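The regular-expression extraction in step 2 could look roughly like the following. The log line wording and field names are assumptions for illustration; the actual PBS log format and the patterns in read_nci_email.py may differ.

```python
import re

# Hypothetical PBS log excerpt; the real log format may differ.
SAMPLE_LOG = """\
3 successful datasets found
2 datasets indexed
1 dataset failed
Service Units: 14.27
"""

# One pattern per counter we want to capture. The exact wording of each
# line is an assumption, not the verified NCI log format.
PATTERNS = {
    'datasets_found': re.compile(r'(\d+) successful datasets found'),
    'datasets_indexed': re.compile(r'(\d+) datasets? indexed'),
    'datasets_failed': re.compile(r'(\d+) datasets? failed'),
    'service_units': re.compile(r'Service Units:\s*([\d.]+)'),
}

def extract_dataset_info(log_text):
    """Return a dict of counters found in a PBS log, skipping absent fields."""
    info = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            value = match.group(1)
            info[key] = float(value) if '.' in value else int(value)
    return info

print(extract_dataset_info(SAMPLE_LOG))
```

Missing fields are simply omitted from the result, so a partially written log (one reason for the 10-minute SQS delay) degrades gracefully rather than raising.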
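The "updated metadata document" pushed to Amazon ES in step 2 might be assembled as below. The field names and index name are hypothetical, and the actual indexing call in the handler may use a different client or mapping.

```python
from datetime import datetime, timezone

def build_es_document(product, dataset_info):
    """Combine the product name, a UTC timestamp, and the counters
    extracted from the PBS log into an Elasticsearch document body.
    Field names here are illustrative only."""
    doc = {
        'product': product,
        'timestamp': datetime.now(timezone.utc).isoformat(),
    }
    doc.update(dataset_info)
    return doc

doc = build_es_document('ls8_nbar_albers',
                        {'datasets_indexed': 2, 'service_units': 14.27})
# The handler would then index it, e.g. with elasticsearch-py:
#   es.index(index='dea-pbs-logs', body=doc)
```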