NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Create reusable log aggregator tools #227

Closed krisstanton closed 10 months ago

krisstanton commented 1 year ago

This applies to both Airflow logs and Cumulus logs. It is essentially a reusable script that wraps AWS CLI commands and can generate reports for each run.
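For reference, the wrapper idea can be sketched in a few lines: run an AWS CLI command from Python, parse its JSON output, and build the report from the parsed data. This is only a sketch under my own assumptions (the helper name and the example command are illustrative, not the actual script):

```python
import json
import subprocess

def run_cli_json(cmd: list[str]) -> dict:
    """Run a command line and parse its stdout as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)

# Hypothetical usage against CloudWatch Logs (log group name is made up):
# events = run_cli_json([
#     "aws", "logs", "filter-log-events",
#     "--log-group-name", "/aws/lambda/example",
#     "--filter-pattern", "ERROR",
#     "--output", "json",
# ])
```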

krisstanton commented 1 year ago

Moving this forward to the next sprint and backlogging it until the Orca stuff is completed.

krisstanton commented 10 months ago

Update on this ticket.

I now have a somewhat functioning Log Aggregator for Cumulus (as of the last PROD version, before the recent optimizations that now live in CBA Prod).

Below are some of the notes I collected on the challenges of assembling a flat set of logs and specific granule IDs.

I tried a number of things to make this faster, more efficient, and easier to use and understand.

Here are some of the problems I ran into:
    -No way to get a count of execution histories, so we don't know exactly how many executions there are unless we query them all one by one.
    -The CLI command for listing execution ARNs runs fast, but only if you pair it with 'head', which results in a broken pipe (meaning we read only the first lines of the results rather than the whole set; I'm assuming the data streams back from AWS, so this should be OK). It also only lets us read result pages 1 through N, rather than jumping straight to page N (each page has 100 results).
        -I tried a few alternatives that did not work (pairing 'head' with 'tail', using 'awk', etc.). Every time, the request took minutes to execute and usually did not return the correct data.
        -The only CLI invocation that ran fast was the one that pipes the output through 'head' alone.
    -Wrapping this in Python was a bit challenging; I ended up adding command-line parameters so that we can run it against different environments.
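For what it's worth, the "pages 1 through N" limitation can likely be handled from Python without the 'head' trick: boto3's paginator for list_executions fetches pages lazily, so stopping early reads only as many pages as needed and avoids the broken pipe entirely. A sketch under assumptions (the fake pages below stand in for real paginator output, and the ARN values are made up):

```python
from itertools import islice
from typing import Iterable, Iterator

def iter_execution_arns(pages: Iterable[dict]) -> Iterator[str]:
    """Yield execution ARNs one at a time from an iterable of result pages.

    In real use, `pages` would come from boto3, e.g.
    sfn.get_paginator("list_executions").paginate(stateMachineArn=...),
    which requests pages lazily; an early stop (like 'head') therefore
    never downloads the remaining pages.
    """
    for page in pages:
        for execution in page.get("executions", []):
            yield execution["executionArn"]

# Demonstration with fake pages (the real API returns up to 100 per page):
fake_pages = [
    {"executions": [{"executionArn": f"arn:demo:{i}"} for i in range(3)]},
    {"executions": [{"executionArn": f"arn:demo:{i}"} for i in range(3, 6)]},
]
first_four = list(islice(iter_execution_arns(fake_pages), 4))
```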

    One problem that will come up later:
        We have made changes to the structure of the state machine, and at least some of those changes will likely have downstream effects on these CLI commands.
        If so, we will need to revisit this code and update it.
        We will only know once we run this against the new CBA Prod after we do some of the ingests.

It takes about an hour to run 1000 results.

If possible, I suggest we make a change to the state machine (or even insert a new lambda) that would run only on failure.
    -This lambda should send info to a separate log group (with a name similar to: cumulus_failed_ingests)
    -The info that should be sent:
        -The error message
        -The granule info that was not ingested (IDs, file names, etc.)
        -Some meta info, including:
            -The ExecutionArn
            -The ParentExecutionArn

    -If we had a flat log group with ONLY errors, it would be significantly easier to grab ALL of the errors and parse them for uniqueness using the method developed here.
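A minimal sketch of the record such an on-failure lambda might write, assuming only the fields listed above; the log group name, helper name, and every value here are hypothetical, and actual delivery would go through something like the CloudWatch Logs put_log_events API:

```python
import json
from typing import Optional

# Hypothetical name taken from the proposal above.
FAILED_INGESTS_LOG_GROUP = "cumulus_failed_ingests"

def build_failure_record(
    error_message: str,
    granules: list,
    execution_arn: str,
    parent_execution_arn: Optional[str],
) -> str:
    """Build the JSON log line the on-failure lambda would emit to the
    dedicated error log group."""
    return json.dumps({
        "error": error_message,
        "granules": granules,  # IDs, file names, etc.
        "meta": {
            "executionArn": execution_arn,
            "parentExecutionArn": parent_execution_arn,
        },
    })

# Usage sketch with made-up values:
record = build_failure_record(
    "IngestFailure: checksum mismatch",
    [{"granuleId": "G-0001", "files": ["g0001.h5"]}],
    "arn:aws:states:us-west-2:123456789012:execution:demo:abc",
    None,
)
```

With every failure landing in one flat log group as single-line JSON, deduplicating errors becomes a simple parse-and-set operation rather than a crawl over all execution histories.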
chuckwondo commented 10 months ago

@krisstanton, is there a branch where you've put your script? There are a number of things you can do, which I've mentioned over several conversations about this, that will avoid the issues you're describing. I'm happy to pair up and dive deeper into this with you.

krisstanton commented 10 months ago

Thanks @chuckwondo for helping me out with this one. I added you to the ticket.
Another quick update. Chuck created a nifty TypeScript script for selecting the many disparate AWS log data for Cumulus, with some improvements over the Python version I completed. I'll make another ticket for the next sprint to fully integrate the aggregation capabilities with the selection code Chuck made. At the end of that ticket, this utility will live in the scripts directory of our Cumulus implementation and will be able to scan Airflow and Cumulus logs for errors and aggregate them; we are very close to that result now.