MAAP-Project / maap-hec-aws


R1: ADES-K8S update job metrics to standardized format #28

Closed pymonger closed 2 years ago

pymonger commented 2 years ago

Issue #21 (https://app.zenhub.com/workspaces/maap-hec-aws-62619ee48e67030014a08234/issues/maap-project/maap-hec-aws/21) implemented the collection of job metrics in the ADES-K8S environment. However, the metrics were returned in the format output by calrissian, e.g.:

{
    "cores_allowed": 1.0,
    "ram_mb_allowed": 1073.741824,
    "children": [
        {
            "cpus": 1.0,
            "ram_megabytes": 268.435456,
            "disk_megabytes": 2.128405,
            "name": "stage_in",
            "start_time": "2022-05-10T20:53:28+00:00",
            "finish_time": "2022-05-10T20:53:32+00:00",
            "elapsed_hours": 0.0011111111111111111,
            "elapsed_seconds": 4.0,
            "ram_megabyte_hours": 0.29826161777777777,
            "cpu_hours": 0.0011111111111111111
        },
        {
            "cpus": 1.0,
            "ram_megabytes": 268.435456,
            "disk_megabytes": 0.012349,
            "name": "downsample_landsat",
            "start_time": "2022-05-10T20:53:35+00:00",
            "finish_time": "2022-05-10T20:54:04+00:00",
            "elapsed_hours": 0.008055555555555555,
            "elapsed_seconds": 29.0,
            "ram_megabyte_hours": 2.1623967288888886,
            "cpu_hours": 0.008055555555555555
        },
        {
            "cpus": 1.0,
            "ram_megabytes": 268.435456,
            "disk_megabytes": 0.0,
            "name": "stage_out",
            "start_time": "2022-05-10T20:54:08+00:00",
            "finish_time": "2022-05-10T20:54:09+00:00",
            "elapsed_hours": 0.0002777777777777778,
            "elapsed_seconds": 1.0,
            "ram_megabyte_hours": 0.07456540444444444,
            "cpu_hours": 0.0002777777777777778
        }
    ],
    "start_time": "2022-05-10T20:53:28+00:00",
    "finish_time": "2022-05-10T20:54:09+00:00",
    "elapsed_hours": 0.01138888888888889,
    "elapsed_seconds": 41.0,
    "total_cpu_hours": 0.009444444444444445,
    "total_ram_megabyte_hours": 2.535223751111111,
    "total_disk_megabytes": 2.140754,
    "total_tasks": 3,
    "max_parallel_cpus": 1.0,
    "max_parallel_ram_megabytes": 268.435456,
    "max_parallel_tasks": 1
}
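
As a reading aid for the sample above: its top-level totals appear to be plain sums over the `children` entries, and `elapsed_seconds` appears to be the difference between `finish_time` and `start_time`. The sketch below checks that against a parsed report. This is an observation about this one sample, not a statement about how calrissian computes its report, and `check_totals` is a hypothetical helper name.

```python
import math
from datetime import datetime


def check_totals(report, rel_tol=1e-9):
    """Compare top-level totals in a calrissian-style report to its children.

    Hypothetical helper: assumes the report has the keys shown in the
    sample above. Returns a dict of field name -> bool (True if consistent).
    """
    children = report["children"]

    def child_sum(key):
        return sum(child[key] for child in children)

    # Timestamps in the sample are ISO 8601 with an explicit UTC offset.
    start = datetime.fromisoformat(report["start_time"])
    finish = datetime.fromisoformat(report["finish_time"])

    return {
        "total_cpu_hours": math.isclose(
            child_sum("cpu_hours"), report["total_cpu_hours"], rel_tol=rel_tol
        ),
        "total_ram_megabyte_hours": math.isclose(
            child_sum("ram_megabyte_hours"),
            report["total_ram_megabyte_hours"],
            rel_tol=rel_tol,
        ),
        "total_disk_megabytes": math.isclose(
            child_sum("disk_megabytes"),
            report["total_disk_megabytes"],
            rel_tol=rel_tol,
        ),
        "total_tasks": len(children) == report["total_tasks"],
        "elapsed_seconds": math.isclose(
            (finish - start).total_seconds(), report["elapsed_seconds"]
        ),
    }
```

Calling `check_totals(json.load(f))` on a saved report returns one boolean per field, which makes it easy to see which totals can be recomputed from the per-step entries.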

This task involves updating the returned metrics so that they conform to the standardized set of metrics:

            job.time_queued # when ADES accepts it
            job.time_started # when ADES starts running the job
            job.time_end # when the process ends, regardless of whether it succeeded or failed
            job.exit_code
            job.work_dir_size # at the end of the job
            job.memory_max
            job.priority # Ops-perceived priority across all ADESes. This should affect the estimated queue time.
            node.cores
            node.memory
            node.disk_space_free
            node.ip_address # internal and/or public
            node.hostname
            blob # everything else in text blob field to search later
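
A minimal sketch of what the conversion could look like, assuming the standardized metrics are returned as a nested dict with `job`, `node`, and `blob` keys. The function name `to_standard_metrics`, the output layout, and the choice of `max_parallel_ram_megabytes` as the source for `job.memory_max` are all assumptions for illustration, not the format settled on in this issue. Fields the calrissian report does not carry (queue time, exit code, work dir size, priority, node details) are taken from the caller and left as `None` when unknown.

```python
import json
from typing import Any, Dict, Optional


def to_standard_metrics(
    report: Dict[str, Any],
    *,
    time_queued: Optional[str] = None,
    exit_code: Optional[int] = None,
    work_dir_size: Optional[float] = None,
    priority: Optional[int] = None,
    node: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Map a calrissian-style usage report onto the standardized field names.

    Hypothetical sketch: only start/finish times and peak RAM are read from
    the report; everything else must be supplied by the caller.
    """
    node = node or {}
    return {
        "job": {
            "time_queued": time_queued,
            "time_started": report.get("start_time"),
            "time_end": report.get("finish_time"),
            "exit_code": exit_code,
            "work_dir_size": work_dir_size,
            # Assumption: peak parallel RAM (in MB) stands in for memory_max.
            "memory_max": report.get("max_parallel_ram_megabytes"),
            "priority": priority,
        },
        "node": {
            "cores": node.get("cores"),
            "memory": node.get("memory"),
            "disk_space_free": node.get("disk_space_free"),
            "ip_address": node.get("ip_address"),
            "hostname": node.get("hostname"),
        },
        # Keep the full original report as text so it stays searchable later.
        "blob": json.dumps(report, sort_keys=True),
    }
```

Keeping the untouched calrissian report in `blob` means no information is lost while the standardized fields are still being agreed on.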

pymonger commented 2 years ago

Superseded by #40.