Earth-Information-System / fireatlas


Conus DPS Jobs take a long time to run #2

Open mccabete opened 2 months ago

mccabete commented 2 months ago

Members of the team have long complained that the DPS job (version 2) takes a long time to run. I wanted to put some numbers on that over the course of the fire season.

[image: CONUS DPS v2 job runtimes over the fire season]

^ That is from a query (by bugging Sujen) of the DPS job management system. Seems like a huge seasonal dependence.

Time of run seems proportional to directory size:

[image: runtime vs. job directory size]

Hopefully, version three will be faster.

wildintellect commented 2 months ago

@jsignell and I will get you the details, @mccabete, on how to access more recent v3 run metadata for comparison.

wildintellect commented 2 months ago

Thanks for starting this, @mccabete. @jsignell and I will work to get you information on how to access more recent logs so you can analyze the V3 logs. We do ask that you post your log analysis code to GitHub when ready so the rest of the MAAP/VEDA community can learn. https://github.com/MAAP-Project/maap-documentation-examples/ would accept your code if you don't have another home.

chuckwondo commented 1 month ago

@wildintellect, anything I can do here to help out?

wildintellect commented 1 month ago

@chuckwondo take a look at merging the existing logs for a first pass at the data, not sure what the best approach is - pandas, duckdb, etc... (I'll share the bucket info internally)
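For the first pass, something like this pandas sketch would probably do (the bucket and prefix below are placeholders for what I'll share):

import pandas as pd
import s3fs

# Placeholder bucket/prefix; the real values will be shared internally.
fs = s3fs.S3FileSystem()
paths = fs.glob("example-maap-metrics-bucket/dps-job-metrics/*.csv")

# Read each daily export and stack them into one frame.
merged = pd.concat(
    (pd.read_csv(fs.open(path)) for path in paths),
    ignore_index=True,
)
merged.to_csv("dps_job_metrics.csv", index=False)  # one merged file for analysis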

chuckwondo commented 1 month ago

@wildintellect, can you provide a bit more detail? What do you mean here? Do you want me to merge all of the CSV files from the bucket/key-prefix info you gave me into a single df and perform something like you did in the gist you also shared with me? Do you want me to create another gist, or put the code somewhere else?

chuckwondo commented 1 month ago

@mccabete, can you share the code you used to generate your graphs above?

mccabete commented 3 weeks ago

Hey @chuckwondo. Nothing revolutionary; the data comes from a CSV Sujen generated for me:

import pandas as pd
import matplotlib.pyplot as plt

data_path = '/projects/fireatlas_utils/data/EIS Job report - 20240208.csv'

# Column names for the report CSV; the file's own header row is replaced.
cols = ["timestamp", "id", "job_type", "job_tag", "cmd_time",
        "total_time", "dir_size", "job_queue", "status"]
dtypes = {col: str for col in cols}
dtypes["status"] = int
dps = pd.read_csv(data_path, header=0, names=cols, dtype=dtypes)

# The numeric columns arrive with thousands separators; strip the commas.
for col in ["cmd_time", "total_time", "dir_size"]:
    dps[col] = dps[col].replace(',', '', regex=True).astype(float)

# The exported timestamps contain an '@' separator; drop it, then parse.
dps['timestamp'] = pd.to_datetime(dps['timestamp'].replace('@', '', regex=True))

dps['cmd_time_mins'] = dps['cmd_time'] / 60

conus = dps[dps.job_type == "job-eis-fire-feds-nrt-dask-dps-v2:conus-dps"]
plt.plot(conus.timestamp, conus.cmd_time_mins / 60)  # minutes -> hours
plt.ylabel("CONUS DPS runtime hours")
plt.xticks(rotation=90)

chuckwondo commented 3 weeks ago

Thanks @mccabete. That's roughly what I figured.

Since we don't know where best to put things yet, I've created a gist for reading and querying the DPS Job Metrics CSV files. It's a private gist, but you can access it via this link: https://gist.github.com/chuckwondo/6a731811226bad18523f595da9ef5697
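If duckdb ends up being a better fit than pandas, the same sort of query can run directly over the CSVs; a rough sketch (the path is a placeholder, and column names follow the earlier report CSV):

import duckdb

# Average runtime per job type, straight from the CSV files on disk.
summary = duckdb.sql(
    """
    SELECT job_type,
           avg(total_time) / 3600.0 AS avg_total_hrs,
           count(*)                 AS n_jobs
    FROM read_csv_auto('dps-job-metrics/*.csv')
    GROUP BY job_type
    ORDER BY avg_total_hrs DESC
    """
).df()
print(summary)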

Let me know if you have any questions about it.

mccabete commented 3 weeks ago

Nice! My only question is about the underlying data we are querying. Is it the MAAP logs? What are the temporal constraints? (You mention March and April being incomplete -- is that because this just came online?)

chuckwondo commented 3 weeks ago

> Nice! My only question is about the underlying data we are querying. Is it the MAAP logs? What are the temporal constraints? (You mention March and April being incomplete -- is that because this just came online?)

The data is the DPS Job logs as supplied by JPL. Same format as the CSV file you used in your own code snippet.

The temporal constraints are due to when JPL started putting the logs in the location that the gist uses. Things started in March, but apparently something fell over while Sujen was out, and wasn't restarted until he returned, so there are days missing from March and April. May is complete (I think), and of course, June is ongoing at this point.

You could certainly start from March (instead of May, which is what I put in the gist). You just won't have much data from March and April.
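Starting from March is just a change to the date filter, e.g. (a sketch with a placeholder path; start_datetime as in the metrics CSVs):

import pandas as pd

jobs = pd.read_csv("dps_job_metrics.csv",  # placeholder path
                   parse_dates=["start_datetime"])
jobs = jobs[jobs["start_datetime"] >= pd.Timestamp("2024-03-01")]  # March onward instead of May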

mccabete commented 3 weeks ago

@jsignell and @ranchodeluxe - I am trying to do an apples-to-apples comparison of V2 and V3. Is the dask coordination job a "fair" comparison to the conus nrt stuff?

[image: runtime comparison, v3 dask coordination job vs. v2 CONUS NRT job]

mccabete commented 3 weeks ago

Actually @chuckwondo I have a few more questions about what the columns in the data frame mean.

chuckwondo commented 3 weeks ago

> Actually @chuckwondo I have a few more questions about what the columns in the data frame mean.
>
>   • Is "cmd time" referring to the actual execution time (as opposed to queueing time)?
>   • Does the "timestamp" refer to the start time of the job?
>   • What do status 0 and 1 mean?

What do you mean by "cmd time"? There's a cmd_duration_hrs column, which is the duration of the command (algorithm) in hours, matching what you plotted previously.

Regarding timestamp, you'll have to ask JPL exactly what that means. They're the ones producing the CSV files. However, it appears that timestamp is after end_datetime, so perhaps this isn't particularly useful to us. What probably makes better sense is to drop timestamp and use start_datetime instead.

Regarding status, again, a JPL question, but I'm fairly sure it indicates whether the job succeeded or failed. A successful job should have a status of 0, and an unsuccessful job should have any non-zero status (likely 1, but not necessarily), representing the exit status of the algorithm's "run" bash script.

I'll update the gist to use start_datetime instead of timestamp.
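Concretely, the updated read/plot will look roughly like this (a sketch with a placeholder path, using the column names mentioned above rather than anything verified against the gist):

import pandas as pd
import matplotlib.pyplot as plt

jobs = pd.read_csv("dps_job_metrics.csv",  # placeholder path
                   parse_dates=["start_datetime"])

# Keep only successful jobs (status 0) and order by when they started.
ok = jobs[jobs["status"] == 0].sort_values("start_datetime")

plt.plot(ok["start_datetime"], ok["cmd_duration_hrs"])  # duration already in hours
plt.ylabel("runtime (hrs)")
plt.xticks(rotation=90)
plt.show()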

wildintellect commented 3 weeks ago

@sujen1412 can you clarify the log column meanings?

chuckwondo commented 3 weeks ago

@mccabete, upon further investigation, I see that "cmd time" is not what I thought it was (I don't know what it is), but "total time" is what we want. The "total time" is indeed the difference between "start time" and "end time", so I made the following updates to the gist:

  • Use "total time" instead of "cmd time", which is the total execution time of the algorithm (i.e., this does not include time in the queue)
  • Use "start time" instead of "timestamp" for plotting against execution time (total time)
  • Select only rows where status is 0, so that we look at times only for successful jobs
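The start/end arithmetic can be sanity-checked with something like this (placeholder path; total_time assumed to be in seconds, per the earlier report CSV):

import pandas as pd

jobs = pd.read_csv("dps_job_metrics.csv",  # placeholder path
                   parse_dates=["start_datetime", "end_datetime"])

# "total time" should equal end - start, to within rounding.
delta_secs = (jobs["end_datetime"] - jobs["start_datetime"]).dt.total_seconds()
assert (delta_secs - jobs["total_time"]).abs().max() < 1.0  # total_time assumed seconds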

mccabete commented 1 week ago

@eorland -- here is the most up-to-date figure on how long the v3 runs are taking.

mccabete commented 1 week ago

@jsignell and @ranchodeluxe - I am trying to do an apples-to-apples comparison of V2 and V3. Is the dask coordination job a "fair" comparison to the conus nrt stuff?

[image: runtime comparison, v3 dask coordination job vs. v2 CONUS NRT job]

I forgot to bring this up in our last meeting. Can either of you comment on whether comparing these two jobs directly as a proxy for V3 vs. V2 speedup is "fair"? I am unclear on whether the v3 job contains a bunch of different smaller jobs, so it might not be fair.

ranchodeluxe commented 1 week ago

> @jsignell and @ranchodeluxe - I am trying to do an apples-to-apples comparison of V2 and V3. Is the dask coordination job a "fair" comparison to the conus nrt stuff? [image]
>
> I forgot to bring this up in our last meeting. Can either of you comment on whether comparing these two jobs directly as a proxy for V3 vs. V2 speedup is "fair"? I am unclear on whether the v3 job contains a bunch of different smaller jobs, so it might not be fair.

Yeah, dask is the one to read, but it does contain a bunch of smaller jobs for pre-processing and post-processing. Still think it's fair to compare to a clean v2 run.

mccabete commented 5 days ago

@wildintellect or @chuckwondo

Can you confirm that I can get the "queued_time" by subtracting the queued_datetime from end_datetime?

sujen1412 commented 5 days ago

I added a description for the fields being exported - https://github.com/MAAP-Project/Community/issues/1019#issuecomment-2204011595

cc @chuckwondo @wildintellect

mccabete commented 5 days ago

Thank you, Sujen. I am copying it here for reference:

Short description of the fields.

@timestamp: Time the metric was ingested into Elasticsearch.

_id: ID of the document in Elasticsearch.

job_type: Algorithm Type.

job.tag: User provided tag (identifier in maap.submitJob()).

job.job_info.cmd_duration: Duration for which the docker cmd ran. This includes the user's run script and a wrapper script that moves (not downloads) input files into the input dir and renames the output directory with a timestamp.

job.job_info.duration: Total duration for the job as per HySDS, includes time to download the container + any input files + the above job.job_info.cmd_duration.

job.job_info.metrics.job_dir_size: Size of the working directory of the job.

job.job_info.job_queue: Resource queue when the job was submitted.

job.job_info.status: Completion status of the job.

job.job_info.time_queued: Time job was sent to HySDS and added to the resource queue.

job.job_info.time_start: Time HySDS started the job, i.e., before it starts to download the container and any input files. This can also be interpreted as the time a worker picked up the job from the queue.

job.job_info.time_end: Time the job completed, does not include time to upload output products.

We capture the time it took to upload the output product in a different field, which is not currently exported in the metrics.
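Translating those fields into the timings we've been discussing (a sketch; the exported column names below are my guesses, not verified against the CSVs):

import pandas as pd

jobs = pd.read_csv("dps_job_metrics.csv",  # placeholder path
                   parse_dates=["queued_datetime", "start_datetime", "end_datetime"])

# Time spent waiting in the resource queue before a worker picked the job up.
jobs["queue_wait_hrs"] = (jobs["start_datetime"] - jobs["queued_datetime"]).dt.total_seconds() / 3600

# Time on the worker: container + input download + the algorithm run itself.
jobs["exec_hrs"] = (jobs["end_datetime"] - jobs["start_datetime"]).dt.total_seconds() / 3600

# End to end, from submission to completion (excludes output upload, per above).
jobs["end_to_end_hrs"] = jobs["queue_wait_hrs"] + jobs["exec_hrs"]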

mccabete commented 5 days ago

So the units of my above plot are correct. Turns out cmd_hours is end time minus start time, and total is end time minus queued time.

Here is the time queued:

[image: time spent in queue]

Now that I know my units are right: @ranchodeluxe, did V3 stop around June 7th? Is it called something else?

mccabete commented 5 days ago

My conclusions are:

mccabete commented 5 days ago

> @mccabete, upon further investigation, I see that "cmd time" is not what I thought it was (I don't know what it is), but "total time" is what we want. The "total time" is indeed the difference between "start time" and "end time", so I made the following updates to the gist:
>
>   • Use "total time" instead of "cmd time", which is the total execution time of the algorithm (i.e., this does not include time in the queue)
>   • Use "start time" instead of "timestamp" for plotting against execution time (total time)
>   • Select only rows where status is 0, so that we look at times only for successful jobs

Wait, this can't be right. total_time does include the queued time:

[image: plot suggesting total_time includes time in the queue]

chuckwondo commented 4 days ago

@mccabete, the total seconds of job.job_info.time_end - job.job_info.time_start (end_datetime - start_datetime) is equal to job.job_info.duration (execution_time_secs). This does not include time in the queue.