mccabete opened this issue 2 months ago
@jsignell and I will get you details @mccabete on how to access more recent v3 run metadata for comparison.
Thanks for starting this @mccabete. @jsignell and I will work to get you information on how to access more recent logs so you can analyze the V3 logs. We do ask that you post your log-analysis code to GitHub when ready so the rest of the MAAP/VEDA community can learn. https://github.com/MAAP-Project/maap-documentation-examples/ would accept your code if you don't have another home.
@wildintellect, anything I can do here to help out?
@chuckwondo take a look at merging the existing logs for a first pass at the data, not sure what the best approach is - pandas, duckdb, etc... (I'll share the bucket info internally)
@wildintellect, can you provide a bit more detail? What do you mean here? Do you want me to merge all of the csv files from the bucket/key-prefix info you gave me into a single df, and perform something like you did in the gist you also shared with me? Do you want me to create another gist, put the code somewhere else?
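For the merging step, here is a minimal sketch, assuming the per-export CSV files have been synced to a local `dps-metrics/` directory (the actual bucket and prefix were shared privately, so the path here is hypothetical):

```python
import glob
import pandas as pd

# Hypothetical local directory holding the DPS metrics CSV exports
# (e.g., after an `aws s3 sync` from the bucket/prefix shared internally).
csv_paths = sorted(glob.glob("dps-metrics/*.csv"))

# Stack all files (assumed to share one schema) into a single DataFrame.
frames = [pd.read_csv(p) for p in csv_paths]
merged = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Exports may overlap in time, so drop exact duplicate rows.
merged = merged.drop_duplicates()
```

DuckDB could do roughly the same in one query (something like `SELECT * FROM read_csv_auto('dps-metrics/*.csv')`), which may scale better if the logs grow large.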
@mccabete, can you share your code that you used for generating your graphs above?
Hey @chuckwondo. Nothing revolutionary; the data comes from a CSV Sujen generated for me.
```python
import pandas as pd
import matplotlib.pyplot as plt

data_path = '/projects/fireatlas_utils/data/EIS Job report - 20240208.csv'
dps = pd.read_csv(
    data_path,
    header=0,
    names=["timestamp", "id", "job_type", "job_tag", "cmd_time",
           "total_time", "dir_size", "job_queue", "status"],
    dtype={"timestamp": object, "id": str, "job_type": str, "job_tag": str,
           "cmd_time": str, "total_time": str, "dir_size": str,
           "job_queue": str, "status": int},
)

# Strip thousands separators and cast the numeric columns to floats.
for col in ["cmd_time", "total_time", "dir_size"]:
    dps[col] = dps[col].replace(',', '', regex=True).astype(float)

# Drop the '@' so pandas can parse the timestamps.
dps['timestamp'] = dps['timestamp'].replace('@', '', regex=True).astype('datetime64[ns]')

dps['cmd_time_mins'] = dps['cmd_time'] / 60

conus = dps[dps.job_type == "job-eis-fire-feds-nrt-dask-dps-v2:conus-dps"]
plt.plot(conus.timestamp, conus.cmd_time_mins / 60)
plt.ylabel("CONUS DPS runtime hours")
plt.xticks(rotation=90)
```
Thanks @mccabete. That's roughly what I figured.
Since we don't know where best to put things yet, I've created a gist for reading and querying the DPS Job Metrics CSV files. It's a private gist, but you can access it via this link: https://gist.github.com/chuckwondo/6a731811226bad18523f595da9ef5697
Let me know if you have any questions about it.
Nice! My only question is about the underlying data we are querying. Is it the MAAP logs? What are the temporal constraints? (You mention March and April being incomplete -- is that because this just came online?)
The data is the DPS Job logs as supplied by JPL. Same format as the CSV file you used in your own code snippet.
The temporal constraints are due to when JPL started putting the logs in the location that the gist uses. Things started in March, but apparently something fell over while Sujen was out, and wasn't restarted until he returned, so there are days missing from March and April. May is complete (I think), and of course, June is ongoing at this point.
You could certainly start from March (instead of May, which is what I put in the gist). You just won't have much data from March and April.
@jsignell and @ranchodeluxe - I am trying to do an apples to apples comparison of V2 and V3. Is the dask coordination job a "fair" comparison to the conus nrt stuff?
Actually @chuckwondo I have a few more questions about what the columns in the data frame mean.
- Is "cmd time" referring to the actual execution time (as opposed to queueing time?)?
- Does the "timestamp" refer to start time of the job?
- What do status 0 and 1 mean?
What do you mean by "cmd time"? There's a `cmd_duration_hrs` column, which is the duration of the command (algorithm) in hours, matching what you plotted previously.

Regarding `timestamp`, you'll have to ask JPL exactly what that means. They're the ones producing the CSV files. However, it appears that `timestamp` is after `end_datetime`, so perhaps this isn't particularly useful to us. What probably makes better sense is to drop `timestamp` and use `start_datetime` instead.

Regarding `status`, again, a JPL question, but I'm fairly sure that indicates whether the job succeeded or failed. A successful job should have a status of 0, and an unsuccessful job should have any non-zero status (likely 1, but not necessarily), representing the exit status of the algorithm's "run" bash script.

I'll update the gist to use `start_datetime` instead of `timestamp`.
@sujen1412 can you clarify the log column meanings?
@eorland -- here is the most up to date figure on how long the v3 runs are taking
@jsignell and @ranchodeluxe - I am trying to do an apples to apples comparison of V2 and V3. Is the dask coordination job a "fair" comparison to the conus nrt stuff?
I forgot to bring this up in our last meeting. Can either of you comment on whether comparing these two jobs directly as a proxy for V3 vs V2 speedup is "fair"? I am unclear on whether the v3 job contains a bunch of different smaller jobs, so it might not be fair.
yeah, dask is the one to read but it does contain a bunch of smaller jobs for pre-process and post-process. still think it's fair to compare to a clean v2 run
@wildintellect or @chuckwondo
Can you confirm that I can get the "queued_time" by subtracting `queued_datetime` from `end_datetime`?
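For reference, here is a sketch of that arithmetic in pandas. The column names `queued_datetime`, `start_datetime`, and `end_datetime` are taken from the gist, and the values below are made up:

```python
import pandas as pd

# Made-up rows standing in for the merged DPS metrics frame.
df = pd.DataFrame({
    "queued_datetime": pd.to_datetime(["2024-06-01 00:00", "2024-06-01 01:00"]),
    "start_datetime":  pd.to_datetime(["2024-06-01 00:10", "2024-06-01 01:05"]),
    "end_datetime":    pd.to_datetime(["2024-06-01 01:10", "2024-06-01 02:05"]),
})

# end - queued covers the queue wait *plus* execution time;
# the wait alone is start - queued.
df["queued_to_end_hrs"] = (df["end_datetime"] - df["queued_datetime"]).dt.total_seconds() / 3600
df["queue_wait_hrs"] = (df["start_datetime"] - df["queued_datetime"]).dt.total_seconds() / 3600
```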
I added a description for the fields being exported - https://github.com/MAAP-Project/Community/issues/1019#issuecomment-2204011595
cc @chuckwondo @wildintellect
Thank you Sujen. I am copying it here for reference:
Short description of the fields:

- `@timestamp`: Time the metric was ingested into Elasticsearch.
- `_id`: ID of the document in Elasticsearch.
- `job_type`: Algorithm type.
- `job.tag`: User-provided tag (identifier in `maap.submitJob()`).
- `job.job_info.cmd_duration`: Duration for which the docker cmd ran. This includes the user's run script and a wrapper script that moves (not downloads) input files into the input dir and renames the output directory with a timestamp.
- `job.job_info.duration`: Total duration for the job as per HySDS; includes time to download the container plus any input files, plus the above `job.job_info.cmd_duration`.
- `job.job_info.metrics.job_dir_size`: Size of the working directory of the job.
- `job.job_info.job_queue`: Resource queue when the job was submitted.
- `job.job_info.status`: Completion status of the job.
- `job.job_info.time_queued`: Time the job was sent to HySDS and added to the resource queue.
- `job.job_info.time_start`: Time HySDS started the job, i.e., before it starts to download the container and any input files. This can also be interpreted as the time a worker picked up the job from the queue.
- `job.job_info.time_end`: Time the job completed; does not include time to upload output products.

We capture the time it took to upload the output product in a different field, which is not currently exported in the metrics.
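Given those definitions, here is a sketch of the durations one could derive from a single record. The values are illustrative, and the column names are flattened here for readability, so they may not match the CSV exactly:

```python
import pandas as pd

# One made-up job record using the fields described above.
job = pd.Series({
    "time_queued":  pd.Timestamp("2024-06-01 00:00:00"),
    "time_start":   pd.Timestamp("2024-06-01 00:05:00"),
    "time_end":     pd.Timestamp("2024-06-01 01:05:00"),
    "cmd_duration": 3000.0,  # seconds the docker cmd ran
    "duration":     3600.0,  # total seconds per HySDS (start to end)
})

# Time spent waiting in the resource queue before a worker picked the job up.
queue_wait_secs = (job["time_start"] - job["time_queued"]).total_seconds()

# Overhead around the cmd itself: container + input download time.
overhead_secs = job["duration"] - job["cmd_duration"]
```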
So the units of my above plot are correct. It turns out `cmd_hours` is the result of end - start, and total is end - queued_time.
Here is the time queued:
Now that I know my units are right, @ranchodeluxe did V3 stop ~ June 7th? is it called something else?
My conclusions are:
@mccabete, upon further investigation, I see that "cmd time" is not what I thought it was (I don't know what it is), but "total time" is what we want. The "total time" is indeed the difference between "start time" and "end time", so I made the following updates to the gist:
- Use "total time" instead of "cmd time", which is the total execution time of the algorithm (i.e., this does not include time in the queue)
- Use "start time" instead of "timestamp" for plotting against execution time (total time)
- Select only rows where status is 0, so that we look at times only for successful jobs
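A minimal sketch of those changes, with column names as in the gist and made-up data:

```python
import pandas as pd

# Stand-in for the merged DPS metrics frame.
dps = pd.DataFrame({
    "start_datetime": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03"]),
    "total_time": [3600.0, 7200.0, 5400.0],  # execution time in seconds
    "status": [0, 1, 0],
})

# Keep successful jobs only (status 0), then convert seconds to hours.
ok = dps[dps["status"] == 0].copy()
ok["total_time_hrs"] = ok["total_time"] / 3600
# plt.plot(ok["start_datetime"], ok["total_time_hrs"]) as in the earlier snippet.
```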
Wait, this can't be right. total_time does include the queued time:
@mccabete, the total seconds of `job.job_info.time_end - job.job_info.time_start` (`end_datetime - start_datetime`) is equal to `job.job_info.duration` (`execution_time_secs`). This does not include time in the queue.
Members of the team have long complained that the DPS job (version 2) takes a long time to run. I wanted to put some numbers on that over the course of the fire season.
Time of run seems proportional to directory size:
Hopefully, version three will be faster.