art-framework-suite / art

The implementation of the art physics event processing framework

Feature request to keep track of memory, wall time and CPU associated with output files. #110

Open knoepfel opened 3 years ago

knoepfel commented 3 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26068 (FNAL account required) Originally created by @hschellman on 2021-07-23 22:58:22


Is it possible to get the memory, wall time and CPU utilization for a job written to the SAM (or successor) metadata for an output file? It sounds simple at first: just dump them at end of job. But if you are writing multiple files to multiple streams it gets complicated, as one would need to maintain a separate stats struct for each file that is initialized at file open and written to the metadata at file close. Some of this obviously exists, as art does produce metadata for files.

(I wrote the D0 SAM output interface back in the days of the ancients, so I know you can do this if you can find the file open/close hooks.) May have used FORTRAN 2 for all I know.

DUNE is hoping to really instrument our jobs and this would be a great help.
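
The per-file bookkeeping described in this request can be sketched roughly as follows. This is only an illustration in Python, not art code: `FileStats`, `OutputFileTracker`, `on_file_open` and `on_file_close` are hypothetical names standing in for whatever file open/close hooks the framework provides, and the returned dictionary stands in for fields written to the file's metadata. Note that the peak memory reported here is process-wide, so it cannot be attributed to a single file.

```python
import resource
import time
from dataclasses import dataclass, field


@dataclass
class FileStats:
    """Resource counters for one output file (hypothetical structure)."""
    # Both clocks are sampled when the object is created, i.e. at file open.
    wall_start: float = field(default_factory=time.monotonic)
    cpu_start: float = field(default_factory=time.process_time)

    def snapshot(self) -> dict:
        # ru_maxrss is the process-wide peak resident set size (kilobytes
        # on Linux, bytes on macOS); it is not reset per file.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return {
            "wall_time_s": time.monotonic() - self.wall_start,
            "cpu_time_s": time.process_time() - self.cpu_start,
            "max_rss": peak,
        }


class OutputFileTracker:
    """Keeps one FileStats per open output file, keyed by file name."""

    def __init__(self):
        self._open = {}

    def on_file_open(self, name: str) -> None:
        # Start a fresh set of counters for this file.
        self._open[name] = FileStats()

    def on_file_close(self, name: str) -> dict:
        # Stop tracking the file and return the values that would be
        # written to its metadata.
        return self._open.pop(name).snapshot()
```

With two streams open at once, each file gets its own counters: calling `on_file_open` for both and `on_file_close` for one leaves the other still being tracked.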

knoepfel commented 3 years ago

Comment by @knoepfel on 2021-07-27 21:23:58


Heidi, we should probably have a meeting to discuss this idea. Some of the metrics are already captured by art, but it's not clear to us what exactly you're after. I'll set up a meeting.

knoepfel commented 3 years ago

Comment by @knoepfel on 2021-08-17 15:51:52


Tom and I met this morning to discuss what is being asked for in this proposal. After some discussion, it seemed that the request is for just enough information persisted to the on-disk SAM metadata to identify a workflow/job that is problematic wrt timing and memory usage. After identifying a problematic job using the SAM metadata information, a user can interactively run the job to debug or profile further. At this point, only the overall wall clock time and the max. memory usage would need to be persisted to the metadata.

Does that sound sensible?

knoepfel commented 3 years ago

Comment by @tomjunk on 2021-08-17 16:28:35


Yes, sounds good, though the original request was for three numbers: memory, wall time and CPU time. This doesn't capture all bottlenecks (for example, some jobs spend a lot of wall time waiting for files before art even starts), but it is a big help, and we cannot ask art to solve that problem. It may be possible to get the art wall time by taking start_time and end_time from sam_metadata_dumper's output and subtracting them, but a separate, pre-subtracted field may be even more convenient. Thanks!
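
A minimal sketch of that subtraction, assuming the `start_time` and `end_time` values have already been extracted into a dictionary and are ISO 8601 strings; the exact output format of the dumper is not shown in this thread, so both the dictionary layout and the sample values below are made up for illustration.

```python
from datetime import datetime


def wall_time_seconds(metadata: dict) -> float:
    """Return end_time minus start_time, in seconds, for one file."""
    # Assumes ISO 8601 timestamps such as "2021-01-01T00:00:00".
    start = datetime.fromisoformat(metadata["start_time"])
    end = datetime.fromisoformat(metadata["end_time"])
    return (end - start).total_seconds()


# Hypothetical record; the values are invented for this example.
example = {
    "start_time": "2021-01-01T00:00:00",
    "end_time": "2021-01-01T01:30:00",
}
print(wall_time_seconds(example))  # 5400.0
```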

hschellman commented 3 years ago

Yes, although CPU time would provide efficiency information.

So max memory and wall time are the essentials; actual CPU time is useful as well.

And this helps us identify bad workflows but also typical parameters for a given workflow that can inform job placement.

Heidi
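
The efficiency figure mentioned here is commonly taken as CPU time divided by wall time. A small sketch under that assumption follows; the function name and the sample numbers are illustrative only. For a multi-threaded job the process CPU time sums over threads, so the ratio can exceed 1.

```python
def cpu_efficiency(cpu_time_s: float, wall_time_s: float) -> float:
    """Ratio of CPU time to wall-clock time for one job."""
    if wall_time_s <= 0:
        raise ValueError("wall time must be positive")
    return cpu_time_s / wall_time_s


# A job that used 45 s of CPU over 60 s of wall time.
print(cpu_efficiency(45.0, 60.0))  # 0.75
```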


-- Prof. Heidi Schellman Department of Physics, Oregon State University Head, DUNE Collaboration Computing Consortium
