ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

CQ2 - Resource usage #10

Closed simleo closed 2 months ago

simleo commented 2 years ago

How much memory/cpu/disk was used in run?

simleo commented 1 year ago

The first thing is knowing what we're going to model. What does each workflow manager already provide? What could be added that's missing, and how hard would it be to add it?

More generally, we need to know about any information / logging data (and how structured it is) provided by the framework that runs the workflow, not just resource usage: timestamps, user info, containerization, etc. Another useful categorization is what's available by default and what needs to be explicitly activated (e.g., user info in cwltool).

mr-c commented 1 year ago

cwltool tracks peak memory usage, and start & stop times for jobs & steps. We don't currently track peak disk usage or CPU time (but both could be added).

simleo commented 1 year ago

Discussed with @ilveroluca earlier this morning. One source of confusion was my suggestion of memoryRequirements or storageRequirements for the property names, which was not consistent with the question "How much memory/cpu/disk was used in run". I have now removed that suggestion from the issue's description.

That's not to say that such indications are not useful: they are quite useful to those who want to reproduce the run, since they allow them to plan ahead, but they've got nothing to do with what happened during the run. Such indications don't come from the observation of a single run, but rather from the experience (or -- even better -- statistics) of the author(s) or anyone who's worked with the application in various scenarios. They are part of prospective, not retrospective provenance, so we should expect them to come from the workflow's author / maintainer. Indeed, CWL has ResourceRequirement for this purpose. I've now opened #32 to track this.

Here, instead, we are focusing on resource usage information for the specific run described in the crate, such as the peak memory usage mentioned by Michael. This isn't just useful to enrich the metadata about the run: it might be the only hint available in all cases where requirements as discussed above are not available (which I expect to be the majority: even when they are known, the authors might not take the time to provide them); additionally, with a sufficiently large number of runs, it could be used to get a good estimate of the general requirements.

simleo commented 1 year ago

cwltool tracks peak memory usage

Where is that recorded? Is it available from the CWLProv output?

jmfernandez commented 1 year ago

Regarding Nextflow: when the trace option is enabled, detailed statistics about memory and CPU usage are gathered for each step of the execution. The monitoring is done from the shell script created to execute the workflow step. As this script is written in bash and also depends on additional tools for the detailed monitoring (mainly ps, grep and awk), not all container instances allow this detailed statistics gathering.

This is the information usually gathered for each executed step:

nextflow.trace/v2
realtime=1361
%cpu=1380
rchar=19914333
wchar=791
syscr=1252
syscw=30
read_bytes=0
write_bytes=4096
%mem=4
vmem=524224
rss=69416
peak_vmem=568880
peak_rss=112264
vol_ctxt=6
inv_ctxt=444
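
As an illustration, the key=value format above is straightforward to consume. A hedged sketch in plain Python (no Nextflow tooling; only the field layout is taken from the example):

```python
# Sketch: parse a Nextflow trace file like the one above into a dict.
# The field names and the "nextflow.trace/v2" header come from the example;
# the integer assumption (ms, bytes, counts, percents) is based on its values.

def parse_nextflow_trace(text: str) -> dict:
    """Parse key=value lines from a Nextflow trace, skipping the header."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and the "nextflow.trace/v2" header
        key, _, value = line.partition("=")
        metrics[key] = int(value) if value.lstrip("-").isdigit() else value
    return metrics

trace = """nextflow.trace/v2
realtime=1361
%cpu=1380
peak_rss=112264"""
print(parse_nextflow_trace(trace))
# {'realtime': 1361, '%cpu': 1380, 'peak_rss': 112264}
```
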
suecharo commented 1 year ago

Sapporo is a Workflow Execution System (WES), so it calls workflow engines (e.g., cwltool and Nextflow) internally. Here, Sapporo and the workflow engines all run as Docker containers. Because Sapporo does not use the Docker API (we wanted to support runtimes other than Docker), it was difficult to get detailed resource information...

The RO-Crate generated by Sapporo stores the following information (as seen from inside the Sapporo container):

This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L697 . https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L470 is an example entity in a Sapporo-generated RO-Crate.

We also add the log (or provenance) of the WES (Sapporo) itself.

This information is originally stored in Sapporo as the run_dir (https://github.com/sapporo-wes/sapporo-service#run-dir). https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L632 is an example entity in a Sapporo-generated RO-Crate.

Furthermore, we collect information about each file (input/output files):

This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L318 . https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L243 is an example entity in a Sapporo-generated RO-Crate.

simleo commented 1 year ago

From @rsirvent: example of basic statistics with COMPSs: https://workflowhub.eu/workflows/386 -> App_Profile.json

rsirvent commented 1 year ago

Sorry I missed this thread. Let me elaborate a bit more on what we provide with COMPSs. We have some ways to gather statistics / understand resource usage:

Also, more general information about the resources (how many cores, how much memory, etc. they have) can be found in the COMPSs XML configuration files (resources.xml) (see: https://compss-doc.readthedocs.io/en/stable/Sections/01_Installation/06_Configuration_files.html).

Hope it helps.

simleo commented 1 year ago

What to represent?

One of the main challenges here is providing guidelines that make sense for a wide variety of operating systems and workflow engines. This means we cannot go too deep into details. For instance, Nextflow tracing provides details such as virtual memory vs resident set, but this distinction might not apply to all systems and / or be made by all workflow engines. cwltool and Arvados, for instance, simply give "max memory used", which is sufficiently general.

The Nextflow tracing example shows that this kind of information can be very detailed; for generality and simplicity, I think we should focus on the most important bits, especially for the first release of the profiles. We could represent the following:

How to represent it?

Schema.org has things like memoryRequirements for SoftwareApplication, but these are minimum requirements to run the application. We need to describe actual resource usage and tie it to the actions. I could not find anything for this in the current RO-Crate context, so we probably need new terms. I've searched ontologies for inspiration, but could not find much (e.g., the WICUS hardware specs also seem focused on application requirements).

The simplest approach is to add the properties directly to the action, for instance:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    "memoryUsage": "8.43GB",
    "cpuUsage": "140.2%",
    "gpuUsage": "70.3%",
    "usedCpus": "2",
    "usedGpus": "1",
    ...
}

With CPU / GPU details:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    "memoryUsage": "8.43GB",
    "cpuUsage": "140.2%",
    "gpuUsage": "70.3%",
    "usedCpus": [{"@id": "#cpu-1"}, {"@id": "#cpu-2"}]
    "usedGpus": {"@id": "#gpu-1"},
    ...
},
{
    "@id": "#cpu-1",
    "@type": "HardwareComponent",
    "model": "FooBar 314Pro",
    ...
},
{
    "@id": "#cpu-1",
    "@type": "HardwareComponent",
    "model": "FooBar XG666",
    ...
}
kinow commented 1 year ago

Autosubmit (which will support RO-Crate soon) keeps track of some variables like memory, CPU and disk that would fit the model discussed so far, I think... but there are other metrics reported that I am not sure would fit in the resource usage here (maybe they'd be reported somewhere else in the RO-Crate archive?).

I think a more flexible approach, allowing custom values to be added, would be useful, from what I've understood of the topic so far (still getting familiar with RO-Crate, how to implement it, etc., sorry).

-Bruno

simleo commented 1 year ago

I think a more flexible approach, allowing for custom values to be added would be useful

Several group members were of the same idea at yesterday's meeting, with doubts expressed about the addition of "fixed" properties that might fit the descriptions given by the various engines / systems poorly. Since this will require substantially more thought, I'm removing this issue from the 0.1 milestone.

simleo commented 1 year ago

I'm removing this issue from the 0.1 milestone

What we can easily add for the 0.1 release is a recommendation to add engine-specific logs, reports, traces, etc. to the crate. They can easily be tied to the corresponding actions via about. Example:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    ...
},
{
    "@id": "trace-20230120-40360336.txt",
    "@type": "File",
    "name": "Nextflow trace for action-1",
    "conformsTo": "https://www.nextflow.io/docs/latest/tracing.html#trace-report",
    "encodingFormat": "text/tab-separated-values",
    "about": "#action-1"
}

This is vanilla RO-Crate, so it does not require adding any terms or specific requirements. Moreover, doing this requires very little effort from the crate producer. Having the information there is already quite useful; a future framework for a uniform representation of it would then be an improvement in interoperability.
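
As a sketch of what a producer would do, the example above can be generated with a few lines of plain Python (dict manipulation only, no RO-Crate library; the ids, profile URL and media type are copied from the example):

```python
# Sketch: attach an engine-specific trace file to its action via "about",
# as recommended above. The entity layout follows the example in this thread.

def attach_trace(graph: list, action_id: str, trace_path: str, profile: str) -> None:
    """Append a File entity for an engine trace, pointing back at the action."""
    graph.append({
        "@id": trace_path,
        "@type": "File",
        "name": f"Nextflow trace for {action_id.lstrip('#')}",
        "conformsTo": profile,
        "encodingFormat": "text/tab-separated-values",
        "about": action_id,
    })

graph = [{"@id": "#action-1", "@type": "CreateAction"}]
attach_trace(graph, "#action-1",
             "trace-20230120-40360336.txt",
             "https://www.nextflow.io/docs/latest/tracing.html#trace-report")
assert graph[1]["about"] == "#action-1"
```
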

stain commented 1 year ago

Great, also declare #trace-report as in https://stain.github.io/ro-crate/1.2-DRAFT/data-entities#file-format-profiles

{
    "@id": "https://www.nextflow.io/docs/latest/tracing.html#trace-report",
    "@type": "CreativeWork",
    "name": "Nextflow trace report TSV profile"
}
simleo commented 1 year ago

Now that 0.1 is out, recapping the latest discussions on the next steps, the general idea is to use a system based on key-value pairs. So this example:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    "memoryUsage": "8.43GB",
    "cpuUsage": "140.2%",
    "usedCpus": "2"
}

could become something like:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    "resourceUsage": [
        {"@id": "#action-1-memory"},
        {"@id": "#action-1-cpu"},
        {"@id": "#action-1-nCpu"},
    ]
},
{
    "@id": "#action-1-memory",
    "@type": "PropertyValue",
    "name": "memory",
    "value": "8.43GB",
},
{
    "@id": "#action-1-cpu",
    "@type": "PropertyValue",
    "name": "cpu",
    "value": "140.2%",
},
{
    "@id": "#action-1-nCpu",
    "@type": "PropertyValue",
    "name": "nCpu",
    "value": "2",
}

Note that ids like #action-1-memory are not important; they could be UUIDs: the role of the key in PropertyValue is played by name.
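
As an illustration of the mapping above, a producer could derive the PropertyValue entities and the resourceUsage references from a flat metrics dict. This is a hedged sketch in plain Python; note that resourceUsage is the proposed new term, not an existing schema.org property:

```python
# Sketch: turn a flat metrics dict into PropertyValue entities plus the
# resourceUsage references to put on the action, as in the example above.

def to_resource_usage(action_id: str, metrics: dict) -> tuple:
    """Return (resourceUsage refs, PropertyValue entities) for an action."""
    refs, entities = [], []
    for name, value in metrics.items():
        pv_id = f"{action_id}-{name}"  # ids are arbitrary; could be UUIDs
        refs.append({"@id": pv_id})
        entities.append({
            "@id": pv_id,
            "@type": "PropertyValue",
            "name": name,   # the "key" role is played by name
            "value": value,
        })
    return refs, entities

refs, entities = to_resource_usage(
    "#action-1", {"memory": "8.43GB", "cpu": "140.2%", "nCpu": "2"})
assert refs[0] == {"@id": "#action-1-memory"}
assert entities[2]["name"] == "nCpu"
```
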

One advantage of this is that it requires just one extra term to be defined, resourceUsage. The other main advantage, as we discussed, is that this is extensible by RO-Crate producers, e.g., workflow engines. For instance, Autosubmit could add "energy" or "sypd", and cwltool "peakMemory". The meaning of the corresponding values would depend on the workflow engine, so consumers should be able to determine that information. Looking at the workflow language in the prospective part is not enough, since there can be multiple engines implementing the same language. We need a conformsTo, which needs a type to be attached to. For instance:

{
    "@id": "#action-1",
    "@type": "CreateAction",
    "resourceUsage": {"@id": "#action-1-ru"}
},
{
    "@id": "#action-1-ru",
    "@type": "Dataset",
    "conformsTo": "http://example.org/cwltool-rocrate-ru-spec",
    "variableMeasured": [
        {"@id": "#action-1-peakMemory"}
    ]
},
{
    "@id": "#action-1-peakMemory",
    "@type": "PropertyValue",
    "name": "peakMemory",
    "value": "8.43GB"
},
{
    "@id": "http://example.org/cwltool-rocrate-ru-spec",
    "@type": "CreativeWork",
    "name": "cwltool RO-Crate resource usage spec",
    "version": "0.1"
}

Note that I've used Dataset, which seems a very good fit since it has variableMeasured, which has PropertyValue in its range. In RO-Crate, Dataset is also used for directories, but they would not be mixed up since this Dataset would not be linked to from the root's hasPart.

Regarding keys:

kinow commented 1 year ago

One advantage of this is that it requires just one extra term to be defined, resourceUsage. The other main advantage, as we discussed, is that this is extensible by RO-Crate producers, e.g., workflow engines. For instance, Autosubmit could add "energy" or "sypd", and cwltool "peakMemory". The meaning of the corresponding values would depend on the workflow engine, so consumers should be able to determine that information. Looking at the workflow language in the prospective part is not enough, since there can be multiple engines implementing the same language. We need a conformsTo, which needs a type to be attached to. For instance:

Sounds good! I've added a note in our merge request that adds RO-Crate support about testing how to record the metrics into the metadata, preferably in a format like this one (although I might push that to after our merge request is merged).

Should we use namespaces for keys? E.g., cwl.ru.peakMemory? Should we define our own "common" set of basic keys to allow some level of cross-engine crate comparison? E.g., wrroc.ru.memory. Crate producers would be free to add entries for these or not, but if added they would have to be in the expected format (for instance, "memory must mean peak memory and be expressed in bytes" or "memory must mean average memory and be expressed in megabytes", etc.)

I think we should do both: use namespaces, and also have a common set of keys. But this common set of keys must have a really good description of each key. For instance, for cpus or cores, saying something like "number of cores used to run the workflow" can mean different things. For example, for a WMS that uses a remote batch server like Slurm, does cpus mean the cores that I am using to run the WMS process locally, plus the cores used in Slurm? Ditto if we have energy: total energy, or just for running the remote part? If we have one for the disk space used, does it include the logs produced by the WMS to run the workflow, or just the size of the data produced by the workflow (we recently had an issue where a log generated by one of the workflows caused a short interruption of service, but we detected it retroactively), etc.

This way I would be able to use wrroc.ru.cpus or wrroc.ru.cores, but if the meaning of cpus for my WMS differed from the one in the specification, then I'd just go with something like mywms.ru.cpus.

Do you have any idea if this would be available in tools like CWL Viewer and WorkflowHub.eu? For example, if I want to find all the workflows that used mywms.ru.cpus greater than 100, could I search for something like mywms.ru.cpus >= 100?

-Bruno

simleo commented 1 year ago

For example, for a WMS that uses a remote batch servers like Slurm. Does cpus mean the cores that I am using to run the WMS process locally, plus the cores used in Slurm?

The resource usage dataset is associated with a specific action. In a Provenance Run Crate, where there's an action for each task, one can record each task's resource usage separately. In a Workflow Run Crate you could record the sum of cores used by all tasks (in the action that represents the workflow run), but for things like usage percentages you probably want the average. In Provenance Run Crates, OTOH, if you have per-task resource usage it's probably better not to record resource usage for the whole workflow run. Resources used by the WMS itself should probably be associated with the corresponding OrganizeAction, rather than the workflow run.

Do you have any idea if this would be available in tools like CWL Viewer and WorkflowHub.eu?

They're both focused on prospective provenance, so I think it's unlikely.

stain commented 1 year ago

I think this is almost there -- but https://schema.org/PropertyValue already has a property, propertyID, for this purpose, so I think that would work better than conformsTo:

{
    "@id": "#action-1-peakMemory",
    "@type": "PropertyValue",
    "name": "peakMemory",
    "propertyID": "https://example.org/cwltool-rocrate-ru-spec#peakMemory",
    "value": 8.43,
    "unitText": "GiB"
},

(I also added unitText, but if these strings come arbitrarily from some engine you may not know the unit, in which case it would remain part of the value string. There is likewise https://schema.org/unitCode in case the unit has an identifier.)

It would still make sense to add https://example.org/cwltool-rocrate-ru-spec (no #) to the ./ root dataset as a conformsTo profile and as a contextual entity, as it will hopefully define what those properties mean -- in the most advanced form as DefinedTerm instances, but probably just mentioned in the HTML.

I would say that propertyID should be a SHOULD and unitText a MAY.
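
To illustrate the unitText point, a producer receiving combined strings like "8.43GB" could attempt to split them into value and unit, keeping the raw string as the value when parsing fails. A hedged sketch (the regex and the function name are assumptions, not part of any spec):

```python
# Sketch: split an engine-reported string like "8.43GB" into (value, unit).
# When the string doesn't parse, return it untouched so it can stay in
# "value" as-is, per the comment above.
import re

def split_value_unit(raw: str):
    """Return (8.43, 'GB') for '8.43GB'; (raw, None) if no number is found."""
    m = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z%]+)?\s*", raw)
    if not m:
        return raw, None
    return float(m.group(1)), m.group(2)

assert split_value_unit("8.43GB") == (8.43, "GB")
assert split_value_unit("140.2%") == (140.2, "%")
assert split_value_unit("n/a") == ("n/a", None)
```
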

kinow commented 1 year ago

A bit related, similar to what we do in Autosubmit & in our HPC in tracking energy usage: I learned today from an IIIF email about a draft HTTP header for carbon emissions: https://www.ietf.org/archive/id/draft-martin-http-carbon-emissions-scope-2-00.html

So I believe this shows that there are groups interested in tracking resources like energy consumption, carbon emissions, etc., that are different from the most common ones like memory/cpu/disk :+1:

jmfernandez commented 1 year ago

As of today's TC, I have realized that https://schema.org/QuantitativeValue could be helpful to describe resource requirements in both prospective and retrospective provenance, due to its capability to describe the value along with minimum and maximum ones.
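
A hedged sketch of how such an entity might look (value, minValue, maxValue and unitText are real schema.org QuantitativeValue properties; the numbers and the idea of mixing an observed value with prospective bounds are illustrative assumptions only):

```python
# Sketch: one QuantitativeValue carrying both the observed value of this run
# and prospective bounds. All numbers are made up for illustration.
peak_memory = {
    "@id": "#action-1-peakMemory",
    "@type": "QuantitativeValue",
    "name": "peakMemory",
    "value": 8.43,     # what this run actually used (retrospective)
    "minValue": 4.0,   # e.g. a known-workable lower bound (prospective)
    "maxValue": 16.0,  # e.g. the allocation ceiling (prospective)
    "unitText": "GiB",
}
assert peak_memory["minValue"] <= peak_memory["value"] <= peak_memory["maxValue"]
```
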

stain commented 1 year ago

for unitCode we can use QUDT:

jmfernandez commented 1 year ago

As @dgarijo has suggested in the TC chat, we could also allow pointing to Wikidata terms

kinow commented 7 months ago

From today's meeting: https://schema.org/Observation, something that might be useful for cases where you have metrics that do not represent computational resources like CPU or memory, but are still directly related.


In Autosubmit we have metrics in the prospective provenance (workflow configuration) that tell us how many nodes, how much memory, CPU per task, etc. we will use. These resource values could exist in either an Autosubmit namespace or a Slurm namespace (or both). PyCOMPSs probably uses the same Slurm resources at some point, though I'm not sure if that's available to external users before/after running the workflow.

When an Autosubmit workflow is executed, it reads the resource usage indicated in the configuration and executes on the HPC. It may use the requested resources, or fewer. So we are able to get the number of resources actually used later (even resource types beyond what we specified, like the energy consumed, from Slurm metrics).

But the performance of climate models is not assessed only in terms of CPUs, memory and disk used. There are other metrics, like these from the CMIP (Coupled Model Intercomparison Project):

[image: table of CMIP model performance metrics]

I will have a look to see if we can map that, from the provenance traces/logs/files, with the schema.org Observation. Ideally users would be able to visualize both the computational resource usage and the performance of the model (in terms of these metrics), all from reading the workflow metadata.

Thanks for the tips!!

rsirvent commented 7 months ago

As promised, taking as example https://workflowhub.eu/workflows/663 :

{
 "cloud": {},
 "resources": {
  "s01r1b41-ib0": {
   "implementations": {
    "accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
     "maxTime": 30,
     "executions": 10,
     "avgTime": 7,
     "minTime": 6
    },
    "initPointsFrag(INT_T,INT_T)kmeans.KMeans": {
     "maxTime": 123,
     "executions": 2,
     "avgTime": 119,
     "minTime": 116
    },
    "computeNewLocalClusters(INT_T,INT_T,OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
     "maxTime": 33,
     "executions": 20,
     "avgTime": 11,
     "minTime": 9
    }
   }
  }
 },
 "implementations": {
  "computeNewLocalClusters(INT_T,INT_T,OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
   "maxTime": 33,
   "executions": 20,
   "avgTime": 11,
   "minTime": 9
  },
  "initPointsFrag(INT_T,INT_T)kmeans.KMeans": {
   "maxTime": 123,
   "executions": 2,
   "avgTime": 119,
   "minTime": 116
  },
  "accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
   "maxTime": 30,
   "executions": 10,
   "avgTime": 7,
   "minTime": 6
  }
 }
}

The first part is statistics per resource (s01r1b41-ib0, the only worker used in this run) and per method (accumulate, initPointsFrag and computeNewLocalClusters). The final part is the global statistics, aggregated over all resources (in this case they are the same, since only one worker was used).

So, as a test, the last piece:

"accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
   "maxTime": 30,
   "executions": 10,
   "avgTime": 7,
   "minTime": 6
  }

Could be represented as:

{
    "@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "maxTime",
    "value": 30,
    "unitText": "ms"
},
{
    "@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "executions",
    "value": 10,
    "unitText": "times"
},
{
    "@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "avgTime",
    "value": 7,
    "unitText": "ms"
},
{
    "@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "minTime",
    "value": 6,
    "unitText": "ms"
}

I don't know how to use the "propertyID" term.

Let me know how it looks so far.

simleo commented 7 months ago

@rsirvent the propertyID entries should point to unique URLs representing resource identifiers specific to COMPSs. For instance, you could create a new compss namespace in ro-terms. Using that, and QUDT for units, the result would be:

{
    "@id": "#COMPSs_Workflow_Run_Crate_marenostrum4_SLURM_JOB_ID_30650595",
    "@type": "CreateAction",
    "resourceUsage": [
        {"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
        {"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
        {"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
        {"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"}
    ],
    ...
},
{
    "@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "maxTime",
    "propertyID": "https://w3id.org/ro/terms/compss#maxTime",
    "unitCode": "https://qudt.org/vocab/unit/MilliSEC",
    "value": "30"
},
{
    "@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "executions",
    "propertyID": "https://w3id.org/ro/terms/compss#executions",
    "value": "10"
},
{
    "@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "avgTime",
    "propertyID": "https://w3id.org/ro/terms/compss#avgTime",
    "value": "7",
    "unitCode": "https://qudt.org/vocab/unit/MilliSEC"
},
{
    "@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
    "@type": "PropertyValue",
    "name": "minTime",
    "propertyID": "https://w3id.org/ro/terms/compss#minTime",
    "value": "6",
    "unitCode": "https://qudt.org/vocab/unit/MilliSEC"
}

Note that executions does not have a unitCode since it's dimensionless (the same would apply to percentages).
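
As a sketch, the mapping from an App_Profile.json block to these entities could be automated. Plain Python; the compss ro-terms namespace and the QUDT unit IRI follow the example above, while the choice of which keys are millisecond timings is an assumption based on their names:

```python
# Sketch: generate PropertyValue entities (with propertyID and unitCode)
# from one method's stats in a COMPSs App_Profile.json.
TIME_KEYS = {"maxTime", "avgTime", "minTime"}  # assumed to be in milliseconds
MS = "https://qudt.org/vocab/unit/MilliSEC"
NS = "https://w3id.org/ro/terms/compss#"      # hypothetical compss namespace

def profile_to_entities(method: str, stats: dict) -> list:
    """Return one PropertyValue entity per key in the method's stats."""
    entities = []
    for key, value in stats.items():
        entity = {
            "@id": f"#{key}_{method}",
            "@type": "PropertyValue",
            "name": key,
            "propertyID": NS + key,
            "value": str(value),
        }
        if key in TIME_KEYS:
            entity["unitCode"] = MS  # counts like "executions" stay unitless
        entities.append(entity)
    return entities

stats = {"maxTime": 30, "executions": 10, "avgTime": 7, "minTime": 6}
ents = profile_to_entities(
    "accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans", stats)
assert ents[0]["unitCode"] == MS
assert "unitCode" not in ents[1]
```
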

stain commented 3 months ago

https://github.com/ResearchObject/workflow-run-crate/pull/60 was merged as a Nextflow example.

We need to formulate some text for the pages on how to do this with PropertyValue, documenting propertyID and unitCode.

This relates more to Provenance Run Crate, as it is probably better suited to describing a particular process. But some statistics could make sense for the overall run as well as for the engine execution, though those could otherwise be difficult to aggregate at the workflow level.