Closed: simleo closed this issue 2 months ago.
First thing is knowing what we're going to model. What does each workflow manager already provide? What could be added that's missing, and how hard would it be to do that?
More generally, we need to know about any information / logging data (and how structured it is) provided by the framework that runs the workflow, not just resource usage: timestamps, user info, containerization, etc. Another useful categorization is what's available by default and what needs to be explicitly activated (e.g., user info in cwltool).
cwltool tracks peak memory usage, and start & stop times for jobs & steps. We don't currently track peak disk usage or CPU time (but both could be added).
Discussed with @ilveroluca earlier this morning. One source of confusion was my suggestion of memoryRequirements or storageRequirements for the property names, which was not consistent with the question "How much memory/cpu/disk was used in run". I have now removed that suggestion from the issue's description.
That's not to say that such indications are not useful: they are quite useful to those who want to reproduce the run, since they allow planning ahead, but they've got nothing to do with what happened during the run. Such indications don't come from the observation of a single run, but rather from the experience (or -- even better -- statistics) of the author(s) or anyone who's worked with the application in various scenarios. They are part of prospective, not retrospective, provenance, so we should expect them to come from the workflow's author / maintainer. Indeed, CWL has ResourceRequirement for this purpose. I've now opened #32 to track this.
Here, instead, we are focusing on resource usage information for the specific run described in the crate, such as the peak memory usage mentioned by Michael. This isn't just useful to enrich the metadata about the run: it might be the only hint available in all cases where requirements as discussed above are not available (which I expect to be the majority: even when they are known, the authors might not take the time to provide them); additionally, with a sufficiently large number of runs, it could be used to get a good estimate of the general requirements.
> cwltool tracks peak memory usage
Where is that recorded? Is it available from the CWLProv output?
About Nextflow: when the trace option is enabled, detailed statistics about memory and CPU usage are gathered for each step of the execution. The monitoring is done from the shell script created to execute the workflow step. As this script is written in Bash and also depends on additional tools for the detailed monitoring (mainly ps, grep and awk), not all container instances allow this detailed statistics gathering.
This is the information usually gathered for each executed step:
nextflow.trace/v2
realtime=1361
%cpu=1380
rchar=19914333
wchar=791
syscr=1252
syscw=30
read_bytes=0
write_bytes=4096
%mem=4
vmem=524224
rss=69416
peak_vmem=568880
peak_rss=112264
vol_ctxt=6
inv_ctxt=444
Sapporo is a Workflow Execution System (WES), so it calls workflow engines (e.g., cwltool and Nextflow) internally. Here, Sapporo and the workflow engines all run as Docker containers. Because Sapporo does not use the Docker API (we wanted to support runtimes other than Docker), it was difficult to get detailed resource information...
The RO-Crate generated by Sapporo stores the following information (about the Sapporo container):
os
osVersion
cpuArchitecture
cpuCount
totalMemory
freeDiskSpace
uid
git
inDocker
This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L697 . https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L470 is an example entity in a Sapporo-generated RO-Crate.
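As a rough illustrative sketch (not the actual Sapporo output -- the `@id`, the entity type and all values here are invented; see the linked code and example entity for the real shape), the listed keys could surface in the JSON-LD graph along these lines:

```json
{
  "@id": "#sapporo-host-environment",
  "@type": "CreativeWork",
  "name": "Sapporo host environment",
  "os": "Linux",
  "osVersion": "5.15.0",
  "cpuArchitecture": "x86_64",
  "cpuCount": 8,
  "totalMemory": 33518178304,
  "freeDiskSpace": 107374182400,
  "uid": 1000,
  "inDocker": true
}
```

Note that keys like os, uid and inDocker are not schema.org properties, so they would need to be defined as custom terms for the crate to be valid.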
We also add the log (or provenance) of the WES (Sapporo):
cmd (execution command of the workflow engine)
start_time
end_time
run_pid
state (WES state, e.g., COMPLETE, EXECUTION_ERROR, etc.)
stdout (from the workflow engine)
stderr (from the workflow engine)
This information is originally stored in Sapporo as run_dir (https://github.com/sapporo-wes/sapporo-service#run-dir). https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L632 is an example entity in a Sapporo-generated RO-Crate.
Furthermore, we collect information about each file (input and output files):
contentSize
dateModified
uid
gid
mode
lineCount
sha512
encodingFormat
This information is generated in https://github.com/sapporo-wes/sapporo-service/blob/856196864e8ccda8c71bef12e5dcf7d5becb21e6/sapporo/ro_crate.py#L318 . https://github.com/sapporo-wes/tonkaz/blob/a1e4c94c439d6a1d3f9947b519177196f77f1c95/tests/example_crate/trimming.json#L243 is an example entity in a Sapporo-generated RO-Crate.
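Purely as a sketch of how those per-file keys could attach to a File data entity (the file name and all values are made up; contentSize, dateModified and encodingFormat are standard schema.org properties, while sha512, uid, gid, mode and lineCount would need custom term definitions):

```json
{
  "@id": "outputs/trimmed_1P.fastq.gz",
  "@type": "File",
  "contentSize": 1048576,
  "dateModified": "2023-01-20T12:34:56Z",
  "encodingFormat": "application/gzip",
  "uid": 1000,
  "gid": 1000,
  "mode": "0644",
  "lineCount": 40000,
  "sha512": "..."
}
```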
From @rsirvent: example of basic statistics with COMPSs: https://workflowhub.eu/workflows/386 -> App_Profile.json
Sorry I missed this thread. Let me elaborate a bit more on what we provide with COMPSs. We have some ways to gather statistics / understand resource usage.
Also, more general information about the resources (how many cores, how much memory, etc. they have) can be found in the COMPSs XML configuration files (resources.xml) (see: https://compss-doc.readthedocs.io/en/stable/Sections/01_Installation/06_Configuration_files.html).
Hope it helps.
One of the main challenges here is providing guidelines that make sense for a wide variety of operating systems and workflow engines. This means we cannot go too deep into details. For instance, Nextflow tracing provides details such as virtual memory vs resident set, but this distinction might not apply to all systems and / or be made by all workflow engines. cwltool and Arvados, for instance, simply give "max memory used", which is sufficiently general.
The Nextflow tracing example shows that this kind of information can be very detailed; for generality and simplicity, I think we should focus on the most important bits, especially for the first release of the profiles. We could represent the following:
Schema.org has things like memoryRequirements for SoftwareApplication, but these are minimum requirements to run the application. We need to describe actual resource usage and tie it to the actions. I could not find anything for this in the current RO-Crate context, so we probably need new terms. I've searched for ontologies for inspiration, but could not find much (e.g. the WICUS hardware specs seem also focused on application requirements).
The simplest approach is to add the properties directly to the action, for instance:
{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"gpuUsage": "70.3%",
"usedCpus": "2",
"usedGpus": "1",
...
}
With CPU / GPU details:
{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"gpuUsage": "70.3%",
"usedCpus": [{"@id": "#cpu-1"}, {"@id": "#cpu-2"}]
"usedGpus": {"@id": "#gpu-1"},
...
},
{
"@id": "#cpu-1",
"@type": "HardwareComponent",
"model": "FooBar 314Pro",
...
},
{
"@id": "#cpu-1",
"@type": "HardwareComponent",
"model": "FooBar XG666",
...
}
Autosubmit (to support RO-Crate soon) keeps track of some variables like memory, CPU and disk that would fit the model discussed so far, I think... but there are other metrics reported that I am not sure would fit in the resource usage here (maybe they'd be reported somewhere else in the RO-Crate archive?).
I think a more flexible approach, allowing custom values to be added, would be useful, from what I understood about the topic so far (still getting familiar with RO-Crate, how to implement it, etc. -- sorry).
-Bruno
> I think a more flexible approach, allowing for custom values to be added would be useful
Several group members were of the same idea at yesterday's meeting, with doubts expressed on the addition of "fixed" properties that might fit the descriptions given by the various engines / systems poorly. Since this will require substantially more thought, I'm removing this issue from the 0.1 milestone.
> I'm removing this issue from the 0.1 milestone
What we can easily add for the 0.1 release is a recommendation to add engine-specific logs, reports, traces, etc. to the crate. They can be tied to the corresponding actions easily via about. Example:
{
"@id": "#action-1",
"@type": "CreateAction",
...
},
{
"@id": "trace-20230120-40360336.txt",
"@type": "File",
"name": "Nextflow trace for action-1",
"conformsTo": "https://www.nextflow.io/docs/latest/tracing.html#trace-report",
"encodingFormat": "text/tab-separated-values",
"about": "#action-1"
}
This is vanilla RO-Crate, so it does not require adding any terms or specific requirements. Moreover, doing this requires very little effort from the crate producer. Having the information there is already quite useful; a future framework for a uniform representation of it would then be an improvement in interoperability.
Great, also declare #trace-report as in https://stain.github.io/ro-crate/1.2-DRAFT/data-entities#file-format-profiles:
{
"@id": "https://www.nextflow.io/docs/latest/tracing.html#trace-report",
"@type": "CreativeWork",
"name": "Nextflow trace report CSV profile"
}
Now that 0.1 is out, recapping the latest discussions on the next steps, the general idea is to use a system based on key-value pairs. So this example:
{
"@id": "#action-1",
"@type": "CreateAction",
"memoryUsage": "8.43GB",
"cpuUsage": "140.2%",
"usedCpus": "2"
}
could become something like:
{
"@id": "#action-1",
"@type": "CreateAction",
"resourceUsage": [
{"@id": "#action-1-memory"},
{"@id": "#action-1-cpu"},
{"@id": "#action-1-nCpu"},
]
},
{
"@id": "#action-1-memory",
"@type": "PropertyValue",
"name": "memory",
"value": "8.43GB",
},
{
"@id": "#action-1-cpu",
"@type": "PropertyValue",
"name": "cpu",
"value": "140.2%",
},
{
"@id": "#action-1-nCpu",
"@type": "PropertyValue",
"name": "nCpu",
"value": "2",
}
Note that ids like #action-1-memory are not important; they could be UUIDs: the role of the key in PropertyValue is played by name.
One advantage of this is that it requires just one extra term to be defined, resourceUsage. The other main advantage, as we discussed, is that this is extensible by RO-Crate producers, e.g., workflow engines. For instance, Autosubmit could add "energy" or "sypd", and cwltool "peakMemory". The meaning of the corresponding values would depend on the workflow engine, so consumers should be able to determine that information. Looking at the workflow language in the prospective part is not enough, since there can be multiple engines implementing the same language. We need a conformsTo, which needs a type to be attached to. For instance:
{
"@id": "#action-1",
"@type": "CreateAction",
"resourceUsage": {"@id": "#action-1-ru"}
},
{
"@id": "#action-1-ru",
"@type": "Dataset",
"conformsTo": "http://example.org/cwltool-rocrate-ru-spec",
"variableMeasured": [
{"@id": "#action-1-peakMemory"}
]
},
{
"@id": "#action-1-peakMemory",
"@type": "PropertyValue",
"name": "peakMemory",
"value": "8.43GB"
},
{
"@id": "http://example.org/cwltool-rocrate-ru-spec",
"@type": "CreativeWork",
"name": "cwltool RO-Crate resource usage spec",
"version": "0.1"
}
Note that I've used Dataset, which seems a very good fit since it has variableMeasured, which has PropertyValue in its range. In RO-Crate, Dataset is also used for directories, but they would not be mixed up since this Dataset would not be linked to from the root's hasPart.
Regarding keys:
Should we use namespaces for keys? E.g., cwl.ru.peakMemory? Should we define our own "common" set of basic keys to allow some level of cross-engine crate comparison? E.g., wrroc.ru.memory? Crate producers would be free to add entries for these or not, but if added they would have to be in the expected format (for instance, "memory must mean peak memory and be expressed in bytes" or "memory must mean average memory and be expressed in megabytes", etc.).
Sounds good! I've added a note in our merge request to add RO-Crate about testing to record the metrics somehow into the metadata, preferably with a format like this one (although I might push that to after our merge request is merged).
> Should we use namespaces for keys? E.g., cwl.ru.peakMemory? Should we define our own "common" set of basic keys to allow some level of cross-engine crate comparison? E.g., wrroc.ru.memory? Crate producers would be free to add entries for these or not, but if added they would have to be in the expected format (for instance, "memory must mean peak memory and be expressed in bytes" or "memory must mean average memory and be expressed in megabytes", etc.)
I think we should do both: use namespaces, and also have a common set of keys. But I think this common set of keys must have a really good description of each key. For instance, for cpus or cores, saying something like "number of cores used to run the workflow" can mean different things. For example, for a WMS that uses a remote batch server like Slurm, does cpus mean the cores that I am using to run the WMS process locally, plus the cores used in Slurm? Ditto if we have energy: total energy, or just for running the remote part? If we have one for the disk space used, does it include the logs produced by the WMS to run the workflow, or just the size of the data produced by the workflow (we recently had an issue where a log generated by one of the workflows caused a short interruption of service, but we detected it retroactively), etc.
This way I would be able to use wrroc.ru.cpus or wrroc.ru.cores, but if the meaning of cpus for my WMS differed from the one in the specification, then I'd just go with something like mywms.ru.cpus.
Do you have any idea if this would be available in tools like CWL Viewer and WorkflowHub.eu? For example, if I wanted to search all the workflows that used mywms.ru.cpus greater than 100, could I search for something like mywms.ru.cpus >= 100?
-Bruno
> For example, for a WMS that uses a remote batch server like Slurm, does cpus mean the cores that I am using to run the WMS process locally, plus the cores used in Slurm?

The resource usage dataset is associated with a specific action. In a Provenance Run Crate, where there's an action for each task, one can record each task's resource usage separately. In a Workflow Run Crate you could record the sum of cores used by all tasks (in the action that represents the workflow run), but for things like usage percentages you probably want the average. In Provenance Run Crates, OTOH, if you have per-task resource usage it's probably better not to record resource usage for the whole workflow run. Resources used by the WMS itself should probably be associated with the corresponding OrganizeAction, rather than the workflow run.
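To make that split concrete, here is a hedged sketch (all ids and values are invented, reusing the resourceUsage pattern proposed earlier in this thread) of recording WMS overhead on the OrganizeAction separately from the workflow run's CreateAction:

```json
{
  "@id": "#wms-run",
  "@type": "OrganizeAction",
  "resourceUsage": {"@id": "#wms-run-memory"}
},
{
  "@id": "#wms-run-memory",
  "@type": "PropertyValue",
  "name": "memory",
  "value": "512",
  "unitText": "MB"
},
{
  "@id": "#workflow-run",
  "@type": "CreateAction",
  "resourceUsage": {"@id": "#workflow-run-memory"}
},
{
  "@id": "#workflow-run-memory",
  "@type": "PropertyValue",
  "name": "memory",
  "value": "8.43",
  "unitText": "GB"
}
```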
> Do you have any idea if this would be available in tools like CWL Viewer and WorkflowHub.eu?
They're both focused on prospective provenance, so I think it's unlikely.
I think this is almost there -- but https://schema.org/PropertyValue already has a property propertyID for this purpose, so I think that would work better than conformsTo:
{
"@id": "#action-1-peakMemory",
"@type": "PropertyValue",
"name": "peakMemory",
"propertyID": "https://example.org/cwltool-rocrate-ru-spec#peakMemory",
"value": 8.43,
"unitText": "GiB"
},
(I also added unitText, but if these strings are coming arbitrarily from some engine you may not know the unit, in which case it would be part of the value string.) There is likewise also https://schema.org/unitCode in case the unit has an identifier.
It would still make sense to add https://example.org/cwltool-rocrate-ru-spec (no #) to the ./ root dataset as a conformsTo profile and as a contextual entity, as it will hopefully define what those properties mean -- in the most advanced form as DefinedTerm instances, but probably just mentioned in the HTML.
Would say that propertyID would be SHOULD and unitText MAY.
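A sketch of the root dataset declaration being described (reusing the placeholder spec URL from the earlier example; the second profile entry and the version value are illustrative):

```json
{
  "@id": "./",
  "@type": "Dataset",
  "conformsTo": [
    {"@id": "https://w3id.org/ro/wfrun/workflow/0.1"},
    {"@id": "https://example.org/cwltool-rocrate-ru-spec"}
  ]
},
{
  "@id": "https://example.org/cwltool-rocrate-ru-spec",
  "@type": "CreativeWork",
  "name": "cwltool RO-Crate resource usage spec",
  "version": "0.1"
}
```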
A bit related: similar to what we do in Autosubmit & in our HPC in tracking energy usage, I learned today from an IIIF email about a draft for an HTTP header for carbon emissions: https://www.ietf.org/archive/id/draft-martin-http-carbon-emissions-scope-2-00.html
So I believe this shows that there are groups interested in tracking resources like energy consumption, carbon emission, etc., that are different than the most common ones like memory/cpu/disk :+1:
As of today's TC, I have realized https://schema.org/QuantitativeValue could be helpful to describe resource requirements in both prospective and retrospective provenance, due to its capability to describe the value and also minimum and maximum ones.
As @dgarijo has suggested in the TC chat, we could also allow pointing to Wikidata terms.
From today's meeting: https://schema.org/Observation, something that might be useful for cases where you have metrics that do not represent computational resources like CPU or memory, but are still directly related.
In Autosubmit we have metrics in the prospective provenance (workflow configuration) that tell us how many nodes, how much memory, CPU per task, etc. we will use. These resource values could exist in either an Autosubmit namespace or a Slurm namespace (or both). PyCOMPSs probably uses the same Slurm resources at some point, though I'm not sure if that's available to external users before/after running the workflow.
When an Autosubmit workflow is executed, it will read the resource usage indicated in the configuration, and execute on the HPC. It may use the requested resources, or fewer. So we are able to get the number of resources used later (even resources we didn't specify, like the energy consumed, from Slurm metrics).
But the performance of climate models is not assessed only in terms of CPUs, memory and disk used. There are other metrics, like these ones from the CMIP Project (Coupled Model Intercomparison Project).
I will have a look to see if we can map that, from the provenance traces/logs/files, with the schema.org Observation. Ideally users would be able to visualize both the computational resource usage and the performance of the model (in terms of these metrics), all from reading the workflow metadata.
Thanks for the tips!!
As promised, taking https://workflowhub.eu/workflows/663 as an example:
{
"cloud": {},
"resources": {
"s01r1b41-ib0": {
"implementations": {
"accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
"maxTime": 30,
"executions": 10,
"avgTime": 7,
"minTime": 6
},
"initPointsFrag(INT_T,INT_T)kmeans.KMeans": {
"maxTime": 123,
"executions": 2,
"avgTime": 119,
"minTime": 116
},
"computeNewLocalClusters(INT_T,INT_T,OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
"maxTime": 33,
"executions": 20,
"avgTime": 11,
"minTime": 9
}
}
}
},
"implementations": {
"computeNewLocalClusters(INT_T,INT_T,OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
"maxTime": 33,
"executions": 20,
"avgTime": 11,
"minTime": 9
},
"initPointsFrag(INT_T,INT_T)kmeans.KMeans": {
"maxTime": 123,
"executions": 2,
"avgTime": 119,
"minTime": 116
},
"accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
"maxTime": 30,
"executions": 10,
"avgTime": 7,
"minTime": 6
}
}
}
The first part contains statistics per resource (s01r1b41-ib0, the only worker used in this run) and per method (accumulate, initPointsFrag and computeNewLocalClusters). The final part contains the global statistics, aggregated over all resources (in this case they are the same, since only one worker was used).
So, as a test, the last piece:
"accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans": {
"maxTime": 30,
"executions": 10,
"avgTime": 7,
"minTime": 6
}
Could be represented as:
{
"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "maxTime",
"value": 30,
"unitText": "ms"
},
{
"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "executions",
"value": 10,
"unitText": "times"
},
{
"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "avgTime",
"value": 7,
"unitText": "ms"
},
{
"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "minTime",
"value": 6,
"unitText": "ms"
}
I don't know how to use the "propertyID" term. Let me know how it is going so far.
@rsirvent the propertyID entries should point to unique URLs representing resource identifiers specific to COMPSs. For instance, you could create a new compss namespace in ro-terms. Using that, and QUDT for units, the result would be:
{
"@id": "#COMPSs_Workflow_Run_Crate_marenostrum4_SLURM_JOB_ID_30650595",
"@type": "CreateAction",
"resourceUsage": [
{"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"},
{"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans"}
],
...
},
{
"@id": "#maxTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "maxTime",
"propertyID": "https://w3id.org/ro/terms/compss#maxTime",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC",
"value": "30"
},
{
"@id": "#executions_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "executions",
"propertyID": "https://w3id.org/ro/terms/compss#executions",
"value": "10"
},
{
"@id": "#avgTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "avgTime",
"propertyID": "https://w3id.org/ro/terms/compss#avgTime",
"value": "7",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC"
},
{
"@id": "#minTime_accumulate(OBJECT_T,OBJECT_T,OBJECT_T,OBJECT_T)kmeans.KMeans",
"@type": "PropertyValue",
"name": "minTime",
"propertyID": "https://w3id.org/ro/terms/compss#minTime",
"value": "6",
"unitCode": "https://qudt.org/vocab/unit/MilliSEC"
}
Note that executions does not have a unitCode since it's dimensionless (the same would happen for percentages).
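For instance, a percentage metric would carry its meaning in its name and spec rather than in a unit; a hypothetical entity (the id, name and propertyID URL are invented for illustration, reusing the placeholder spec URL from the earlier cwltool example):

```json
{
  "@id": "#action-1-cpuPercent",
  "@type": "PropertyValue",
  "name": "cpuUsagePercent",
  "propertyID": "https://example.org/cwltool-rocrate-ru-spec#cpuUsagePercent",
  "value": "140.2"
}
```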
https://github.com/ResearchObject/workflow-run-crate/pull/60 was merged as a Nextflow example.
Need to formulate some text for the pages on how to do this with PropertyValue, documenting propertyID and unitCode.
This relates more to Provenance Run Crate, as it's probably better suited to describing a particular process. But some statistics could make sense overall, as well as for the engine execution, though they could otherwise be difficult to aggregate at the workflow level.