Another consideration here is features of the input data, which will inform resource usage in most cases. There may not be a uniform way to collect these features, but we can outline a few specific cases:

- GEMmaker
- HemeLB
- KINC
- kinc dump
Based on these examples, I think we need to define a process that occurs outside of the actual application execution. So if you use tesseract to mine the performance data of your nextflow pipeline, you must define a script for that pipeline that takes an input file and outputs the features for that input file in CSV format. Later on, the trace / conditions data can be augmented with these input features by matching the input filename or using some ID.
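A minimal sketch of what such a per-pipeline script might look like, assuming a tab-delimited matrix input where file size and row/column counts are the features of interest. The feature names and file layout here are illustrative, not part of tesseract:

```python
#!/usr/bin/env python3
# Hypothetical per-pipeline feature script: take one input file,
# write one CSV row of input features keyed by filename.
import argparse
import csv
import os
import sys

def extract_features(path):
    # Example features for a tab-delimited matrix: size on disk, rows, columns.
    n_rows = 0
    n_cols = 0
    with open(path) as f:
        for i, line in enumerate(f):
            if i == 0:
                n_cols = len(line.rstrip('\n').split('\t'))
            n_rows += 1
    return {
        'input_file': os.path.basename(path),
        'bytes': os.path.getsize(path),
        'n_rows': n_rows,
        'n_cols': n_cols,
    }

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file')
    args = parser.parse_args()

    features = extract_features(args.input_file)
    writer = csv.DictWriter(sys.stdout, fieldnames=list(features.keys()))
    writer.writeheader()
    writer.writerow(features)
```

The `input_file` column would then serve as the join key when augmenting the trace.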
For collecting environment features, the environment script provided by SC might be a good start. Appending features to the input conditions seems to work well, so I think we can keep extending that.
In particular, capturing dependency versions would be important. For example, switching from CUDA 9 to CUDA 10 seems to resolve some mysterious issues with the V100s on Palmetto.
Pretty much done. Nextflow pipelines can be annotated with #TRACE directives, and the aggregate script parses the directives from the command script / execution log and appends them to the nextflow trace.
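For reference, a rough sketch of the parsing step, assuming directives appear as `#TRACE key=value` comments in the task's command script. The exact directive syntax and file names in tesseract may differ:

```python
import re

# Hypothetical parser: pull '#TRACE key=value' lines out of a .command.sh
# and return them as a dict that can be merged into a trace record.
TRACE_PATTERN = re.compile(r'#TRACE\s+([^\s=]+)\s*=\s*(\S+)')

def parse_trace_directives(command_script):
    directives = {}
    with open(command_script) as f:
        for line in f:
            m = TRACE_PATTERN.search(line)
            if m:
                directives[m.group(1)] = m.group(2)
    return directives

# Example usage (path is illustrative):
# trace_row.update(parse_trace_directives('work/ab/1234/.command.sh'))
```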
Profilers like gprof / ncu are probably off the table; we should not need this low-level profiling data to achieve 20% relative error. All application-specific input/output features can be captured by #TRACE directives.
As for environment features, the main thing is system performance metrics. Since the outputs we are focused on are runtime, memory usage, and disk usage, the natural system metrics to try are CPU speed, memory speed, and disk speed. Simple benchmarks should be able to capture these, even if only as rough estimates. These metrics can be computed once (or periodically) and appended to any workflow runs where appropriate. In the inference setting, they would be fetched based on the node type that the user selects in a form.
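A rough sketch of what those benchmarks could look like in plain Python, good for order-of-magnitude numbers only. Dedicated tools (e.g. STREAM, fio) would be more accurate, and the buffer sizes and loop counts below are arbitrary:

```python
import os
import tempfile
import time

def cpu_benchmark(n=2_000_000):
    # Rough floating-point operations per second.
    t0 = time.perf_counter()
    x = 0.0
    for i in range(n):
        x += i * 0.5
    return n / (time.perf_counter() - t0)

def memory_benchmark(size=64 * 1024 * 1024):
    # Rough bytes copied per second for an in-memory buffer.
    buf = bytearray(size)
    t0 = time.perf_counter()
    _ = bytes(buf)
    return size / (time.perf_counter() - t0)

def disk_benchmark(size=64 * 1024 * 1024):
    # Rough bytes written and flushed per second to a temporary file.
    data = os.urandom(size)
    t0 = time.perf_counter()
    with tempfile.NamedTemporaryFile() as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return size / (time.perf_counter() - t0)

if __name__ == '__main__':
    print('cpu_ops_per_sec,mem_bytes_per_sec,disk_bytes_per_sec')
    print(f'{cpu_benchmark():.0f},{memory_benchmark():.0f},{disk_benchmark():.0f}')
```

The CSV output could then be appended to the input conditions for every run on that node type.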
Software dependencies do not seem feasible to include as input features; they would all be categorical variables, and realistically they wouldn't change very often. We could probably address issues like the CUDA version problem from the anomaly detection perspective.
Aside from the nextflow trace, there are a number of application-specific features that could be used for resource prediction but may require some tooling to extract. We may need to develop a schema or process for extracting these features and incorporating them with the nextflow trace.
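As a first pass, the "incorporating" step could be a plain join between the trace and the per-input features on the input filename. A minimal sketch with pandas, assuming the trace has been given an `input_file` column (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical merge: the nextflow trace (TSV) augmented with per-input
# features (CSV) produced by the pipeline's feature script, joined on
# the input filename or some shared ID column.
trace = pd.read_csv('trace.txt', sep='\t')
features = pd.read_csv('input_features.csv')

augmented = trace.merge(features, on='input_file', how='left')
augmented.to_csv('trace_augmented.csv', index=False)
```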