awslabs / amazon-omics-tools

Apache License 2.0

Adding a script to collapse repetitive tasks. #33

Open jwarnn opened 4 months ago

jwarnn commented 4 months ago

When the same task runs on different data, it would be more useful to see averages across all related task instances. A simple script like the one below would work, or a flag could be added to omics-run-analyzer.py that combines all similar tasks using similar code.

import sys

import pandas as pd

in_file = sys.argv[1]
out_file = sys.argv[2]

# Per-task metrics reported by omics-run-analyzer
metrics = ['runningSeconds', 'cpuUtilization', 'memoryUtilization',
           'cpusReserved', 'cpusMaximum', 'cpusAverage',
           'memoryReservedGiB', 'memoryMaximumGiB', 'memoryAverageGiB']

run = pd.read_csv(in_file, usecols=['name'] + metrics)

# Drop the first row, which describes the run rather than a task
run = run.drop([0])

# Collapse each task name to its base name (the first whitespace-delimited
# token), then average each metric across all instances of that base task
run['task'] = run['name'].str.split().str[0]
tasks_df = run.groupby('task')[metrics].mean()

tasks_df.to_csv(out_file)
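
Usage would be something like the following (file names are placeholders, and the input is assumed to be the per-task CSV produced by omics-run-analyzer):

python collapse_tasks.py run_metrics.csv task_averages.csv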
wleepang commented 4 months ago

@jwarnn - I'm not sure I understand the use case fully.

Say a workflow uses a scatter-gather pattern. Are you asking for aggregate statistics across the shards of the scatter? If so, I'm curious about the utility of such statistics when the shards operate on fairly independent data - e.g. one shard per chromosome.

jwarnn commented 4 months ago

The workflows we operate are sample-level and work on sequencing reads (FASTQ files). For each sample there is a scatter-gather pattern for most of the bioinformatics tools run on that sample's sequencing data, along with some chaining of tools where one relies on the output of another.

Put another way: roughly 20 separate and/or connected tasks are run on, say, 40 unique datasets of the same type (FASTQ files in our case), so each task runs 40 times on different data. Aggregating the data for the same task, and also looking at the maximum, gives me a clearer idea of how resource allocation is working. I can then go into my workflow definition and adjust the requested resources in the nextflow.config file or a similar file. So some type of argument telling the script to aggregate the data would be helpful for users whose workflows operate in a similar fashion.
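
For context, the resource adjustment would then be a per-process directive in nextflow.config. A minimal sketch, assuming a hypothetical process name (STAR_ALIGN) and values read off the aggregated per-task maximums:

// Minimal nextflow.config sketch (hypothetical process name and values)
process {
    withName: 'STAR_ALIGN' {
        cpus   = 8        // e.g. sized from the aggregated cpusMaximum
        memory = '32 GB'  // e.g. sized above the aggregated memoryMaximumGiB
    }
}

The numbers themselves would come from columns like cpusMaximum and memoryMaximumGiB that the script above aggregates.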