Quansight-Labs / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Create a simple text format that can represent a workflow and be converted to a Dask DAG #1

Open jerowe opened 4 years ago

jerowe commented 4 years ago

We need a parser that takes as its input a text file with bash commands and creates a Dask HighLevelGraph.

Jobs are normally grouped together by analysis type (QC, filter, preprocess, align) and/or compute resources.

Simple example with job-only dependencies. In this example, all tasks in job_1 must complete before any task in job_2 can start.

#HPC jobname=job_1
#HPC walltime=10:00:00
#HPC mem=32GB
#HPC cpus_per_task=1
job_1_task_1
job_1_task_2
job_1_task_3
job_1_task_4

#HPC jobname=job_2
#HPC walltime=10:00:00
#HPC mem=32GB
#HPC cpus_per_task=1
#HPC deps=job_1
job_2_task_1
job_2_task_2
job_2_task_3
job_2_task_4

Example with Task Dependencies

I think Dharhas would have some really good insight into how this should be done.

When I submit something like this on SLURM, I use job array dependencies. Essentially, instead of submitting a single job N times, you submit a job once that is itself a for loop executing N times.

Jobs can depend on jobs, and tasks within the job can depend upon tasks in another job.

I don't think that this needs to be translated specifically into job arrays, but can be handled by the Dask Scheduler itself.

#HPC jobname=job_1
#HPC walltime=10:00:00
#HPC mem=32GB
#HPC cpus_per_task=1

#HPC task_tags=task_1
job_1_task_1
#HPC task_tags=task_2
job_1_task_2
#HPC task_tags=task_3
job_1_task_3
#HPC task_tags=task_4
job_1_task_4

#HPC jobname=job_2
#HPC deps=job_1
#HPC walltime=10:00:00
#HPC mem=32GB
#HPC cpus_per_task=1

#HPC task_tags=task_1
job_2_task_1
#HPC task_tags=task_2
job_2_task_2
#HPC task_tags=task_3
job_2_task_3
#HPC task_tags=task_4
job_2_task_4

#HPC jobname=job_3
#HPC deps=job_2
#HPC walltime=10:00:00
#HPC mem=32GB
#HPC cpus_per_task=1

job_3_task_1
job_3_task_2
job_3_task_3
job_3_task_4
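
For concreteness, a rough sketch of how the #HPC directive format above might be parsed (function and key names are placeholders, not a settled design):

def parse_hpc_file(path):
    # Sketch only: collect "#HPC key=value" directives and bare command lines into
    # {jobname: {"directives": {...}, "tasks": [...]}}. Per-task tags
    # (#HPC task_tags=...) are folded into the job directives here and would need
    # extra handling.
    jobs = {}
    current = None
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("#HPC"):
                key, _, value = line[len("#HPC"):].strip().partition("=")
                if key == "jobname":
                    current = jobs.setdefault(value, {"directives": {}, "tasks": []})
                elif current is not None:
                    current["directives"][key] = value
            elif current is not None:
                current["tasks"].append(line)
    return jobs
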
dharhas commented 4 years ago

So the above essentially defines 3 separate job submissions to SLURM, each for 10 hrs.

Options:

a) Do exactly the above. SLURM can already handle this, so why do we need Dask?

b) Use SLURM and Dask. Use SLURM to manage the inter-job dependencies; each job will launch a Dask cluster and run the intra-job tasks in parallel. Not sure this has much benefit.

c) Only use SLURM to launch a Dask cluster of sufficient size and duration. Handle everything else, including dependencies, through the Dask scheduler.

To me, (c) makes the most sense and is the most portable to other systems. The only issue is if you have wallclock limitations. In that case, looking at this issue might make sense (essentially, keep starting new workers as old ones expire when they hit their wallclock limit):

https://github.com/dask/dask-jobqueue/issues/122

Also, if we decide on (c), we can consider dropping the format listed above, which is very SLURM-specific, and use a nicer format to describe dependencies.

i.e. we separate the definition of resources from the definition of the workflow.

jerowe commented 4 years ago

@dharhas I like option (c) the best too. I think if we can keep things platform-agnostic (HPC, Kubernetes, ECS), we should.

I'm all for a new syntax. In a perfect world, I think this could get split into 2 components:

  1. A very simple syntax to define jobs / deps / resources, aimed at people who aren't so computational; it would just take care of things for them.
  2. A nicer developer API that is super shiny and nice
Adam-D-Lewis commented 4 years ago

So if I understand correctly, it seems like the workflow file wouldn't need #HPC walltime=10:00:00, #HPC mem=32GB, or #HPC cpus_per_task=1, since those details will be specified separately, right?

While we still might change the workflow file format (see https://github.com/Quansight-Labs/dask-jobqueue/issues/3), I want to see if I'm on the same page as others as to how the dask graph will look. My question is at the bottom.

Workflow File Example

Workflow File

#HPC jobname=job0
#HPC task_tags=task_0
job0_0_bash_script_string
#HPC task_tags=task_1
job0_1_bash_script_string

#HPC jobname=job1
#HPC deps=job0

#HPC task_tags=task_0
job1_0_bash_script_string

#HPC jobname=job2
#HPC deps=job1
job2_0_bash_script_string
job2_1_bash_script_string
job2_2_bash_script_string

Corresponding Dask High Level Graph

import subprocess
from dask.highlevelgraph import HighLevelGraph

layers = {
    'job0': {
             ('job0', 0): (subprocess.run, job0_0_bash_script_string),
             ('job0', 1): (subprocess.run, job0_1_bash_script_string),
            },

    'job1': {
             ('job1', 0): (subprocess.run, job1_0_bash_script_string),
            },

    'job2': {
             ('job2', 0): (subprocess.run, job2_0_bash_script_string),
             ('job2', 1): (subprocess.run, job2_1_bash_script_string),
             ('job2', 2): (subprocess.run, job2_2_bash_script_string),
            },
}

dependencies = {'job0': set(),
                'job1': {'job0'},
                'job2': {'job1'}}

graph = HighLevelGraph(layers, dependencies)

Is this how others imagine the Dask graphs looking, with subprocess.run called on all the shell command strings?
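
One small detail worth flagging: since each task is a whole shell command string rather than an argv list, the callable would need shell=True (or the string split first). A rough sketch of what each graph entry might point at instead of bare subprocess.run:

import subprocess

def run_shell(command):
    # Sketch only: run one shell command string. shell=True because the task is a
    # full command line; check=True so failures surface to the scheduler.
    return subprocess.run(command, shell=True, check=True,
                          capture_output=True, text=True)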

dharhas commented 4 years ago

Rename this issue (See https://github.com/Quansight-Labs/dask-jobqueue/issues/4)

I think we should rename to something like Create a simple text format that can represent a workflow and be converted to a Dask DAG.

dharhas commented 4 years ago

@jerowe would using a YAML-style config similar to this be too high a barrier for our target audience? The advantages are parsing, error checking, etc.:

example_dag1:
  default_args:
    owner: 'example_owner'
    start_date: 2018-01-01
  schedule_interval: '0 3 * * *'
  description: 'this is an example dag!'
  tasks:
    task_1:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: 'echo 1'
    task_2:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: 'echo 2'
      dependencies: [task_1]
    task_3:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: 'echo 3'
      dependencies: [task_1]

We wouldn't need all the fields & complexity, but using something like YAML avoids having to build a parser. Plus you get syntax highlighting etc. when you build it, and we could easily build a linter to check the DAG and display it. Your earlier example would end up being:

my_job_name:
  resources:
     cpu: 10
     queue: normal 
  tasks:
    read_data:
      bash_commands: 
        - bash_script_string_1
        - bash_script_string_2
    process_data:
      bash_commands:
         - bash_script_string_3
      dependencies: [read_data]
    plot_data:
      bash_commands:
         - bash_script_string_4
      dependencies: [process_data]
      description: plots the data and saves it to disk
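
For what it's worth, a rough sketch of reading the example above with PyYAML (the file name is hypothetical):

import yaml  # PyYAML

with open("workflow.yml") as fh:
    spec = yaml.safe_load(fh)

job = spec["my_job_name"]
resources = job["resources"]                  # {'cpu': 10, 'queue': 'normal'}
tasks = job["tasks"]                          # task name -> definition
deps = {name: set(t.get("dependencies", []))  # {'process_data': {'read_data'}, ...}
        for name, t in tasks.items()}
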
jerowe commented 4 years ago

@balast thanks!

I totally understand the advantages of using a YAML file over plain text. YAML is tricky for people with no computational background because they will try to do things like open the file in Word or Google Docs, and then the whitespace gets all messed up.

Would it be possible to have something that could read .ini AND .yaml files? .ini is easier for non-techy folks because there are no weird whitespace issues.

Secondly, I know I included a link to Airflow in that doc and probably confused the issue. I didn't mean that we should use Airflow to create DAGs; I just like their bash executor because it buffers the output. Some programs create a LONG stream of stdout/stderr.

dharhas commented 4 years ago

@jerowe

The whitespace problem is real for folks using non-YAML-aware editors. .ini has its drawbacks but is doable. How about TOML (https://github.com/toml-lang/toml)? It's the new format being used by Python setup files and is a nice mix of readability, and whitespace/indentation is optional. It kinda takes the best parts of YAML and INI.

example:

# This is a TOML document.

title = "TOML Example"

[owner]
name = "Tom Preston-Werner"
dob = 1979-05-27T07:32:00-08:00 # First class dates

[database]
server = "192.168.1.1"
ports = [ 8001, 8001, 8002 ]
connection_max = 5000
enabled = true

[servers]

  # Indentation (tabs and/or spaces) is allowed but not required
  [servers.alpha]
  ip = "10.0.0.1"
  dc = "eqdc10"

  [servers.beta]
  ip = "10.0.0.2"
  dc = "eqdc10"

[clients]
data = [ ["gamma", "delta"], [1, 2] ]

# Line breaks are OK when inside arrays
hosts = [
  "alpha",
  "omega"
]

Regarding the BashWrapper: buffering the output is a good reason to use one. My point was that these are two unrelated issues.

issue a) Define a workflow with a text file
issue b) Have a nice way of calling a bash cmd

Adam-D-Lewis commented 4 years ago

I'm copying over from https://github.com/Quansight-Labs/dask-jobqueue/issues/3#issuecomment-604429704 since that issue was closed as a duplicate.

From @jerowe:


I think something like this would be more natural to HPC users -

- jobs
  - name: job_1
  - resources:
# I got this from the Dask JobQueue docs on how this works
#        cluster = PBSCluster(  # <-- scheduler started here
#          cores=24,
#          memory='100GB',
#          shebang='#!/usr/bin/env zsh',  # default is bash
#          processes=6,
#          local_directory='$TMPDIR',
#          resource_spec='select=1:ncpus=24:mem=100GB',
#          queue='regular',
#          project='my-project',
#          walltime='02:00:00',
#          )
      - mem: 100GB
      - processes: 6
      - walltime: '02:00:00'
      - queue: 'regular'
      # etc
  - tasks:
      - shell: echo "hello world from task 1"
      - shell: echo "hello world from task 2"
  - name: job_2
  - depends:
      - job_1
  - tasks:
      - shell: echo "hello world from task 1"
      - shell: echo "hello world from task 2"

I think this page is the most straightforward for a translation of JobQueue -> Dask Worker.

Then I think the dask-worker process is smart enough to figure out that it has 6 processes and run the 6 tasks concurrently.

I do really like the idea of having the shell there. If there were interest, there could be Python executors in there as well.

Adam-D-Lewis commented 4 years ago

I'm copying over from https://github.com/Quansight-Labs/dask-jobqueue/issues/3#issuecomment-604458486 since that issue was closed as a duplicate.
From @dharhas:


So YAML is easy to read but you need to be whitespace-aware; TOML and INI are harder to read and construct but are not whitespace-sensitive. JSON is hard to read and has lots of brackets you have to match correctly.

@jerowe in your jobqueue code above you have 6 processes with 4 threads each for a total of 24 tasks that can happen at the same time. Depending on your workload type the performance will change depending on whether you use more threads (i.e. 1 process/24 threads) or more processes (i.e. 24 processes each with 1 thread). Since we are just calling bash scripts I don't think it will matter.

Adam-D-Lewis commented 4 years ago

I've attached an example toml workflow file. It seems like it would be a bit cumbersome to allow tasks to be dependent on other tasks in the same job. Is that necessary? Comments?

# Workflow File

[resources]
cluster_type = "SLURMCluster" # Could be any cluster type supported by Job Queue, but obviously must match the Workload Manager used by the HPC Cluster.
n_workers = 10

[resources.worker]
# per worker params (any of those supported by that cluster type in Jobqueue)
cores = 8
processes = 4
memory = "16GB"
project = "MyProject"
Walltime = "01:00:00"
queue = "normal"

[job1]
[job1.tasks]
task1="bash_script_string_1"
task2="bash_script_string_2"
task3="bash_script_string_3"
task4="bash_script_string_4"

[job2]
depends = ["job1", "anotherjob"]
[job2.tasks]
task1="bash_script_string_5"
task2="bash_script_string_6"
task3="bash_script_string_7"
task4="bash_script_string_8"
jerowe commented 4 years ago
[job2.tasks]
task1="bash_script_string_5"
task2="bash_script_string_6"
task3="bash_script_string_7"
task4="bash_script_string_8"

Do the tasks need to have a name? Is that so it can be more easily read in by the toml parser?

It would be difficult for some researchers to deal with quotes, particularly if they have a command that then has quotes in it. Would they need to escape the quotes inside and/or know the difference between " and ' ?

Something like this would be ideal, but I understand that getting those commands read into an array would be tricky and require additional parsing, which is another thing to maintain. Thoughts?

[job2.tasks]
bash_script_string_5
# This is one bash command
bash_script_string_6 \
   -with-some-args
bash_script_string_7
bash_script_string_8

It seems like it would be a bit cumbersome to allow tasks to be dependent on other tasks in the same job.

Nope, it's not necessary to have tasks depend upon other tasks in a job, but it would be very nice if tasks could depend upon tasks in another job. That would be a nice-to-have, not a necessity.

In this format, could resources.worker be both a default parameter and a per-job one?

[job1]
[resources.worker]
cores = 8
processes = 4
memory = "56GB"
project = "MyOtherProject"
walltime = "01:00:00"
queue = "normal"
[job1.tasks]
bash_script_string_5
bash_script_string_6 \
   -with-some-args
bash_script_string_7
bash_script_string_8

Thanks! I think this is really coming along!

jerowe commented 4 years ago

Is there any utility in dask to get logs per task? Some of these tasks will produce a whole lot of stdout/stderr, and having a log per task would be very helpful.

some_defined_log_dir/submission_time/job_id/task_1_log.txt
some_defined_log_dir/submission_time/job_id/task_2_log.txt

If this is too much of a pain, slate it for the nice-to-have category.
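
A rough sketch of a task wrapper that would produce a layout like the one above (all names are hypothetical; this is not an existing Dask utility):

import subprocess
from pathlib import Path

def run_task_with_log(command, log_dir, job_id, task_name):
    # Sketch only: run the command and send combined stdout/stderr to a
    # per-task log file under log_dir/job_id/.
    log_path = Path(log_dir) / job_id / f"{task_name}_log.txt"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        result = subprocess.run(command, shell=True,
                                stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode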

dharhas commented 4 years ago

Actually, it looks like the .ini/.cfg format might give you what you need. We would have to do some after-the-fact parsing, but it would fit your input needs; i.e. we would need to split strings on commas, \n, etc., and convert strings to ints/floats.

Two options. The first requires that we put each shell command on a single line. We also have to indent the tasks, but the number of spaces, tabs, and alignment doesn't matter; they just need to be indented.

# Workflow File

[resources]
cluster_type = SLURMCluster # Could be any cluster type supported by Job Queue, but obviously must match the Workload Manager used by the HPC Cluster.
n_workers = 10
# per worker params (any of those supported by that cluster type in Jobqueue)
cores = 8
processes = 4
memory = 16GB
project = MyProject
Walltime = 01:00:00
queue = normal

[job1]
tasks =
  echo "my name is dharhas" 
  mv '~/Downloads/SYNTH (1).zip' SYNTH.zip 

[job2]
depends = job1
tasks = 
    bash_script_string_5
  bash_script_string_6 lotsa args lotsa args lotsa args lotsa args lotsa args lotsa args lotsa args
    bash_script_string_7

[job3]
depends = job1, job2
task1 = bash_script_string_7

The second is more verbose but would allow multiline bash commands

# Workflow File

[resources]
cluster_type = SLURMCluster # Could be any cluster type supported by Job Queue, but obviously must match the Workload Manager used by the HPC Cluster.
n_workers = 10
# per worker params (any of those supported by that cluster type in Jobqueue)
cores = 8
processes = 4
memory = 16GB
project = MyProject
Walltime = 01:00:00
queue = normal

[job1]
tasks =
  echo "my name is dharhas" 
  mv '~/Downloads/SYNTH (1).zip' SYNTH.zip 

[job2]
depends = job1
task1 = bash_script_string_5
task2 = bash_script_string_6 lotsa 
  args lotsa args lotsa args lotsa args lotsa args lotsa args lotsa args
task3 =  bash_script_string_7

[job3]
depends = job1, job2
task1 = bash_script_string_8
dharhas commented 4 years ago

INI is a very ill-defined format, but we can stick to the portion supported by Python 3's configparser (scroll down on this page: https://docs.python.org/3/library/configparser.html).
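
A rough sketch of reading the first option with configparser (file name hypothetical); the multi-line tasks value comes back as one string that we split ourselves:

import configparser

config = configparser.ConfigParser(inline_comment_prefixes=("#",))
config.read("workflow.ini")

resources = dict(config["resources"])
jobs = {}
for section in config.sections():
    if section == "resources":
        continue
    depends = [d.strip() for d in config[section].get("depends", "").split(",") if d.strip()]
    tasks = [t.strip() for t in config[section].get("tasks", "").splitlines() if t.strip()]
    jobs[section] = {"depends": depends, "tasks": tasks}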

Adam-D-Lewis commented 4 years ago

Do the tasks need to have a name? Is that so it can be more easily read in by the toml parser?

I thought we would want the tasks to have names so researchers can get more value from Dask's profiling tool. The alternative would be to assign names like (jobname, 0), (jobname, 1), etc. or maybe infer some reasonable name e.g. by looking at the first word of the bash command.

It would be difficult for some researchers to deal with quotes, particularly if they have a command that then has quotes in it. Would they need to escape the quotes inside and/or know the difference between " and ' ?

Good point, I didn't think of this. TOML supports multi-line literal strings with three single quotes. That likely wouldn't interfere with the bash commands.

task5 = '''bash_command &&
another_bash_command'''

Something like this would be ideal, but I understand that getting those commands read into an array would be tricky and require additional parsing, which is another thing to maintain. Thoughts?

[job2.tasks]
bash_script_string_5
# This is one bash command
bash_script_string_6 \
   -with-some-args
bash_script_string_7
bash_script_string_8

Let me know if you think the additional resolution in the profiler is worth it or not.

It seems like it would be a bit cumbersome to allow tasks to be dependent on other tasks in the same job.

Nope, it's not necessary to have tasks depend upon other tasks in a job, but it would be very nice if tasks could depend upon tasks in another job. That would be a nice-to-have, not a necessity.

By using each job as a layer in a Dask HighLevelGraph, the job dependencies would take care of the case where job2_taskX depends on job1_taskY. I'm pretty sure what would happen is that all of job1's tasks would need to complete before any of job2's tasks start. If this isn't acceptable, then I think we'd need to use the lower-level task graph.

In this format, could resources.worker be both a default parameter and a per-job one?

[job1]
[resources.worker]
cores = 8
processes = 4
memory = "56GB"
project = "MyOtherProject"
walltime = "01:00:00"
queue = "normal"
[job1.tasks]
bash_script_string_5
bash_script_string_6 \
   -with-some-args
bash_script_string_7
bash_script_string_8

So you want to be able to have different worker specs per job? Is it okay to use the same worker specs for each job and simply scale up to more or fewer workers per job?

dharhas commented 4 years ago

The resource spec and the task description need to be separate sections, or not even in the same file; otherwise you make the configuration too specific to one machine.

Adam-D-Lewis commented 4 years ago

@dharhas If we use the .ini files and the first option you showed, could we separate the bash commands with ";"? E.g.

# Workflow File

[resources]
cluster_type = SLURMCluster # Could be any cluster type supported by Job Queue, but obviously must match the Workload Manager used by the HPC Cluster.
n_workers = 10
# per worker params (any of those supported by that cluster type in Jobqueue)
cores = 8
processes = 4
memory = 16GB
project = MyProject
Walltime = 01:00:00
queue = normal

[job1]
tasks =
  echo "my name is dharhas";
  mv '~/Downloads/SYNTH (1).zip' SYNTH.zip 

[job2]
depends = job1
tasks = 
    bash_script_string_5;
    bash_script_string_6 lotsa args
    lotsa args
    lotsa args 
    lotsa args
    lotsa args;
    bash_script_string_7

[job3]
depends = job1, job2
task1 = bash_script_string_7

If so, the user is forced to use "&&" instead of ";" if they want to execute multiple bash commands in a single task.
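
A rough sketch of that splitting, applied to the raw multi-line value configparser would hand back (a literal ";" inside a command would break this naive split):

raw_tasks = """
    bash_script_string_5;
    bash_script_string_6 lotsa args
    lotsa args;
    bash_script_string_7
"""
# One task per ";"-separated chunk; newlines and indentation inside a chunk collapse to spaces.
commands = [" ".join(chunk.split()) for chunk in raw_tasks.split(";") if chunk.strip()]
# ['bash_script_string_5', 'bash_script_string_6 lotsa args lotsa args', 'bash_script_string_7']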

jerowe commented 4 years ago

Worker Resources

So you want to be able to have different worker specs per job? Is it okay to use the same worker specs for each job and simply scale up to more or fewer workers per job?

Yes to the first. I'm not sure what the second means, and since I don't, it probably means I'm trying to hammer Dask into doing something that could be done better. ;-) Could you explain?

Jobs can have vastly different computational needs. The first step in the analysis might be something crazy, then some QC that's not so crazy, so on and so forth.

# Default resources
[resources]
cluster_type = SLURMCluster # Could be any cluster type supported by Job Queue, but obviously must match the Workload Manager used by the HPC Cluster.
n_workers = 10
# per worker params (any of those supported by that cluster type in Jobqueue)
cores = 8
processes = 4
memory = 16GB
project = MyProject
Walltime = 01:00:00
queue = normal

[job1]
[job1.resources.worker]
cores = 8
processes = 4
memory = "56GB"
project = "MyOtherProject"
walltime = "01:00:00"
queue = "normal"
[job1.tasks]
...
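
A rough sketch (hypothetical helper name) of how per-job overrides could be layered on top of the [resources] defaults:

def resolve_worker_spec(defaults, job_table):
    # Sketch only: start from the default worker spec and apply any
    # [jobN.resources.worker] overrides on top of it.
    spec = dict(defaults)
    spec.update(job_table.get("resources", {}).get("worker", {}))
    return spec

defaults = {"cores": 8, "processes": 4, "memory": "16GB",
            "project": "MyProject", "walltime": "01:00:00", "queue": "normal"}
job1 = {"resources": {"worker": {"memory": "56GB", "project": "MyOtherProject"}}}
print(resolve_worker_spec(defaults, job1))  # memory/project overridden, the rest inherited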

Editing this because I had a thought. Is each [job] a separate SLURM submission or no?

Task Definition

I thought we would want the tasks to have names so researchers can get more value from Dask's profiling tool. The alternative would be to assign names like (jobname, 0), (jobname, 1), etc. or maybe infer some reasonable name e.g. by looking at the first word of the bash command.

That's a good point, and it's good for dependency declaration too I'd imagine. So let's shelve having a parser there.

Good point, I didn't think of this. TOML supports multi-line literal strings with three single quotes. That likely wouldn't interfere with the bash commands anymore.

I think this option looks easier than having an arbitrary separator.

task5 = '''bash_command &&
another_bash_command'''

Do task names have to be unique across jobs?

Adam-D-Lewis commented 4 years ago

Worker Resources

So you want to be able to have different worker specs per job? Is it okay to use the same worker specs for each job and simply scale up to more or fewer workers per job?

Yes to the first. I'm not sure what the second means, and since I don't, it probably means I'm trying to hammer Dask into doing something that could be done better. ;-) Could you explain?

One thing that is making communication more difficult is that the word "job" is overloaded in our discussion. The way I understand what we've decided to do is we'll use a python script to:

  1. Start an HPC (SLURM, PBS, etc.) cluster.
  2. Provision Dask workers (Each worker is an HPC "job", but only defines the amount of computational resources available, NOT the work to be done.)
  3. Parse the workflow file (likely to use TOML format at this point)
  4. Convert parsed workflow file to a Dask High Level Graph
  5. Execute the Dask High Level Graph using the Dask Workers

This video https://www.youtube.com/watch?v=FXsgmwpRExM&t=340s (watch til 10:32) shows essentially the same thing as steps 1, 2, and 5, and helps show what a "job" actually is. A job is only the resources (and the walltime for those resources) you are provisioning from the supercomputer. Essentially, a Dask worker and an HPC "job" are the same thing.

The word "job" in the workflow file is misleading, because it's not a job in the sense that HPC folks think about jobs. A better name might be "step" in the workflow file. After you provision your computational resources, the dask scheduler is going to assign each of the tasks from each of the "steps" to the available computational resources (the "jobs" or workers provisioned earlier) while managing dependencies appropriately. What is important, is that at least as many tasks as you have cores on a single worker can run simultaneously without running out of RAM. (E.g. if you have 36 cores per worker, then at least 36 tasks need to be able to run simultaneously without running out of RAM)

Jobs can have vastly different computational needs. The first step in the analysis might be something crazy, then some QC that's not so crazy, so on and so forth.

So we were discussing whether we should 1) simply change the number of jobs/workers we have per step, or 2) change the amount of resources on each job/worker. I believe option 1 is more natural in Dask.

Adam-D-Lewis commented 4 years ago

It was good talking to @jerowe yesterday, as it helped me understand the requirements more fully. One of the requirements she mentioned is that different tasks be run on workers with different specs (cores, memory, etc.) rather than simply scaling the number of workers up. She mentioned this is because some tasks require large amounts of memory, and others require much less. We could use the largest worker node needed on every task, but she is worried that utilization of the nodes would be low, which would be a poor use of HPC resources (HPC admins wouldn't be happy).

I've hit a road block finding a way to include different worker types within a single cluster. There is an open issue requesting this feature, but no plans have been made to implement it (https://github.com/dask/distributed/issues/2118).

I could create various clusters with different specs (e.g. large_cluster and small_cluster), as mentioned in https://github.com/dask/dask-jobqueue/issues/381#issuecomment-584008839, but I don't believe the different clusters can share a scheduler or a task graph, so it seems very awkward to execute the workflow file since the dependencies would not lie within a single task graph.

I'll keep working on this, but any thoughts on how to proceed are welcome.

jerowe commented 4 years ago

hi @balast how is this coming along?

Adam-D-Lewis commented 4 years ago

I tried the following:

I created an overarching LocalCluster that calculated the task graph. I passed around multiple SLURMClusters as a global variable, and each task in the task graph was a function which ran its bash script on the appropriate SLURMCluster.

That approach had some disadvantages, and generally felt like I was trying to hammer a square peg into a round hole. The disadvantages were that the LocalCluster could not be a multiprocess cluster (multithreaded was okay) because that required serialization of the cluster object, which I couldn't get to work. I think a single process might be okay since it's just solving the task graph and not running the actual jobs directly, but even when I did use a single process for the LocalCluster, my test script failed about half the time it ran because of a hard-coded 3-second timeout in the distributed library.

I searched around a bit more after that wasn't working out, and found https://github.com/dask/dask-jobqueue/issues/378 which seems like a better approach. The author of the issue extended the SGECluster and SGEJob classes to get behavior similar to what we need. I'm working on something similar for SLURM clusters, and hope to have something ready to test early next week.

dharhas commented 4 years ago

@jerowe meta question: why are we trying to use Dask here, and why are the existing SLURM queues insufficient (at least when we are talking about HPC)?

Adam-D-Lewis commented 4 years ago

@jerowe I wasn't able to work on this much last week, but I got my extension of SLURMCluster working today, which can scale workers with differing specs up and down. I'll push it to the repo later tonight or tomorrow. I still need to go back to parsing the TOML file, but that shouldn't take too long to do.

Adam-D-Lewis commented 4 years ago

@jerowe I'm looking for some clarification on the target audience for this project. Is it okay if this project only supports HPC users (people using a cluster supported by dask-jobqueue) or is that too limiting? That is the easiest thing to do given the workaround I had to do to allow dask to support various worker specs in a single cluster. If we want to support non-HPC users also, one way that might be done is by simply limiting those use cases to a single worker spec.

jerowe commented 4 years ago

@balast that's fine. Keep it super simple to start with. If anyone wants more I'm sure they will have no problem telling us all about it. ;-)

jerowe commented 4 years ago

And for the target audience -

Picture a biologist who can use Excel, but is not a software engineer. He hasn't ever opened up a terminal or worked on a remote server.

You want to get them from reading a methods section of a paper to cutting and pasting the commands into a text (toml) document and then being able to submit that to the cluster with as little pain as possible.

Another way to think of this is as training software. It doesn't have to (and probably really shouldn't) take care of every use case ever. When users get to anything that complex, they will move on to Snakemake or Nextflow.

dharhas commented 4 years ago

my bad, accidentally closed the issue.

@jerowe if this is the use case, then I suggest we don't worry as much about the idea of multiple resource types. That is a very HPC-specific and advanced use case. Let's just assume a homogeneous cluster, and then it doesn't matter whether they run on HPC, cloud, or a laptop.

Adam-D-Lewis commented 4 years ago

Okay, since we no longer have the multiple-resource-type requirement, I went ahead and wrote a method to parse a workflow file (also included) into a task graph. Very much a work in progress, but the basic functionality is there on the toml_parse branch.

SultanOrazbayev commented 3 years ago

I'm not sure I understand the problem you are solving correctly, but could it be addressed by combining Snakemake (which constructs its own DAG that is aware of HPC resource requirements) and Dask?