ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler driven by architectural features such as NUMA nodes and algorithmic features such as data reuse.

Does PaRSEC build a DAG and then schedule tasks, or does it schedule tasks right after they are inserted? #516

Open cheng-hsiang-chiu opened 1 year ago

cheng-hsiang-chiu commented 1 year ago

Dear PaRSEC developers,

Recently, I have been using PaRSEC for a mini project and have some questions regarding the dynamic task discovery (DTD) paradigm. I use DTD to do some simple work. Assume there is an array of six integers called results. The operations on the array are shown in the attached figure (DTD question).

Task 0 does some computations and writes the value to results[0]. Task 1 reads results[0], does some computations, and writes the value to results[1]. Task 2 reads results[1], does some computations, and writes the value to results[2]. Tasks 3 to 5 follow the same pattern. That is, except for task 0, every task (task 1 to task 5) first reads results[index-1] and then writes to results[index], where index is 1 to 5. The task dependencies are: task 0 -> task 1 -> task 2 -> task 3 -> task 4 -> task 5.

To implement this, I wrote the following code to create the six tasks:

for (int index = 0; index < 6; ++index) {
  if (index == 0) {
    parsec_dtd_insert_task(
        dtd_tp, task_0, 0, PARSEC_DEV_CPU, "task_0",
        sizeof(int), &index, PARSEC_VALUE,
        sizeof(int), &results[0], PARSEC_OUTPUT | PARSEC_REF,
        PARSEC_DTD_ARG_END);
  }
  else {
    parsec_dtd_insert_task(
        dtd_tp, tasks, 0, PARSEC_DEV_CPU, "tasks",
        sizeof(int), &index, PARSEC_VALUE,
        sizeof(int), &results[index-1], PARSEC_INPUT | PARSEC_REF,
        sizeof(int), &results[index], PARSEC_OUTPUT | PARSEC_REF,
        PARSEC_DTD_ARG_END);
  }
}

The whole implementation is in the attached dtd.txt file.

The following are my questions:

Q1. My implementation (dtd.txt) gives the same correct results as the sequential implementation (lines 135:143 in dtd.txt) when run with one core. However, when I specify a higher core count, I sometimes get wrong results. How could I fix this? The program runs on a single machine, not in a distributed environment.

Q2. When inserting tasks using parsec_dtd_insert_task, PaRSEC dynamically discovers the tasks and figures out the dependencies based on the operation types (e.g., INPUT and OUTPUT). Once a task is created and inserted into the PaRSEC taskpool, does the runtime immediately schedule it? For example, does PaRSEC immediately schedule task 0 because it has no dependencies, and then schedule task 1 after task 0 is finished? Or does the runtime first build a DAG of the six tasks (task 0 -> task 1 -> task 2 -> task 3 -> task 4 -> task 5) and only then schedule them?

Thank you so much. dtd.txt

bosilca commented 1 year ago
  1. A correct execution of a correct implementation should not depend on the number of compute resources used. Once the DAG of tasks is built, there are clear ways of keeping the result consistent, and I'm pretty sure we follow them. I ran your code with the current master version (changing the number of cores to -1 to enable PaRSEC's automatic detection) and I get correct results.

    Hello from task 0 and outputs 100
    Hello from task 2, reads 100, and outputs 200
    Hello from task 1, reads 100, and outputs 100
    Hello from task 3, reads 200, and outputs 600
    Hello from task 4, reads 600, and outputs 2400
    Hello from task 5, reads 2400, and outputs 12000

    Passed the computation

    The order of the output is scrambled, but that's OK as long as the values are correct (which they seem to be).

  2. Once a task is taken into consideration by the runtime, it can be started right away if all of its inputs are available, or delayed until they become available. If I understand your question correctly, PaRSEC does not need to build the entire graph of tasks before starting to execute them. In fact, it depends on the order in which you construct the graph and when you start the runtime. To give an example: if I build a PaRSEC context, then insert tasks, and then start the context, I first build the DAG and only then start executing it. On the contrary, if I build a PaRSEC context, start it, and then add tasks, the tasks will start executing (or at least be considered ready) as soon as all their inputs are available.
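    The sketch below illustrates the two orderings. It is not taken from the repository: it follows the public DTD entry points, but the header path and some signatures vary across PaRSEC versions (older releases, for instance, ship the DTD header as parsec/interfaces/superscalar/insert_function.h), and hello_body is just an illustrative name.

    #include <stdio.h>
    #include "parsec.h"
    #include "parsec/interfaces/dtd/insert_function.h"

    /* Trivial task body: prints the value it was given at insertion time. */
    static int hello_body(parsec_execution_stream_t *es, parsec_task_t *this_task)
    {
        int index;
        (void)es;
        parsec_dtd_unpack_args(this_task, &index);
        printf("Hello from task %d\n", index);
        return PARSEC_HOOK_RETURN_DONE;
    }

    int main(int argc, char *argv[])
    {
        parsec_context_t  *parsec = parsec_init(-1, &argc, &argv);
        parsec_taskpool_t *dtd_tp = parsec_dtd_taskpool_new();
        parsec_context_add_taskpool(parsec, dtd_tp);

        /* "Start first, insert later": each inserted task may execute as soon
         * as its inputs are satisfied, while insertion is still going on. */
        parsec_context_start(parsec);
        for (int index = 0; index < 6; ++index)
            parsec_dtd_insert_task(dtd_tp, hello_body, 0, PARSEC_DEV_CPU, "hello",
                                   sizeof(int), &index, PARSEC_VALUE,
                                   PARSEC_DTD_ARG_END);
        /* Moving parsec_context_start() below this loop gives the other
         * ordering: the whole set of tasks is inserted before anything runs. */

        parsec_context_wait(parsec);   /* wait for all inserted work to finish */
        parsec_taskpool_free(dtd_tp);
        parsec_fini(&parsec);
        return 0;
    }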
bosilca commented 1 year ago

I suggest you compile parsec in debug mode (--enable-debug to configure, or -DPARSEC_DEBUG to cmake) and then create the file ${HOME}/.parsec/mca-params.conf and add inside

debug_verbose = 100
dtd_traversal_info = 100

You should get a log of output describing what the runtime does, and how it builds and executes the graph of tasks.

QingleiCao commented 1 year ago

Yeah, I can reproduce this, i.e., the wrong result happens sometimes. From the dot output, there is no dependency between these tasks. So the tasks are treated as completely independent, and that's why the result is sometimes wrong when using multiple threads.

To generate the dot file, which shows dependencies between tasks, please follow these steps:

1. PARSEC_PROF_GRAPHER ON (CMAKE option)
2. ./exe -dot my_dot 
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot 

The reason, I guess, is that the tasks are inserted based on a user-defined array, i.e., int results[6]; in your case. Would you please change results to PaRSEC-defined data, following the examples here (e.g., dtd_test_broadcast.c)?

@bosilca, please point it out if I'm wrong, as I'm not quite familiar with the way DTD handles this.

bosilca commented 1 year ago

The problem here is not the runtime, but the memory access pattern and its interaction with the hardware cache coherence protocol. The output from task 0 is used as an input for task 1, but there is no memory ordering constraint between these two tasks, so if they execute on different cores incorrect values can be used. Add a memory barrier at the end of each task (write memory barrier), or at the beginning of the following task (read memory barrier), and your problem will go away.
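A minimal sketch of that suggestion, using C11 fences (PaRSEC also ships its own atomic/fence wrappers that could be used instead). The body names are illustrative, not taken from dtd.txt, and the PaRSEC types come from the same DTD headers as the sketch earlier in this thread.

#include <stdatomic.h>

/* Producer-style body: compute, store the result, then issue a write barrier. */
static int producer_body(parsec_execution_stream_t *es, parsec_task_t *this_task)
{
    (void)es; (void)this_task;
    /* ... compute and write results[index] ... */
    atomic_thread_fence(memory_order_release);   /* write memory barrier at the end */
    return PARSEC_HOOK_RETURN_DONE;
}

/* Consumer-style body: issue a read barrier before reading the predecessor's output. */
static int consumer_body(parsec_execution_stream_t *es, parsec_task_t *this_task)
{
    (void)es; (void)this_task;
    atomic_thread_fence(memory_order_acquire);   /* read memory barrier at the beginning */
    /* ... read results[index-1], compute, write results[index] ... */
    return PARSEC_HOOK_RETURN_DONE;
}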

cheng-hsiang-chiu commented 1 year ago

Hello, thank you for all the great suggestions. I am trying to use dot to verify the dependencies. However, I am not able to generate the dot file. Could you correct me if I am doing something wrong?

I followed the steps suggested by QingleiCao, running them in the build directory:

1. ../configure --enable-prof-grapher --with-hwloc --with-mpi
2. ./tests/dsl/dtd/dtd_test_broadcast -dot my_dot
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot

I got an error message "./dtd_test_broadcast: Error: unknown option -dot"

I guess step 2 is not working properly. How could I fix it? Thank you.

QingleiCao commented 1 year ago

Hi, I'm sorry for the confusion here, but I guess there have been some recent updates in PaRSEC regarding this flag. Please try this instead:

./tests/dsl/dtd/dtd_test_broadcast --mca profile_dot my_dot

cheng-hsiang-chiu commented 1 year ago

Hi, thank you so much for the instant reply @QingleiCao

I tried the following steps in the build directory:

1. cmake ../ -DPARSEC_PROF_TRACE=ON -DPARSEC_PROF_GRAPHER=ON
2. ./tests/dsl/dtd/dtd_test_broadcast --mca profile_dot my_dot
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot

For step 3, this may be a stupid question, but I did not find my_dot-0.dot in the build directory. Where is that file generated?

QingleiCao commented 1 year ago

It's my pleasure, and we are always here to help.

The my_dot-0.dot file is generated in the current working directory, so I think it's right under your build directory. Here is an output example.

-rw-r--r--. 1 qcao3 users 362 May 20 10:47 my_dot-0.dot

[qcao3@saturn build]$ cat my_dot-0.dot
digraph G {
task_rank_0_1_0 [shape="polygon",label="<30/0> task_rank_0(_)",tooltip="tpid=1:tcid=1:tcname=task_rank_0:tid=0"];
task_rank_0_1_0 -> parsec_dtd_data_flush_1_1 [label="A=>A",color="#00FF00",style="solid"]
parsec_dtd_data_flush_1_1 [shape="polygon",label="<30/0> parsec_dtd_data_flush(_)",tooltip="tpid=1:tcid=0:tcname=parsec_dtd_data_flush:tid=1"];
}

bosilca commented 1 year ago

I confirm, by default the dot file is generated in the same directory where the binary executable is located. If you want to change this default behavior you need to provide an absolute path for the MCA argument. As an example --mca profile_dot /tmp/my_dot will place the dot files in /tmp.

cheng-hsiang-chiu commented 1 year ago

Hello,

I got the dot file generated correctly. Thank you for the assistance.

When I ran my code, I found that the execution time does not improve as the number of threads increases. My code is attached: dtd_test_embarrassing_parallelism.txt. The code inserts DTD tasks and compares the execution time under different core counts. The inserted DTD tasks have no dependencies on each other, so I would expect better performance with more threads. But for 2^21-1 tasks I measured the following execution times:

1 thread -> 8821.62 us
2 threads -> 6242.82 us
4 threads -> 6155.68 us
8 threads -> 6202.23 us

The performance does not scale with the core count. Does my implementation have bugs? Could you help me? To run the code, simply type:

./a.out core size

where core is the number of threads and size in my experiment is 20.

devreal commented 1 year ago

I assume you mean ms instead of us, otherwise your per-task runtime would be about 4 nanoseconds (i.e., a dozen or so instructions). However, even if you're looking at 8821.62 ms (8.8 s) for 2^21 tasks, you end up with 4.2 us per task on a single thread, and 2.9 us on 2 threads. With more threads you're essentially limited by the speed of the main thread discovering tasks, so the extra threads will likely stay idle. Looking at your work function, it seems that your task granularity is on the order of a couple thousand instructions (O(N^3), but N is 16). Unfortunately, DTD is not well optimized for such small tasks. Is this your real use case, or could you work with coarser tasks?
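A minimal sketch of what "coarser tasks" could look like here: instead of one DTD task per tiny matrix multiplication, each task processes a contiguous block of them, amortizing the few-microsecond per-task overhead over many multiplications. CHUNK, total_tasks and small_gemm() are hypothetical names (not PaRSEC API); the insert_task call mirrors the ones shown earlier in this thread.

#define CHUNK 256   /* how many small multiplications each DTD task performs */

static int chunk_body(parsec_execution_stream_t *es, parsec_task_t *this_task)
{
    int first;   /* index of the first small problem in this chunk */
    (void)es;
    parsec_dtd_unpack_args(this_task, &first);
    for (int i = first; i < first + CHUNK; ++i)
        small_gemm(i);   /* hypothetical helper: the original fine-grained work item */
    return PARSEC_HOOK_RETURN_DONE;
}

/* Insertion loop: total_tasks / CHUNK insertions instead of total_tasks. */
for (int first = 0; first < total_tasks; first += CHUNK)
    parsec_dtd_insert_task(dtd_tp, chunk_body, 0, PARSEC_DEV_CPU, "chunk",
                           sizeof(int), &first, PARSEC_VALUE,
                           PARSEC_DTD_ARG_END);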

cheng-hsiang-chiu commented 1 year ago

Yes, the time unit is milliseconds. Sorry about the confusion.

This is my real use case in which every task is performing a matrix multiplication of small size, say less than 64. Every task costs several microseconds (< 100 microseconds).

Another question regarding DTD: we would like to add a range of flows at runtime. Right now, we define the flows as follows,

parsec_dtd_insert_task(
      dtd_tp,              /* DTD taskpool */
      task_rank_0,         /* task body (function pointer) */
      0,                   /* priority */
      PARSEC_DEV_CPU,      /* device type */
      "task_rank_0",       /* task name */
      sizeof(int), &index, PARSEC_VALUE,   /* scalar argument, copied by value */
      PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, index), PARSEC_INOUT | TILE_FULL | PARSEC_AFFINITY,   /* data flow on tile `index` of collection A */
      PARSEC_DTD_ARG_END
);

In some cases, we need to add new flows when certain runtime conditions are satisfied. Take the following C++ pseudocode as an example:

std::vector<task> TASKS = {task0, task1, task2, task3};
if (condition)
    TASKS.push_back(task4);
parsec_dtd_insert_task(
      dtd_tp,
      TASKS.begin(), TASKS.end(),
      PARSEC_DTD_ARG_END
);

I did not find similar examples in the repository. If there is, could you provide it to me? Thank you.

bosilca commented 1 year ago

@cheng-hsiang-chiu please check this paper to understand the limitations of the insert_task programming paradigm (which is the basis of DTD, but also of StarPU and other runtimes). It will give you an idea (and a model) of how far you can go with DTD.

Regarding the number of flows: today we have the assumption that a task (and the associated function) should always have the same number of flows. This means you cannot use a GEMM with a variable number of flows. That being said, there is nothing that prevents you from creating GEMM3, GEMM4 and GEMM5, which would be GEMMs with 3, 4 or 5 flows, handling the flows as you want in each task function.
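A minimal sketch of that suggestion: the runtime condition selects between two task classes, each with a fixed number of flows. gemm3_body/gemm4_body and the keys kA, kB, kC, kD are hypothetical names; the argument layout mirrors the insert_task call shown earlier in this thread.

if (!condition) {
    /* 3-flow variant */
    parsec_dtd_insert_task(dtd_tp, gemm3_body, 0, PARSEC_DEV_CPU, "gemm3",
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kA), PARSEC_INPUT | TILE_FULL,
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kB), PARSEC_INPUT | TILE_FULL,
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kC), PARSEC_INOUT | TILE_FULL | PARSEC_AFFINITY,
        PARSEC_DTD_ARG_END);
} else {
    /* 4-flow variant: the extra, condition-dependent flow lives in its own task class */
    parsec_dtd_insert_task(dtd_tp, gemm4_body, 0, PARSEC_DEV_CPU, "gemm4",
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kA), PARSEC_INPUT | TILE_FULL,
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kB), PARSEC_INPUT | TILE_FULL,
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kC), PARSEC_INOUT | TILE_FULL | PARSEC_AFFINITY,
        PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, kD), PARSEC_INPUT | TILE_FULL,
        PARSEC_DTD_ARG_END);
}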