cheng-hsiang-chiu opened this issue 1 year ago
Hello from task 0 and outputs 100
Hello from task 2, reads 100, and outputs 200
Hello from task 1, reads 100, and outputs 100
Hello from task 3, reads 200, and outputs 600
Hello from task 4, reads 600, and outputs 2400
Hello from task 5, reads 2400, and outputs 12000
Passed the computation
The order of the output is scrambled, but that's OK as long as the values are correct (which they seem to be).
I suggest you compile PaRSEC in debug mode (--enable-debug to configure, or -DPARSEC_DEBUG to cmake), then create the file ${HOME}/.parsec/mca-params.conf and add the following inside:
debug_verbose = 100
dtd_traversal_info = 100
You should get a log of output describing what the runtime does, and how it builds and executes the graph of tasks.
Yeah, I can reproduce this, i.e., the wrong result happens sometimes. From the dot file, there is no dependency between these tasks. So these tasks are totally independent, and that's why the result is sometimes wrong when using multiple threads.
To generate the dot file, which shows the dependencies between tasks, please follow these steps:
1. Turn on PARSEC_PROF_GRAPHER (CMake option)
2. ./exe -dot my_dot
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot
The reason, I guess, is that the tasks are inserted based on a user-defined array, i.e., int results[6]; in your case. Would you please change this results array to parsec-defined data, following the examples here (e.g., dtd_test_broadcast.c)?
@bosilca, please point it out if I'm wrong, as I'm not quite familiar with how DTD works.
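For illustration, here is a rough sketch of what the insert loop could look like if the chain went through parsec-managed tiles instead of the raw int array. The data collection A, the one-int-per-key layout, and the task bodies chain_first/chain_next are assumptions on my side, modeled loosely on dtd_test_broadcast.c; the descriptor setup and the task bodies are omitted.

/* Sketch only: the task bodies (not shown) would unpack their arguments with
   parsec_dtd_unpack_args() and return PARSEC_HOOK_RETURN_DONE, as in the DTD tests. */
int chain_first(parsec_execution_stream_t *es, parsec_task_t *this_task);
int chain_next(parsec_execution_stream_t *es, parsec_task_t *this_task);

/* A is assumed to be a parsec data collection already registered with DTD,
   holding one int-sized tile per key 0..5. */
for (int index = 0; index < 6; ++index) {
  if (index == 0) {
    parsec_dtd_insert_task(dtd_tp, chain_first, 0, PARSEC_DEV_CPU, "chain_first",
      sizeof(int), &index, PARSEC_VALUE,
      PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, 0), PARSEC_OUTPUT | TILE_FULL,
      PARSEC_DTD_ARG_END);
  } else {
    parsec_dtd_insert_task(dtd_tp, chain_next, 0, PARSEC_DEV_CPU, "chain_next",
      sizeof(int), &index, PARSEC_VALUE,
      PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, index - 1), PARSEC_INPUT | TILE_FULL,
      PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, index), PARSEC_OUTPUT | TILE_FULL,
      PARSEC_DTD_ARG_END);
  }
}

Because every non-first task declares an INPUT flow on key index-1 and an OUTPUT flow on key index, the runtime can see the overlap between consecutive tasks and serialize the chain itself.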
The problem here is not the runtime, but the memory access pattern and their interaction with the hardware cache coherence protocol. The output from task0 is used as an input for task1, but there is no memory ordering constraint between these two tasks, so if they execute on different cores incorrect values would be used. Add a memory barrier at the end of each task (write memory barrier), or at the beginning of a following task (read memory barrier) and your problem will go away.
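As a minimal sketch of that barrier suggestion (my own illustration using C11 fences; write_stage and read_stage are hypothetical helpers called from the task bodies), assuming the data stays in the plain results array:

#include <stdatomic.h>

int results[6];  /* plain user-owned array, as in dtd.txt */

/* producer side: publish the value, then a write (release) fence at the end of the task */
void write_stage(int index, int value)
{
  results[index] = value;
  atomic_thread_fence(memory_order_release);
}

/* consumer side: a read (acquire) fence at the beginning of the following task */
int read_stage(int index)
{
  atomic_thread_fence(memory_order_acquire);
  return results[index - 1];
}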
Hello, thank you for all the great suggestions. I am trying to use dot to verify the dependencies. However, I am not able to generate the dot file. Could you correct me if I am wrong?
I followed the steps suggested by QingleiCao.
The steps, run in the build directory, are:
1. ../configure --enable-prof-grapher --with-hwloc --with-mpi
2. ./tests/dsl/dtd/dtd_test_broadcast -dot my_dot
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot
I got an error message "./dtd_test_broadcast: Error: unknown option -dot"
I guess step 2 is not working properly. How can I fix it? Thank you.
Hi, I'm sorry for the confusion here, but I guess there are some recent updates in PaRSEC about this flag. Please try this instead:
./tests/dsl/dtd/dtd_test_broadcast --mca profile_dot my_dot
Hi, thank you so much for the instant reply @QingleiCao
I tried the following steps in the build directory:
1. cmake ../ -DPARSEC_PROF_TRACE=ON -DPARSEC_PROF_GRAPHER=ON
2. ./tests/dsl/dtd/dtd_test_broadcast --mca profile_dot my_dot
3. dot -Tpdf -o my_dot.pdf my_dot-0.dot
For step 3, this may be a stupid question, but I did not find my_dot-0.dot in the build directory. Where is that file generated?
It's my pleasure, and we are always here to help.
The my_dot-0.dot file is generated in the current working directory, so I think it's right under your build directory. Here is an output example:
-rw-r--r--. 1 qcao3 users 362 May 20 10:47 my_dot-0.dot
[qcao3@saturn build]$ cat my_dot-0.dot
digraph G {
task_rank_0_1_0 [shape="polygon",label="<30/0> task_rank_0(_)",tooltip="tpid=1:tcid=1:tcname=task_rank_0:tid=0"];
task_rank_0_1_0 -> parsec_dtd_data_flush_1_1 [label="A=>A",color="#00FF00",style="solid"]
parsec_dtd_data_flush_1_1 [shape="polygon",label="<30/0> parsec_dtd_data_flush(_)",tooltip="tpid=1:tcid=0:tcname=parsec_dtd_data_flush:tid=1"];
}
I confirm, by default the dot file is generated in the same directory where the binary executable is located. If you want to change this default behavior, you need to provide an absolute path for the MCA argument. For example, --mca profile_dot /tmp/my_dot will place the dot files in /tmp.
Hello,
I got the dot file generated correctly. Thank you for the assistance.
When I ran my code, I found that the execution time does not improve as the number of threads increases. My code is attached: dtd_test_embarrassing_parallelism.txt. The code inserts DTD tasks and compares the execution time under different core counts. The inserted DTD tasks have no dependencies on each other, so I would expect better performance with more threads. But the execution times I got for 2^21-1 tasks are the following:
1 thread -> 8821.62 us
2 threads -> 6242.82 us
4 threads -> 6155.68 us
8 threads -> 6202.23 us
The performance does not scale with the core count. Does my implementation have bugs? Could you help me? To run the code, simply type:
./a.out core size
where core is the number of threads and size in my experiment is 20.
I assume you mean ms instead of us, otherwise your per-task runtime would be about 4 nanoseconds (i.e., a dozen or so instructions). However, even if you're looking at 8821.62 ms (8.8 s) for 2^21 tasks, you end up with 4.2 us per task on a single thread, and 2.9 us on 2 threads. With more threads you're essentially limited by the speed of the main thread discovering tasks, so they will likely stay idle. Looking at your work function, it seems that your task granularity is on the order of a couple thousand instructions (O(N^3) but N is 16). Unfortunately, DTD is not well optimized for such small tasks. Is this your real use case, or could you work with coarser tasks?
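As a rough illustration of what coarser tasks could mean here (my own sketch; batched_work, BATCH, and num_tasks are hypothetical names), several of the small multiplications can be folded into a single DTD task so the per-task insertion and scheduling overhead is amortized:

#define BATCH 64  /* hypothetical: number of small multiplications folded into one DTD task */

/* hypothetical task body: runs BATCH independent small multiplications */
int batched_work(parsec_execution_stream_t *es, parsec_task_t *this_task)
{
  int first;
  (void)es;
  parsec_dtd_unpack_args(this_task, &first);
  for (int i = first; i < first + BATCH; ++i) {
    /* small_gemm(i);  -- the original per-task computation */
  }
  return PARSEC_HOOK_RETURN_DONE;
}

/* one insert per batch instead of one insert per multiplication */
for (int first = 0; first < num_tasks; first += BATCH) {
  parsec_dtd_insert_task(dtd_tp, batched_work, 0, PARSEC_DEV_CPU, "batched_work",
    sizeof(int), &first, PARSEC_VALUE,
    PARSEC_DTD_ARG_END);
}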
Yes, the time unit is milliseconds. Sorry about the confusion.
This is my real use case in which every task is performing a matrix multiplication of small size, say less than 64. Every task costs several microseconds (< 100 microseconds).
Another question regarding DTD is that we would like to add a range of flows at runtime. Right now, we define the flows as follows:
parsec_dtd_insert_task(
dtd_tp,
task_rank_0,
0,
PARSEC_DEV_CPU,
"task_rank_0",
sizeof(int), &index, PARSEC_VALUE,
PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, index), PARSEC_INOUT | TILE_FULL | PARSEC_AFFINITY,
PARSEC_DTD_ARG_END
);
In some cases, we need to add new flows when certain runtime conditions are satisfied. Take the following C++ pseudocode as an example:
std::vector<task> TASKS={task0, task1, task2, task3};
if (condition) TASKS.push_back(task4);
parsec_dtd_insert_task(
dtd_tp,
TASKS.begin(), TASKS.end(),
PARSEC_DTD_ARG_END
);
I did not find similar examples in the repository. If there is, could you provide it to me? Thank you.
@cheng-hsiang-chiu please check this paper to understand the limitations of the insert_task programming paradigm (which is the base of DTD, but also of StarPU and other runtimes). It will give you an idea (and a model) of how far you can go with DTD.
Regarding the number of flows, today we have an assumption that a task (and the associated function) should always have the same number of flows. This means you cannot use GEMM with a variable number of flows. That being said, there is nothing that prevents you from creating GEMM3, GEMM4 and GEMM5, which would be a GEMM with 3, 4 or 5 flows, and handling the flows as you want in the task function.
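A possible sketch of that suggestion (my own illustration; gemm3_body, gemm4_body, the data collections A, B, C, D and the keys ka..kd are all hypothetical): each flow count gets its own task class with a fixed argument list, and the choice between them is made at insert time.

/* hypothetical task bodies, one per flow count; each unpacks exactly its own flows */
int gemm3_body(parsec_execution_stream_t *es, parsec_task_t *this_task);
int gemm4_body(parsec_execution_stream_t *es, parsec_task_t *this_task);

if (!condition) {
  /* 3-flow variant: C += A * B */
  parsec_dtd_insert_task(dtd_tp, gemm3_body, 0, PARSEC_DEV_CPU, "GEMM3",
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, ka), PARSEC_INPUT | TILE_FULL,
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(B, kb), PARSEC_INPUT | TILE_FULL,
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(C, kc), PARSEC_INOUT | TILE_FULL,
    PARSEC_DTD_ARG_END);
} else {
  /* 4-flow variant: same computation plus one extra input decided at runtime */
  parsec_dtd_insert_task(dtd_tp, gemm4_body, 0, PARSEC_DEV_CPU, "GEMM4",
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(A, ka), PARSEC_INPUT | TILE_FULL,
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(B, kb), PARSEC_INPUT | TILE_FULL,
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(D, kd), PARSEC_INPUT | TILE_FULL,
    PASSED_BY_REF, PARSEC_DTD_TILE_OF_KEY(C, kc), PARSEC_INOUT | TILE_FULL,
    PARSEC_DTD_ARG_END);
}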
Dear PaRSEC developers,
Recently, I have been using PaRSEC for a mini project and have some questions regarding the dynamic task discovery (DTD) paradigm. I use DTD to do some simple work. Assume there is an array of six integers called results. The operations on the array are shown in the following figure.
Task 0 does some computations and writes the value to results[0]. Task 1 reads results[0], does some computations, and writes the value to results[1]. Task 2 reads results[1], does some computations, and writes the value to results[2]. Task 3 to task 5 follow the same pattern. That is, except for task 0, all tasks (task 1 to task 5) first read results[index-1] and then write to results[index], where index is 1 to 5. The task dependencies are: task 0->task 1->task 2->task 3->task 4->task 5
To implement the above work, I wrote the following code to create the six tasks:
for (int index = 0; index < 6; ++index) {
  if (index == 0) {
    parsec_dtd_insert_task(
      dtd_tp,
      task_0,
      0,
      PARSEC_DEV_CPU,
      "task_0",
      sizeof(int), &index, PARSEC_VALUE,
      sizeof(int), &results[0], PARSEC_OUTPUT | PARSEC_REF,
      PARSEC_DTD_ARG_END
    );
  } else {
    parsec_dtd_insert_task(
      dtd_tp,
      tasks,
      0,
      PARSEC_DEV_CPU,
      "tasks",
      sizeof(int), &index, PARSEC_VALUE,
      sizeof(int), &results[index-1], PARSEC_INPUT | PARSEC_REF,
      sizeof(int*), &results[index], PARSEC_OUTPUT | PARSEC_REF,
      PARSEC_DTD_ARG_END
    );
  }
}
The whole implementation is the attached dtd.txt file.
The following are my questions: Q1. My implementation (dtd.txt) gives me the same correct results as the sequential implementation (lines 135:143 in dtd.txt) with one core. However, if I specify more cores, I sometimes get wrong results. How can I fix it? The work runs on a single machine, not in a distributed environment.
Q2. When inserting tasks using parsec_dtd_insert_task, PaRSEC dynamically discovers the tasks and figures out the dependencies based on the operation types (e.g., INPUT and OUTPUT). Once a task is created and inserted into the parsec taskpool, does the PaRSEC runtime immediately schedule the task? For example, in my case, does PaRSEC immediately schedule task 0 because task 0 has no dependency, and then schedule task 1 after task 0 is finished? Or does the PaRSEC runtime first build a DAG of the six tasks (task 0->task 1->task 2->task 3->task 4->task 5) and then schedule the tasks?
Thank you so much. dtd.txt