BalyshevArtem opened 1 year ago
With #10132, for mobilenet_v1_1.0_224.circle.txt and a random h5 file with 50 records, I got the following results on a host with 8 threads (measured with time):
without parallel recording:
real 8m54,355s
user 8m54,286s
sys 0m0,048s
with parallel recording:
real 1m30,052s
user 10m8,754s
sys 0m0,671s
This looks like a very good feature! Before landing it, we need some tests:
- An option to turn off OpenMP to support other platforms, e.g., Arm. @mhs4670go already left a comment. PTAL https://github.com/Samsung/ONE/pull/10132#issuecomment-1327465549.
- A value check. Min/max values recorded in sequential and parallel settings must be the same. The easiest way to add a test would be to add pota_parallel_record_minmax_test in pota-quantization-value-test.
FYI, I forked https://github.com/BalyshevArtem/ONE/tree/parallel_record_minmax and ran pota-quantization-value-test with parallel_record (by modifying test_record_minmax.sh), but it failed.
104/111 Test #98: pota_record_minmax_test ....................***Failed 5.19 sec
-- Found RECORD-MINMAX: /home/hjj/ONE/build/release/compiler/record-minmax/record-minmax
-- Found CIRCLE_TENSORDUMP: /home/hjj/ONE/build/release/compiler/circle-tensordump/circle-tensordump
-- Found workdir: /home/hjj/ONE/build/release/compiler/common-artifacts
~/ONE/build/release/compiler/common-artifacts ~/ONE/build/release/compiler/pota-quantization-value-test
~/ONE/build/release/compiler/pota-quantization-value-test
FAILED
- Add_002.layer.uint8
- Add_002.channel.int16
...
FYI, I forked https://github.com/BalyshevArtem/ONE/tree/parallel_record_minmax and ran pota-quantization-value-test with parallel_record (by modifying test_record_minmax.sh), but it failed.
Hmm, I checked it locally just now and it passed:
artem_bal@a-balyshev02:~/ONE/infra/nncc/cmake-build-debug/compiler/pota-quantization-value-test$ ctest -R pota_record_minmax_test
Test project /home/artem_bal/ONE/infra/nncc/cmake-build-debug/compiler/pota-quantization-value-test
1/1 Test #2: pota_record_minmax_test .......... Passed 12.31 sec
100% tests passed, 0 tests failed out of 1
If I check, for example, Add_002.layer.uint8_record_minmax.log, I find:
Using parallel recording
So it was running in parallel recording mode.
Please prepare a full DRAFT to show all the changes
@BalyshevArtem Ah sorry. I misused the API (I just gave --parallel_record, but the correct interface was --parallel_record true). The test passed in my environment too. Thanks.
I just gave --parallel_record, but the correct interface was --parallel_record true
It would be better to check whether the value of --parallel_record is valid; if only true is valid, then throw an error for any other value.
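A minimal sketch of such a strict flag check (the function name and error message here are hypothetical, not taken from the actual record-minmax CLI code):

```cpp
#include <stdexcept>
#include <string>

// Parse the value passed to --parallel_record, accepting only the strings
// "true" and "false"; any other value is rejected with an error.
bool parse_parallel_record(const std::string &value)
{
  if (value == "true")
    return true;
  if (value == "false")
    return false;
  throw std::runtime_error("Invalid value for --parallel_record: '" + value +
                           "' (expected 'true' or 'false')");
}
```

This way a typo like --parallel_record ture fails loudly instead of silently falling back to sequential recording.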
record-minmax uses lots of modules, and I am not sure all of these modules are thread-safe.
At least, I do NOT guarantee that the modules are thread-safe.
@jinevening , are you sure you want to go with this feature?
are you sure you want to go with this feature?
Yes. This feature will give a huge benefit in reducing quantization time. But as it is not tested enough, --parallel will not be a default option (at least for a while).
I understand your concern about thread-safety, but IIUC not all modules have to be thread-safe. In the draft, only the below code is executed in parallel.
// Distribute records across OpenMP threads; each thread uses its own
// interpreter instance, indexed by its thread number, so interpreter
// state is never shared between threads.
#pragma omp parallel for
for (uint32_t record_idx = 0; record_idx < num_records; record_idx++)
{
  const auto current_thread_num = omp_get_thread_num();
  for (int32_t input_idx = 0; input_idx < num_inputs; input_idx++)
  {
    const auto *input_node = loco::must_cast<const luci::CircleInput *>(input_nodes[input_idx]);
    const auto &cur_input_data = vector_input_data[record_idx][input_idx];
    _interpreters[current_thread_num]->writeInputTensor(input_node, cur_input_data.data(),
                                                        cur_input_data.size());
  }
  _interpreters[current_thread_num]->interpret();
} // #pragma omp parallel for
IIUC, the above code is safe unless multiple luci-interpreter instances have shared state (e.g., a global variable). I don't remember such code in luci-interpreter.
@binarman @BalyshevArtem Could you give an opinion about the thread-safety of luci-interpreter? It is not easy to guarantee, so just an opinion is enough.
@binarman @BalyshevArtem Could you give an opinion about the thread-safety of luci-interpreter? It is not easy to guarantee, so just an opinion is enough.
In my opinion, yes, luci-interpreter is thread-safe, since writing only happens to tensors and observers that are unique to each interpreter.
@jinevening In my opinion, our internal luci-interpreter code itself is thread-safe. But there is some code from TensorFlow (and maybe other libraries) which we do not fully control and have not checked. There is a slight chance that this code is not thread-safe.
I think we can mitigate the risks by adding tests with a thread sanitizer (GCC also has this sanitizer).
But there is some code from TensorFlow (and maybe other libraries) which we do not fully control and have not checked. There is a slight chance that this code is not thread-safe.
Yes, I'm worried about this case. Particularly, we didn't check kernels for all Ops are thread-safe.
I think we can mitigate the risks by adding tests with a thread sanitizer (GCC also has this sanitizer)
Looks like a good idea. We may need tests, e.g., record-minmax-thread-safety-test, which runs record-minmax --parallel with a thread sanitizer on various models (e.g., models in common-artifacts). @BalyshevArtem Can you (or other ones in SRR) add the test in the draft before landing the feature? This task is not urgent, so you can take time.
Can you (or other ones in SRR) add the test in the draft before landing the feature?
I will add the test :)
Looks like a good idea. We may need tests, e.g., record-minmax-thread-safety-test, which runs record-minmax --parallel with a thread sanitizer on various models (e.g., models in common-artifacts).
I added this test, see https://github.com/Samsung/ONE/pull/10132. While implementing it, I found that the thread sanitizer produces a lot of false positives with OpenMP. So I rewrote the code from OpenMP to std::thread, and now it works fine.
What
Let's add parallelism to record-minmax with the help of OpenMP.
Why
To speed up recording
How
Let's add parallel recording for the different input data formats, with the following plan:
- h5 input_data_format (#10132)
- list input_data_format
- directory input_data_format
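The per-format dispatch implied by the plan above could be sketched like this (the enum and function names are hypothetical illustrations, not the actual record-minmax code):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical input data formats that parallel recording would support,
// matching the three items in the plan.
enum class DataFormat
{
  H5,
  List,
  Directory
};

// Map the input_data_format string to the enum, rejecting unknown values
// so unsupported formats fail early instead of misbehaving at runtime.
DataFormat parse_input_data_format(const std::string &name)
{
  if (name == "h5")
    return DataFormat::H5;
  if (name == "list")
    return DataFormat::List;
  if (name == "directory")
    return DataFormat::Directory;
  throw std::runtime_error("Unsupported input_data_format: " + name);
}
```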