[record-minmax] Add parallel recording

BalyshevArtem commented 1 year ago

What

Let's add parallelism to record-minmax with help of OpenMP.

Why

To speed up recording

How

Let's add parallel recording for different input data format with plan:

[x] add h5 input_data_format #10132
[ ] add list input_data_format
[ ] add directory input_data_format
[ ] add for random inputs

BalyshevArtem commented 1 year ago

with #10132 for mobilenet_v1_1.0_224.circle.txt and for random h5 file with 50 num records I had the following result (host has 8 threads)(using ./time):

without parallel recording:

real    8m54,355s
user    8m54,286s
sys 0m0,048s

with parallel recording:

real    1m30,052s
user    10m8,754s
sys 0m0,671s

jinevening commented 1 year ago

Looks very good feature! Before landing this feature, we need some tests.

Option to turn off OpenMP for supporting other platforms e.g., Arm. @mhs4670go already left a comment. PTAL https://github.com/Samsung/ONE/pull/10132#issuecomment-1327465549.
Value check. Min/max values recorded in sequential/parallel settings must be the same. The easiest way to add a test would be to add pota_parallel_record_minmax_test in pota-quantization-value-test.

FYI, I forked https://github.com/BalyshevArtem/ONE/tree/parallel_record_minmax and ran pota-quantization-value-test with parallel_record (by modifing test_record_minmax.sh), but it failed.

104/111 Test  #98: pota_record_minmax_test ....................***Failed    5.19 sec
-- Found RECORD-MINMAX: /home/hjj/ONE/build/release/compiler/record-minmax/record-minmax
-- Found CIRCLE_TENSORDUMP: /home/hjj/ONE/build/release/compiler/circle-tensordump/circle-tensordump
-- Found workdir: /home/hjj/ONE/build/release/compiler/common-artifacts
~/ONE/build/release/compiler/common-artifacts ~/ONE/build/release/compiler/pota-quantization-value-test
~/ONE/build/release/compiler/pota-quantization-value-test
FAILED
- Add_002.layer.uint8
- Add_002.channel.int16
...

BalyshevArtem commented 1 year ago

FYI, I forked https://github.com/BalyshevArtem/ONE/tree/parallel_record_minmax and ran pota-quantization-value-test with parallel_record (by modifing test_record_minmax.sh), but it failed

Hmmm, I checked it now locally and it passed

artem_bal@a-balyshev02:~/ONE/infra/nncc/cmake-build-debug/compiler/pota-quantization-value-test$ ctest -R pota_record_minmax_test

Test project /home/artem_bal/ONE/infra/nncc/cmake-build-debug/compiler/pota-quantization-value-test

1/1 Test #2: pota_record_minmax_test ..........   Passed   12.31 sec

100% tests passed, 0 tests failed out of 1

if I checked for example Add_002.layer.uint8_record_minmax.log I found:

Using parallel recording

So, it was parallel recording mode.

seanshpark commented 1 year ago

Please prepare a full DRAFT to show all the changes

jinevening commented 1 year ago

@BalyshevArtem Ah sorry. I misused the API (I just gave --parallel_record, but the correct interface was --parallel_record true). The test passed in my environment too. Thanks.

seanshpark commented 1 year ago

I just gave --parallel_record, but the correct interface was --parallel_record true

I would be better to check value of --parallel_record is valid or not; if only true is valid then if not true then throw error.

seanshpark commented 1 year ago

record-minnmax uses lots of modules and I am not sure all these modules are thread-safe. at least I do NOT garentee the modules are thread-safe.

@jinevening , are you sure you want to go with this feature?

jinevening commented 1 year ago

are you sure you want to go with this feature?

Yes. This feature will give a huge benefit in reducing quantization time. But it is not enough tested, --parallel will not be a default option (at least for a while).

I understand your concern about thread-safety, but IIUC not all modules have to be thread-safe. In the draft, only the below code is executed in parallel.

#pragma omp parallel for
  for (uint32_t record_idx = 0; record_idx < num_records; record_idx++)
  {
    const auto current_thread_num = omp_get_thread_num();
    for (int32_t input_idx = 0; input_idx < num_inputs; input_idx++)
    {
      const auto *input_node = loco::must_cast<const luci::CircleInput *>(input_nodes[input_idx]);

      const auto &cur_input_data = vector_input_data[record_idx][input_idx];
      _interpreters[current_thread_num]->writeInputTensor(input_node, cur_input_data.data(),
                                                          cur_input_data.size());
    }
    _interpreters[current_thread_num]->interpret();
  } // #pragma omp parallel

IIUC the above code is safe unless multiple luci-interpreter instances have a shared state (e.g., global variable). I don't remember such a code in luci-interpreter.

@binarman @BalyshevArtem Could you give an opinion about the thread-safety of luci-interpreter? It is not easy to guarantee, so just an opinion is enough.

BalyshevArtem commented 1 year ago

@binarman @BalyshevArtem Could you give an opinion about the thread-safety of luci-interpreter? It is not easy to guarantee, so just an opinion is enough.

In my opinion, yes, lucy-interpreter is thread-safety, since writing only happens to tensors and observers that are unique to each interpreter.

binarman commented 1 year ago

@jinevening In my opinion our internal luci-interpreter code itself is thread safe. But there are some code from tensorflow (and maybe other libraries), which we did not fully control and check. There is a slight chance that this code could be not thread-safe.

I think we mitigate risks by adding tests with thread sanitizer (gcc also has this sanitizer)

jinevening commented 1 year ago

But there are some code from tensorflow (and maybe other libraries), which we did not fully control and check. There is a slight chance that this code could be not thread-safe.

Yes, I'm worried about this case. Particularly, we didn't check kernels for all Ops are thread-safe.

I think we mitigate risks by adding tests with thread sanitizer (gcc also has this sanitizer)

Looks a good idea. We may need tests, e.g., record-minmax-thread-safety-test which runs record-minmax --parallel with a thread sanitizer on various models (e.g., models in common-artifacts). @BalyshevArtem Can you (or other ones in SRR) add the test in the draft before landing the feature? This task is not urgent so you can take time.

BalyshevArtem commented 1 year ago

Can you (or other ones in SRR) add the test in the draft before landing the feature?

I will add the test :)

BalyshevArtem commented 1 year ago

Looks a good idea. We may need tests, e.g., record-minmax-thread-safety-test which runs record-minmax --parallel with a thread sanitizer on various models (e.g., models in common-artifacts).

I added this test, see https://github.com/Samsung/ONE/pull/10132. During realization I found that thread sanitizer with OpenMP have a lot of falls positive results. So I rewrote it from OpenMP to std::thread and now it works fine.

Samsung / ONE

[record-minmax] Add parallel recording #10133

What

Why

How