ales-t / rjp

Rapid JSON-lines processor
Apache License 2.0
3 stars 0 forks source link

Way to approach paralelization #6

Open zouharvi opened 2 years ago

zouharvi commented 2 years ago

There are multiple ways in which we could do paralelization. Assuming that each line can be processed independently (which may not be true in the future if we make more processors), then with workers A B C the paralelization may look as follows (assume 12 lines):

  1. ABC ABC ABC ABC each worker is fed exactly one line and when all three are finished, they are sent to the buffer. This is the easiest solution but may not even be worth it given that thread spawning is not zero cost.
  2. AABBCC AABBCC another step is to increase the number of lines fed to each worker (here 2). There is certainly a threshold number from which it becomes beneficial to use threading (that number probably being larger than 1).
  3. AAA BBB CCC this chunks the whole data into equal parts and feeds them to the workers. It creates the largest chunks and is the fastest from paralelization perspective but requires that the whole data is first loaded into memory which may not be viable for large files. (In a sense this is an extreme case of 3)
  4. ABC AAABB CAC this is the most complicated solution (though probably the fastest one). It considers the processing dynamically and e.g. if A is finished and B and C are working on some long and difficult lines, then A can get next lines assigned.

I think that 2 is the best option because it can be parametrized via two parameters from the CLI. Maybe I missed some approach.

ales-t commented 2 years ago

Assuming that each line can be processed independently (which may not be true in the future if we make more processors)

This is not true already, the merge processor requires the order of input lines to be the same as the order in the "stream to merge". It could be addressed by keeping line/record indices but would make things a bit more complicated.

Regarding the best approach to parallelization -- first of all, I think we should try to gauge the benefits with some benchmarks. One relatively easy way would be to use GNU parallel --pipe --keep-order, and experiment with block sizes.

When I was thinking about parallelization directly in rjp, I had a vague idea like this;

I guess that would correspond to your solution 2 (or maybe 4?). My estimate is that the most complicated parts would be: