Way to approach paralelization

ales-t / rjp

Rapid JSON-lines processor

Apache License 2.0

3 stars 0 forks source link

There are multiple ways in which we could do paralelization. Assuming that each line can be processed independently (which may not be true in the future if we make more processors), then with workers A B C the paralelization may look as follows (assume 12 lines):

ABC ABC ABC ABC each worker is fed exactly one line and when all three are finished, they are sent to the buffer. This is the easiest solution but may not even be worth it given that thread spawning is not zero cost.
AABBCC AABBCC another step is to increase the number of lines fed to each worker (here 2). There is certainly a threshold number from which it becomes beneficial to use threading (that number probably being larger than 1).
AAA BBB CCC this chunks the whole data into equal parts and feeds them to the workers. It creates the largest chunks and is the fastest from paralelization perspective but requires that the whole data is first loaded into memory which may not be viable for large files. (In a sense this is an extreme case of 3)
ABC AAABB CAC this is the most complicated solution (though probably the fastest one). It considers the processing dynamically and e.g. if A is finished and B and C are working on some long and difficult lines, then A can get next lines assigned.

I think that 2 is the best option because it can be parametrized via two parameters from the CLI. Maybe I missed some approach.

Assuming that each line can be processed independently (which may not be true in the future if we make more processors)

This is not true already, the merge processor requires the order of input lines to be the same as the order in the "stream to merge". It could be addressed by keeping line/record indices but would make things a bit more complicated.

Regarding the best approach to parallelization -- first of all, I think we should try to gauge the benefits with some benchmarks. One relatively easy way would be to use GNU parallel --pipe --keep-order, and experiment with block sizes.

When I was thinking about parallelization directly in rjp, I had a vague idea like this;

One I/O thread reads input lines into batches, putting them in an mpsc::channel. Each sentence in the batch has an absolute index associated with it (its position in the input stream).
Workers fetch new batches from the channel when they finish their current batch. There is an output mpsc::channel for collecting the processed data.
An output thread pulls processed batches from the output channel, sorting them as needed.

I guess that would correspond to your solution 2 (or maybe 4?). My estimate is that the most complicated parts would be:

Handling bookkeeping and order of sentences correctly.
Converting the merge processor to support this. As a starting solution, we could simply die with a warning when the user tries to use both parallelism and merge.

ales-t / rjp

Way to approach paralelization #6