Closed nishanthpp93 closed 2 years ago
Thanks again for working on this! I'm working on merging it now; for auditing/transparency, I will append comments on the individual items below. Nothing is final, so please feel free to DM me on Slack and dispute if you disagree with the reasoning.
I am reverting this particular change because it is actually counterproductive. The method in question tries to reshuffle batches evenly between workers. While that sounds good on paper, in practice it drastically increases memory load: our batching strategy already spreads items out (roughly) evenly via pagination and offsets. In other words, the output is already parallelized (that is the whole point of batching/pagination and reading one page per thread), and since the page size is the same for every worker, the per-thread item counts are already even. All this method does is try to re-even a distribution that is even by construction.
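To illustrate the point, here is a minimal sketch of the pagination strategy described above. The table name, column names, and `PAGE_SIZE` are hypothetical, and this uses `sqlite3` only as a stand-in for whatever database we actually query; it just shows that equal page sizes already give each worker an equal share, so no rebalancing step is needed:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 100  # hypothetical page size; identical for every worker


def read_page(db_path, page):
    # Each worker reads exactly one page via LIMIT/OFFSET, so the number
    # of items per thread is already (roughly) equal by construction.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT id, payload FROM items LIMIT ? OFFSET ?",
            (PAGE_SIZE, page * PAGE_SIZE),
        )
        return cur.fetchall()
    finally:
        conn.close()


def read_all(db_path, num_pages, workers=4):
    # One page per thread; only the final page can be short.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: read_page(db_path, p), range(num_pages)))
```

Every page except possibly the last has exactly `PAGE_SIZE` rows, so a post-hoc rebalancing pass can only shuffle items around without improving the distribution.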
Additionally, on some runners and depending on individual configuration (particularly in batch/non-streaming mode), this can force the entire ResultSet to be loaded into memory, since shuffling requires having the full set in memory first (you can't shuffle what you haven't loaded), which is not feasible for reasons that may be obvious. Your case is slightly different since you are using streaming, but I will address why we are changing that as well in a future comment.
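For clarity on the batch-vs-streaming distinction above, here is a hedged sketch (again using `sqlite3` and a hypothetical `items` table purely as a stand-in): the batch path materializes the whole ResultSet at once, which is what a shuffle would require, while the streaming path holds one row at a time and therefore cannot shuffle across the full set without first buffering everything:

```python
import sqlite3


def batch_read(conn):
    # Batch / non-streaming mode: fetchall() pulls the entire ResultSet
    # into memory at once. This is what a global shuffle needs, and it is
    # exactly what becomes infeasible for large tables.
    return conn.execute("SELECT id FROM items").fetchall()


def stream_read(conn):
    # Streaming mode: iterate the cursor, holding one row at a time.
    # Memory stays flat, but a shuffle over the whole set is impossible
    # without buffering it all first (defeating the point of streaming).
    for row in conn.execute("SELECT id FROM items"):
        yield row
```

The two return the same rows; the difference is purely in peak memory, which is why a shuffle step silently converts the streaming path back into the batch path.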
Thanks Andrew! Your points are valid and I agree. Feel free to make the changes.