Closed nishanthpp93 closed 2 years ago
Thanks again for working on this! I'm working on merging it now; for auditing/transparency, I will append comments on the individual items below. Nothing is final, so please feel free to DM me on Slack and dispute if you disagree with the reasoning.
I am reverting this particular change because it is actually counterproductive. The method in question tries to reshuffle batches evenly between workers. While that sounds good on paper, in practice it drastically increases memory load: our batching strategy already spreads items out (roughly) evenly via pagination and offsets. In other words, the output is already parallelized (that is the whole point of batching/pagination and reading one page per thread), and since the page size is the same for every worker, the per-thread item counts are already even. All this method does is try to re-even a distribution that is even by construction.
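To illustrate the point, here is a minimal sketch of the pagination strategy described above. The table name, column names, and `PAGE_SIZE` are hypothetical, and this uses `sqlite3` only as a stand-in for whatever database we actually query; it just shows that equal page sizes already give each worker an equal share, so no rebalancing step is needed:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 100  # hypothetical page size; identical for every worker


def read_page(db_path, page):
    # Each worker reads exactly one page via LIMIT/OFFSET, so the number
    # of items per thread is already (roughly) equal by construction.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT id, payload FROM items LIMIT ? OFFSET ?",
            (PAGE_SIZE, page * PAGE_SIZE),
        )
        return cur.fetchall()
    finally:
        conn.close()


def read_all(db_path, num_pages, workers=4):
    # One page per thread; only the final page can be short.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: read_page(db_path, p), range(num_pages)))
```

Every page except possibly the last has exactly `PAGE_SIZE` rows, so a post-hoc rebalancing pass can only shuffle items around without improving the distribution.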
Additionally, on some runners and depending on individual configuration (particularly in batch/non-streaming mode), this can force the entire ResultSet to be loaded into memory, since shuffling requires having the full set in memory first (you can't shuffle what you haven't loaded), which is not feasible for reasons that may be obvious. Your case is slightly different since you are using streaming, but I will address why we are changing that as well in a future comment.
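For clarity on the batch-vs-streaming distinction above, here is a hedged sketch (again using `sqlite3` and a hypothetical `items` table purely as a stand-in): the batch path materializes the whole ResultSet at once, which is what a shuffle would require, while the streaming path holds one row at a time and therefore cannot shuffle across the full set without first buffering everything:

```python
import sqlite3


def batch_read(conn):
    # Batch / non-streaming mode: fetchall() pulls the entire ResultSet
    # into memory at once. This is what a global shuffle needs, and it is
    # exactly what becomes infeasible for large tables.
    return conn.execute("SELECT id FROM items").fetchall()


def stream_read(conn):
    # Streaming mode: iterate the cursor, holding one row at a time.
    # Memory stays flat, but a shuffle over the whole set is impossible
    # without buffering it all first (defeating the point of streaming).
    for row in conn.execute("SELECT id FROM items"):
        yield row
```

The two return the same rows; the difference is purely in peak memory, which is why a shuffle step silently converts the streaming path back into the batch path.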
Thanks Andrew! Your points are valid and I agree. Feel free to make the changes.