Open danfinlay opened 9 years ago
So excited to see this happening!
Does the two-year-old streams
branch have tests that might be useful here?
I believe the only streaming tests we have are here: https://github.com/geddy/model/blob/master/test/integration/adapters/streaming.js
Since that's the base sql adapter using the EventedQueryProcessor
, it's looking to me like the mysql
adapter isn't using using real streams either, even though the Mysql module natively supports streams.
One thing that seems a little tricky to me right now is how to swap out the individual database's streaming mechanisms.
I've also always wondered where exactly the SQL responses are converted into Model
objects. Seems like it would be pretty cool to set up a through stream that converts the SQL responses into Model objects with no extra work...
Converting to real streams with the current architecture is a little strange, posting my thoughts here so others could chime in:
There is currently an innerProcessor that initially makes a minimal query to retrieve an array of IDs before querying for full models and associations. My current considerations center around whether this is necessary for a streaming adapter, and if so, how should it properly be implemented.
While a simple solution would be to make the initial query non-streaming, and then stream back the full results, I feel like if we're going to do streaming, we should do it really properly, and this means making no assumptions like "The array of IDs is a reasonable size to return without streaming and hold in memory".
One way would be for the initial query to stream back the IDs, in which case we have to decide how frequently to make the larger association requests.
My hope is that the initial query isn't necessary if using true streams. Curious for someone who is familiar with this pre-query to chime in on its purpose. I swear you could do most of these sql queries without a pre-query, but I may be missing something.
Oh I see now: JOINs don't work with limit/offset
.
So basically we'd have to implement both ways:
Non-limiting/offsetting streaming requests can have a direct stream, but requests with limit or offset need IDs first, and so the question of "how often to query for subsequent results" remains.
It almost makes me think it could be a config variable or request option, how many records to fetch at a time when streaming with a limit or offset. That's a pretty obscure variable, but I wouldn't want to bury it as an assumption...
What if instead of making a pre-query, we made a subquery?
I'm pretty sure both Postgres and MySQL support this:
Instead of requesting IDs, then making a second request with an IN
clause to that array, you simply include that first query inside the IN
clause to begin with. There might be an extra step to select the id field from that returned table, if you had to request other values to filter/sort on.
I was processing a pretty big dataset the other day, and naturally was feeling pretty grateful that Model supports streaming results, until der_On and I discovered:
The PG Adapter Only Fakes Streaming!
It inherits from
EventedQueryProcessor
, which inherits from anEventEmitter
, and only fakes a Stream object, but without the critical back-pressure handling and memory-efficiency of actual streams!You can see an example of how this is presently implemented here.
One good reason this may have been done originally, is that the core PG module does not support streams.
Fortunately, there is a module, node-pg-query-stream that does exactly what we need here.
I'm adding this issue to help guide efforts at adding proper streaming to the PG adapter, and am going to be taking a shot myself, although this will definitely be the deepest I've ever delved into
Model
, so no promises! (Only callbacks)