geddy / model

Datastore-agnostic ORM in JavaScript
265 stars 55 forks source link

PG Adapter Presently Only Fakes Streaming #251

Open danfinlay opened 9 years ago

danfinlay commented 9 years ago

I was processing a pretty big dataset the other day, and naturally was feeling pretty grateful that Model supports streaming results, until der_On and I discovered:

The PG Adapter Only Fakes Streaming!

My shocked face

It inherits from EventedQueryProcessor, which inherits from an EventEmitter, and only fakes a Stream object, but without the critical back-pressure handling and memory-efficiency of actual streams!

You can see an example of how this is presently implemented here.

One good reason this may have been done originally, is that the core PG module does not support streams.

Fortunately, there is a module, node-pg-query-stream that does exactly what we need here.

I'm adding this issue to help guide efforts at adding proper streaming to the PG adapter, and am going to be taking a shot myself, although this will definitely be the deepest I've ever delved into Model, so no promises! (Only callbacks)

mde commented 9 years ago

So excited to see this happening!

danfinlay commented 9 years ago

Does the two-year-old streams branch have tests that might be useful here?

mde commented 9 years ago

I believe the only streaming tests we have are here: https://github.com/geddy/model/blob/master/test/integration/adapters/streaming.js

danfinlay commented 9 years ago

Since that's the base sql adapter using the EventedQueryProcessor, it's looking to me like the mysql adapter isn't using using real streams either, even though the Mysql module natively supports streams.

One thing that seems a little tricky to me right now is how to swap out the individual database's streaming mechanisms.

I've also always wondered where exactly the SQL responses are converted into Model objects. Seems like it would be pretty cool to set up a through stream that converts the SQL responses into Model objects with no extra work...

danfinlay commented 9 years ago

Converting to real streams with the current architecture is a little strange, posting my thoughts here so others could chime in:

There is currently an innerProcessor that initially makes a minimal query to retrieve an array of IDs before querying for full models and associations. My current considerations center around whether this is necessary for a streaming adapter, and if so, how should it properly be implemented.

While a simple solution would be to make the initial query non-streaming, and then stream back the full results, I feel like if we're going to do streaming, we should do it really properly, and this means making no assumptions like "The array of IDs is a reasonable size to return without streaming and hold in memory".

One way would be for the initial query to stream back the IDs, in which case we have to decide how frequently to make the larger association requests.

My hope is that the initial query isn't necessary if using true streams. Curious for someone who is familiar with this pre-query to chime in on its purpose. I swear you could do most of these sql queries without a pre-query, but I may be missing something.

danfinlay commented 9 years ago

Oh I see now: JOINs don't work with limit/offset.

So basically we'd have to implement both ways:

Non-limiting/offsetting streaming requests can have a direct stream, but requests with limit or offset need IDs first, and so the question of "how often to query for subsequent results" remains.

It almost makes me think it could be a config variable or request option, how many records to fetch at a time when streaming with a limit or offset. That's a pretty obscure variable, but I wouldn't want to bury it as an assumption...

danfinlay commented 9 years ago

What if instead of making a pre-query, we made a subquery?

I'm pretty sure both Postgres and MySQL support this:

Instead of requesting IDs, then making a second request with an IN clause to that array, you simply include that first query inside the IN clause to begin with. There might be an extra step to select the id field from that returned table, if you had to request other values to filter/sort on.