cfpb / grasshopper-parser

Address Parsing REST API
Creative Commons Zero v1.0 Universal

Streaming Batch Parsing? #34

Open hkeeler opened 8 years ago

hkeeler commented 8 years ago

We can probably get a significant performance boost if we support streaming of both the request and the response of the batch parser. To do this we would need to switch to a streaming JSON parser like ijson. This would allow us to start parsing addresses as soon as the first one arrives, rather than waiting for the full JSON message. It would also allow for a much smaller memory footprint, and the larger the message, the greater the benefit.
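As a rough sketch of the request side: ijson's `items()` function yields each element under a given prefix as soon as it has been read. The `{"addresses": [...]}` body shape and the `iter_addresses` name are assumptions for illustration, not the parser's confirmed request format.

```python
import io


def iter_addresses(body):
    """Yield address strings one at a time from a binary file-like JSON body.

    Assumes (hypothetically) a request shaped like {"addresses": ["...", "..."]}.
    """
    # Third-party dependency (pip install ijson), imported lazily so this
    # module still loads where ijson is not installed.
    import ijson

    # "addresses.item" selects each element of the top-level "addresses" array.
    for address in ijson.items(body, "addresses.item"):
        yield address


if __name__ == "__main__":
    # In a real handler, body would be the request's input stream.
    body = io.BytesIO(b'{"addresses": ["1311 30th St NW", "200 Main St"]}')
    for address in iter_addresses(body):
        print(address)
```

Because `iter_addresses` is a generator, the caller can start parsing the first address before the rest of the body has been read.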

I also just realized that the new batch response format is not stream-friendly. Since I've split the message into top-level parsed and failed arrays, the full message has to be built before the response can be sent. If we want to do response streaming, we'll probably want to refactor that: add a status (success/fail) to each parse attempt and return the results in the order they were received. The overall message would be larger, but streamable.
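A minimal sketch of what a streamable response could look like: one element per parse attempt, each carrying its own status, emitted in input order. The field names (`status`, `input`, `parts`, `error`) and the `toy_parse` stand-in are hypothetical, not the current response format or the real parser.

```python
import json
from typing import Callable, Iterable, Iterator


def stream_results(addresses: Iterable[str],
                   parse: Callable[[str], dict]) -> Iterator[str]:
    """Yield chunks of a JSON array, one element per parse attempt, in input order."""
    yield "["
    for i, address in enumerate(addresses):
        try:
            item = {"status": "success", "input": address, "parts": parse(address)}
        except ValueError as exc:
            item = {"status": "fail", "input": address, "error": str(exc)}
        # Each element is complete on its own, so it can be sent immediately.
        yield ("," if i else "") + json.dumps(item)
    yield "]"


def toy_parse(address: str) -> dict:
    """Hypothetical stand-in for the real parser: requires a leading house number."""
    number, _, street = address.partition(" ")
    if not number.isdigit():
        raise ValueError("no house number")
    return {"number": number, "street": street}


if __name__ == "__main__":
    for chunk in stream_results(["12 Main St", "Main St"], toy_parse):
        print(chunk)
```

Joining the chunks gives a single valid JSON array, so clients that do not stream can still read the whole body with an ordinary JSON parser.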

And finally, this streaming approach only helps with the I/O overhead. Most of the time spent on a request is spent on CPU-bound parsing of the addresses. If we really want to chew through a lot of addresses, we could use concurrent.futures.ProcessPoolExecutor, which gives a nice abstraction for concurrent processing and uses processes instead of threads, so we're not limited by Python's GIL.
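Something like the following could work for the CPU-bound part. `parse_one` is a hypothetical stand-in for the real parser, and the worker count and chunk size are placeholder values that would need tuning.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List


def parse_one(address: str) -> dict:
    """Hypothetical stand-in for the CPU-bound address parser.

    Must be a module-level function so it can be pickled for worker processes.
    """
    return {"input": address, "tokens": address.upper().split()}


def parse_batch(addresses: Iterable[str], workers: int = 2,
                chunksize: int = 64) -> List[dict]:
    """Parse addresses across worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order; chunksize batches the
        # work items to reduce inter-process communication overhead.
        return list(pool.map(parse_one, addresses, chunksize=chunksize))


if __name__ == "__main__":
    # The guard matters: on platforms that spawn workers, each child
    # re-imports this module and must not start a pool of its own.
    for result in parse_batch(["1 main st", "22 oak ave"], chunksize=1):
        print(result)
```

Since `map()` keeps results in input order, this would combine naturally with a response format that returns parse attempts in the order they were received.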

This is all premature optimization at this point, but I wanted to get it down before I forget.