Closed — aogier closed this issue 5 years ago
I really think we should be streaming the HTTP request using `stream=True` (see the requests docs):

```python
r_session.get(url, headers=headers, params=f_params, stream=True)
```
Then, we'd read the data like this:

```python
import json

@staticmethod
def __iter_rows(response):
    for line in response.iter_lines():
        line = line.decode('utf-8')
        # Each row of the response body sits on its own line starting
        # with {"id": ...}, so strip the trailing comma separating rows
        # and parse each line as standalone JSON.
        if line.startswith('{"id":'):
            yield json.loads(line.rstrip(','))

def __iter__(self):
    ...
    skip = 0
    while True:
        response = self._ref(
            limit=self._page_size,
            skip=skip,
            stream=True,
            **self.options
        )
        skip += self._page_size
        for row in self.__iter_rows(response):
            yield row
```
That would avoid reading the entire response into memory. One for another day though!
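To make the parsing trick above concrete, here is a standalone sketch (not the library's actual code): the same line-oriented filter applied to a CouchDB-style `_all_docs` body. The `iter_rows` helper and the sample payload are invented for illustration; `response.iter_lines()` would supply the lines in the real case.

```python
import json

def iter_rows(lines):
    """Yield one parsed row per line, skipping the JSON envelope.

    CouchDB emits each row of an _all_docs response on its own line
    starting with {"id":, so rows can be parsed one at a time instead
    of loading the whole body with response.json().
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        if line.startswith('{"id":'):
            # Rows are comma-separated inside the "rows" array; drop the
            # trailing comma so each line is valid JSON on its own.
            yield json.loads(line.rstrip(','))

# Simulated body, line by line, as iter_lines() would produce it:
body = [
    b'{"total_rows":2,"offset":0,"rows":[',
    b'{"id":"a","key":"a","value":{"rev":"1-x"}},',
    b'{"id":"b","key":"b","value":{"rev":"1-y"}}',
    b']}',
]
rows = list(iter_rows(body))
```

Only one decoded line and one parsed row are alive at any moment; the envelope lines are discarded without ever being parsed.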
That would be great, and I like the parsing trick too. The only issue I can think of is when a relatively long operation takes place on the received items: could the request time out during the download? That should be investigated (at least by me, since I don't know the answer). It's a nice idea anyway; I'd like to have this feature in some of my jobs!
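On the timeout question: requests only times out where you ask it to. A sketch, assuming a hypothetical `stream_all_docs` helper and illustrative timeout values — the key point is that the read timeout bounds the wait for each chunk from the server, not the total transfer, so a consumer that is slow between chunks generally won't trip it while data keeps arriving; the real risk is a server-side idle timeout closing the connection.

```python
import requests

def stream_all_docs(url, headers=None, params=None, timeout=(3.05, 27)):
    """Open a streamed GET with an explicit (connect, read) timeout.

    The read timeout (second element) applies to each socket read, not
    to the whole download, so long client-side processing between
    iter_lines() chunks does not by itself trigger it; unread data
    simply buffers in the OS socket buffer until it fills.
    """
    return requests.get(url, headers=headers, params=params,
                        stream=True, timeout=timeout)
```

Whether a long per-item operation causes trouble therefore depends mostly on the server's own idle/connection timeouts, which is worth testing against the actual CouchDB deployment.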
Checklist

- [ ] CHANGES.md / CHANGELOG.md updated, or test/build only changes

Description
Working on #437 I noticed this one, which is trivial but effectively halves memory utilization during iteration. Since we release the RAM before `yield`, users are free to use their memory for whatever they want (I've seen people/comments in the wild talking about documents over a hundred MB in size). We could go further and implement a FIFO on the iterator three lines below, e.g. with a deque that pops while iterating, but this change alone already effectively halves RAM usage.
Schema & API Changes
Security and Privacy
Testing
Monitoring and Logging