Closed — aogier closed this issue 5 years ago
I really think we should be streaming the HTTP request using `stream=True` (see the requests docs):

```python
r_session.get(url, headers=headers, params=f_params, stream=True)
```
Then, we'd read the data like this:

```python
import json

@staticmethod
def __iter_rows(response):
    for line in response.iter_lines():
        line = line.decode('utf-8')
        # Each row of the response body sits on its own line starting
        # with {"id": ...}, so strip the trailing comma separating rows
        # and parse each line as standalone JSON.
        if line.startswith('{"id":'):
            yield json.loads(line.rstrip(','))

def __iter__(self):
    ...
    skip = 0
    while True:
        response = self._ref(
            limit=self._page_size,
            skip=skip,
            stream=True,
            **self.options
        )
        skip += self._page_size
        for row in self.__iter_rows(response):
            yield row
```
That would avoid reading the entire response into memory. One for another day though!
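To make the parsing trick above concrete, here is a standalone sketch (not the library's actual code): the same line-oriented filter applied to a CouchDB-style `_all_docs` body. The `iter_rows` helper and the sample payload are invented for illustration; `response.iter_lines()` would supply the lines in the real case.

```python
import json

def iter_rows(lines):
    """Yield one parsed row per line, skipping the JSON envelope.

    CouchDB emits each row of an _all_docs response on its own line
    starting with {"id":, so rows can be parsed one at a time instead
    of loading the whole body with response.json().
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        if line.startswith('{"id":'):
            # Rows are comma-separated inside the "rows" array; drop the
            # trailing comma so each line is valid JSON on its own.
            yield json.loads(line.rstrip(','))

# Simulated body, line by line, as iter_lines() would produce it:
body = [
    b'{"total_rows":2,"offset":0,"rows":[',
    b'{"id":"a","key":"a","value":{"rev":"1-x"}},',
    b'{"id":"b","key":"b","value":{"rev":"1-y"}}',
    b']}',
]
rows = list(iter_rows(body))
```

Only one decoded line and one parsed row are alive at any moment; the envelope lines are discarded without ever being parsed.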
That would be great, and I like the parsing trick too. The only issue I can think of is when a relatively long operation takes place on the received items: could the request time out during the download? That should be investigated (at least by me, since I don't know the answer). It's a nice idea anyway; I'd like to have this feature in some of my jobs!
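On the timeout question: requests only times out where you ask it to. A sketch, assuming a hypothetical `stream_all_docs` helper and illustrative timeout values — the key point is that the read timeout bounds the wait for each chunk from the server, not the total transfer, so a consumer that is slow between chunks generally won't trip it while data keeps arriving; the real risk is a server-side idle timeout closing the connection.

```python
import requests

def stream_all_docs(url, headers=None, params=None, timeout=(3.05, 27)):
    """Open a streamed GET with an explicit (connect, read) timeout.

    The read timeout (second element) applies to each socket read, not
    to the whole download, so long client-side processing between
    iter_lines() chunks does not by itself trigger it; unread data
    simply buffers in the OS socket buffer until it fills.
    """
    return requests.get(url, headers=headers, params=params,
                        stream=True, timeout=timeout)
```

Whether a long per-item operation causes trouble therefore depends mostly on the server's own idle/connection timeouts, which is worth testing against the actual CouchDB deployment.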
Checklist

- [ ] CHANGES.md / CHANGELOG.md updated, or test/build only changes

Description
Working on #437 I noticed this one, which is trivial but effectively halves memory utilization during iteration. Since we release the RAM before `yield`, users are free to use their memory for whatever they want (I've seen people/comments in the wild talking about documents over a hundred MB in size). We could go further and implement a FIFO on the iterator three lines below, e.g. with a deque that pops while iterating, but this change alone already effectively halves RAM usage.
Schema & API Changes
Security and Privacy
Testing
Monitoring and Logging