amphp / byte-stream

A non-blocking stream abstraction for PHP based on Amp.
https://amphp.org/byte-stream
MIT License

ReadableStream: Clear data from buffer on read #110

Closed: vikingjs closed this issue 5 months ago

vikingjs commented 5 months ago

I am using ReadableIterableStream and it is awesome, but I am reading a very long list of large items. I consume the data as quickly as it comes in, but I'm still getting errors when I exceed the buffer size.

I can make the buffer bigger (and bigger and bigger -- 300MB now, likely to exceed 1GB) but chewing up all that RAM for data I've already read is wasteful and seems to impair performance.

It looks to me as though ConcurrentQueueIterator could delete data as it's read fairly easily; FiberLocal already supports that operation. Nothing in the call chain suggests this option is already available, however.

In the most common use cases, once data is read it is no longer needed in the buffer - it has already been consumed. A way to prevent this buildup would be welcome.

trowski commented 5 months ago

Hi @vikingjs!

ReadableIterableStream and ConcurrentQueueIterator do not retain values once they have been consumed.

QueueState (the class underlying the implementation of Queue and ConcurrentQueueIterator) unsets values when they are set to the FiberLocal. See line 86 in QueueState.php.
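
For illustration, the consume step works along these lines (a hypothetical sketch with made-up names, not the actual QueueState source): the buffered value is unset the moment it is handed to a consumer, so consumed chunks are not retained.

    // Hypothetical sketch of the consume step, not the actual QueueState
    // source: the buffer entry is dropped as soon as the value is delivered.
    private function consume(): mixed
    {
        $position = $this->consumePosition++;
        $value = $this->values[$position];
        unset($this->values[$position]); // release the reference immediately

        return $value;
    }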

It sounds to me as though there might be some other bug causing a memory leak, or you're buffering the data you read, which of course will consume that memory as well.

vikingjs commented 5 months ago

That's good to know. The error I'm getting is Amp\Http\Client\ParseException: Configured body size exceeded, so I don't know how code outside the stream would cause it. Perhaps I'm not consuming as quickly as I thought I was.

I will slow down the server side to see what that does for the buffer.

I'm tempted to try to parallelize reading off that stream, but occasionally two reads are required to instantiate one resulting object. Do you know a way to manage that? It's a case where a single item written into the stream on the server requires two reads on the client.
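
One way to keep the pair intact might be to do both reads sequentially in a single consumer and parallelize only the processing of each completed pair. A rough sketch, assuming Amp v3's async() and a hypothetical buildObject() helper:

    use Amp\Future;
    use function Amp\async;

    // Read sequentially so the two chunks that make up one item stay paired,
    // then process each completed pair concurrently.
    $futures = [];
    while (($header = $stream->read()) !== null) {
        $payload = $stream->read();
        if ($payload === null) {
            throw new \RuntimeException('Stream ended mid-item');
        }
        $futures[] = async(fn () => buildObject($header, $payload));
    }
    Future\awaitAll($futures);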

vikingjs commented 5 months ago

This seems strange in SizeLimitingReadableStream:

        $chunk = $this->source->read($cancellation);

        if ($chunk !== null) {
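            // Note: $this->bytesRead accumulates across every read, so the
            // limit is checked against the total bytes streamed so far, not
            // the amount currently buffered.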
            $this->bytesRead += \strlen($chunk);
            if ($this->bytesRead > $this->sizeLimit) {
                $this->exception = new ParseException(
                    "Configured body size exceeded: {$this->bytesRead} bytes received, while the configured limit is {$this->sizeLimit} bytes",
                    HttpStatus::PAYLOAD_TOO_LARGE
                );

                $this->source->close();
            }
        }

The body size limit is not applied to the amount of data currently buffered, but to the total amount read over the lifetime of the stream. How do I get out from under the tyranny of this class?

trowski commented 5 months ago

The HTTP client has a default body size limit (10MB) which can be configured on the Request object using setBodySizeLimit(). This exists to allow completely buffering a response body without worry of consuming too much memory. Since you're streaming the response body, feel free to make this number huge, even just using PHP_INT_MAX to effectively turn off the size limit.
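
For example (a minimal sketch, assuming amphp/http-client; the URL and chunk processing are placeholders):

    use Amp\Http\Client\HttpClientBuilder;
    use Amp\Http\Client\Request;

    $client = HttpClientBuilder::buildDefault();

    $request = new Request('https://example.com/large-stream');
    $request->setBodySizeLimit(PHP_INT_MAX); // effectively disables the limit

    $response = $client->request($request);

    // Stream the body chunk by chunk rather than buffering it all.
    while (($chunk = $response->getBody()->read()) !== null) {
        // process $chunk
    }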

vikingjs commented 5 months ago

That's basically what I have done now. I was hoping to remove that dang thing from the chain as it adds only negative value, but I can get over it. Thanks for your help!

trowski commented 5 months ago

It appears we may have intended a body size of 0 to turn off the limit (e.g. this use), but it seems we should follow through on that, adding the check elsewhere and skipping the creation of SizeLimitingReadableStream when it isn't necessary.
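
Something along these lines, perhaps (a hypothetical sketch of the suggested change, not current behavior):

    // Hypothetical sketch: treat a limit of 0 as "no limit" and skip the
    // wrapper entirely when no positive limit is configured.
    $limit = $request->getBodySizeLimit();

    if ($limit > 0) {
        $body = new SizeLimitingReadableStream($body, $limit);
    }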

I wouldn't concern yourself over an extra method call per chunk. There's a lot more CPU going to decompressing the response body.