esmero / archipelago-deployment

Archipelago Commons Docker Deployment Repository

Upgrade to PHP 7.4.3+ (Drupal 8/9 does not officially support PHP 8) #75

Open DiegoPino opened 3 years ago

DiegoPino commented 3 years ago

Hey. Memory leaks and garbage collection in PHP.

So. During my super production tests, while allowing a HUGE file (Gbyte+) to be delivered on demand to a viewer and dealing with a Range / If-Range request, I saw a few PHP out-of-memory messages. Guzzle was leaking memory through Symfony (BinaryFileResponse).

NOTICE: PHP message: PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 989388968 bytes) in /var/www/html/vendor/guzzlehttp/psr7/src/Stream.php on line 225

Digging deeper and deeper, this happened to be a PHP bug (not our first one, but it also means we are stretching things quite a bit!). See https://github.com/php/php-src/pull/5014

The actual fix was merged in PHP 7.4.3+. So. Time to move forward if we want to support really large files. Related to this exploratory commit:

https://github.com/esmero/format_strawberryfield/commit/f53adb4e62833bdf8fa581659a10c14e6834d378

What will happen?

But now some testing first!!

DiegoPino commented 3 years ago

Ok, this is what I managed (had to be quite explicit)

docker exec -ti esmero-php bash -c "php --version"
PHP 7.4.9 (cli) (built: Sep  1 2020 02:58:08) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.9, Copyright (c), by Zend Technologies
DiegoPino commented 3 years ago

And no luck. The way the Symfony BinaryFileResponse is written makes me think there will always be memory issues.

Look https://github.com/symfony/http-foundation/blob/5.x/BinaryFileResponse.php#L290-L313

If there is a Range request then the stream may (depending on the size of the range) fit in memory, but the fact that it fopen()s the file directly instead of reading it in chunks is bad.

I have a few (lots of code) approaches:

Use a similar approach to the streaming response I was generating, but with offsets to act on a Range / If-Range header: read 4096-byte or so chunks at a time but only copy the ones I need to the output. (See the sketch below.)
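A minimal sketch of that idea (the chunk size, the variable names, and the assumption that $stream is a PSR-7 stream while $offset/$end come from the parsed Range header are all illustrative, not the actual module code):

```php
use Symfony\Component\HttpFoundation\StreamedResponse;

// Read the source in ~4096-byte chunks and only echo the bytes that fall
// inside the requested range, so nothing before the offset has to stay in memory.
$response = new StreamedResponse(function () use ($stream, $offset, $end) {
  $position = 0;
  while (!$stream->eof() && $position <= $end) {
    $chunk = $stream->read(4096);
    $chunkEnd = $position + strlen($chunk) - 1;
    if ($chunkEnd >= $offset) {
      // Trim the chunk to the part that overlaps the requested range.
      $start = max(0, $offset - $position);
      $length = min($chunkEnd, $end) - ($position + $start) + 1;
      echo substr($chunk, $start, $length);
    }
    $position += strlen($chunk);
  }
}, 206);
```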

Gosh.

So, this of course affects Giancarlo (FileSystem) and us (S3). I can work around the S3 case by checking extension/size and asking for a presigned URL, letting the whole thing be handled as a static file / natively, roughly as sketched below.
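For the S3 side, the presigned URL workaround would look roughly like this with the AWS SDK for PHP (the endpoint, bucket, key and expiry below are placeholders; credentials are assumed to come from the environment):

```php
use Aws\S3\S3Client;

// Hand the viewer a presigned URL so S3/MinIO serves the file (and any range
// requests) natively, instead of proxying every byte through PHP.
$s3 = new S3Client([
  'version' => 'latest',
  'region' => 'us-east-1',
  'endpoint' => 'http://esmero-minio:9000',
  'use_path_style_endpoint' => true,
]);
$command = $s3->getCommand('GetObject', [
  'Bucket' => 'archipelago',
  'Key' => 'some/big/file.wacz',
]);
$presigned = $s3->createPresignedRequest($command, '+20 minutes');
$url = (string) $presigned->getUri();
```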

Ideas?

DiegoPino commented 3 years ago

Success!! @giancarlobi @alliomeria I knew I could. PHP wanted me not. AWS SDK S3 wanted me not. Guzzle wanted me not. But I did it. Before I fall asleep I will explain the solution:

The problem:

The S3/Guzzle mix, when requesting a range of bytes, tries to open a seekable stream (meaning I can move around in the data and seek to the bytes I need). Basically they try to GET the whole file (imagine a 1 Gbyte one) into a special Guzzle stream named CachingStream. In theory this is OK. But! When Symfony does the binary response return it copies the values from that special stream from the requested offset to the requested end, and when that happens (stupid PHP) it actually LOADS into memory the offset (all the bytes that come before the requested range), basically making the whole caching thing useless and using all the memory the server has.

It's not really this method https://github.com/symfony/http-foundation/blob/5.x/BinaryFileResponse.php#L303 that is the culprit, but the Guzzle CachingStream read method, which (because of the seek) first has to catch up with the ~900 Mbytes that come before the requested segment (range). See https://github.com/guzzle/psr7/blob/master/src/CachingStream.php#L78

And there is NO way I can change that. (I could rewrite the whole thing and make it read in chunks... but that is too much.)
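For reference, a toy illustration (not project code) of why the seek is so expensive; $remoteBody stands in for the non-seekable PSR-7 body the S3 SDK/Guzzle hands back:

```php
use GuzzleHttp\Psr7\CachingStream;

// Wrapping a non-seekable remote body so it becomes "seekable".
$cached = new CachingStream($remoteBody);

// Seeking to the start of the requested range forces CachingStream to read
// and buffer every byte before it first: for a range ~900 Mbytes into the
// file, that is ~900 Mbytes pulled through PHP before we send anything.
$cached->seek(900 * 1024 * 1024);
```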

Solution:

Our storage servers can also deliver ranges. So when someone asks me for a range, I can ask S3/Minio/Azure etc. for the same range back instead of trying to cache it all in memory. I created a new Response class that translates the range I'm being asked for into a ranged request against the remote URL of our storage, which means I get back only the bytes I need. Then, instead of copying data between an OFFSET and an END range, I just copy data from 0 to the end byte, which does not require seeking at all, means PHP/S3/Guzzle don't try to make me use the CachingStream.php that leaks memory, and lets me deliver what I want, fast and without big memory consumption.
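In sketch form (a simplified illustration of the idea, not the actual new Response class; $s3, $key and $request are assumed to already exist), the translation looks like this:

```php
use Symfony\Component\HttpFoundation\StreamedResponse;

// Forward the incoming Range header to the storage backend so only the
// requested bytes ever come back, then copy the already-trimmed body straight
// to the output starting at byte 0: no seeking, no CachingStream.
$result = $s3->getObject([
  'Bucket' => 'archipelago',
  'Key' => $key,
  // e.g. the client sent "Range: bytes=2539193-2550150"; ask MinIO/S3 for the same.
  'Range' => $request->headers->get('Range'),
]);

$response = new StreamedResponse(function () use ($result) {
  $body = $result['Body'];
  while (!$body->eof()) {
    echo $body->read(8192);
    flush();
  }
}, 206);
$response->headers->set('Content-Range', $result['ContentRange']);
$response->headers->set('Content-Length', (string) $result['ContentLength']);
$response->headers->set('Accept-Ranges', 'bytes');
```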

Some logs (because who cares, it's 12AM again and I will be able to sleep).

This is my response to a ranged request from replay.web:

NOTICE: PHP message: ranged deliver
NOTICE: PHP message: array (
  'cache-control' => 
  array (
    0 => 'public',
  ),
  'date' => 
  array (
    0 => 'Thu, 19 Nov 2020 04:13:20 GMT',
  ),
  'last-modified' => 
  array (
    0 => 'Thu, 19 Nov 2020 02:50:59 GMT',
  ),
  'etag' => 
  array (
    0 => 'W/"724bc829e985ce820059804bdb48f200"',
  ),
  'content-type' => 
  array (
    0 => 'application/warc',
  ),
  'content-length' => 
  array (
    0 => 10958,
  ),
  'accept-ranges' => 
  array (
    0 => 'bytes',
  ),
  'content-range' => 
  array (
    0 => 'bytes 2539193-2550150/312400453',
  ),
)
NOTICE: PHP message: offset:2539193
NOTICE: PHP message: end:2550150
NOTICE: PHP message: about to stream_copy_to_stream Ranged binary response
NOTICE: PHP message: max:10958
NOTICE: PHP message: offset:10958

So I only deliver 10958 bytes.
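(That matches the headers above: 2550150 - 2539193 + 1 = 10958 bytes, exactly the content-length MinIO reports alongside the content-range.)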

And this is my internal call to MinIO (in this case):

esmero-minio [REQUEST s3.GetObject] 04:13:21.595
esmero-minio GET /archipelago/724/application-moabvideos-e09611ba-4b5f-4c76-8167-2c5e27a1c583.wacz
esmero-minio Proto: HTTP/1.1
esmero-minio Host: esmero-minio:9000
esmero-minio X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
esmero-minio Authorization: AWS4-HMAC-SHA256 Credential=minio/20201119/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=642e8a090b36a04ffe61b23af43d338c00fa2fb6350bb3ad52c96a3fbaca07fe
esmero-minio Aws-Sdk-Invocation-Id: d109a6bd95c392ed6ca1a7fccde1da5e
esmero-minio Content-Length: 0
esmero-minio Range: bytes=2539193-2550150
esmero-minio User-Agent: aws-sdk-php/3.162.0 GuzzleHttp/6.5.5 curl/7.69.1 PHP/7.4.9
esmero-minio Aws-Sdk-Retry: 0/0
esmero-minio Connection: close
esmero-minio X-Amz-Date: 20201119T041321Z
esmero-minio <BODY>
esmero-minio [RESPONSE] [04:13:21.634] [ Duration 39.108ms  ↑ 130 B  ↓ 11 KiB ]
esmero-minio 206 Partial Content
esmero-minio Content-Length: 10958
esmero-minio Content-Type: application/warc
esmero-minio Server: MinIO/RELEASE.2020-11-13T20-10-18Z
esmero-minio Vary: Origin
esmero-minio X-Xss-Protection: 1; mode=block
esmero-minio Accept-Ranges: bytes
esmero-minio Content-Range: bytes 2539193-2550150/312400453
esmero-minio Content-Security-Policy: block-all-mixed-content
esmero-minio ETag: "724bc829e985ce820059804bdb48f200"
esmero-minio Last-Modified: Thu, 19 Nov 2020 02:50:59 GMT
esmero-minio X-Amz-Request-Id: 1648CD7C75EDCB78
esmero-minio <BODY>
esmero-minio 

Anyways... I managed to deliver WACZ files without issues. The first call is always for the end of the web archive (WACZ is ZIP-based, so readers start from the central directory at the end of the file), which is why I need to be able to do it this way.

The code needs some cleanup now, needs to be more verbose about the checks I do, and should be more flexible for more backend options (the Azure range request is different).

Also I learned:

That a range request of

Range: bytes 2539193-2550150/312400453

is invalid for S3, and if I send that, the whole object gets delivered.

The right one, per https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35, is

Range: bytes=2539193-2550150

The things you learn reading Go and Java code.
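In code terms the difference is just how the header string is assembled (a tiny sketch using the numbers from the logs above):

```php
// Invalid as a *request* header: this is Content-Range response syntax,
// and S3/MinIO will ignore it and return the whole object.
$bad = sprintf('bytes %d-%d/%d', 2539193, 2550150, 312400453);

// Valid per RFC 2616 §14.35: "bytes=first-last", no total size.
$good = sprintf('bytes=%d-%d', 2539193, 2550150);
```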

DiegoPino commented 3 years ago

Finally, because I'm a full NERD, let me show you a thing that took me one hour to solve...

The first file is what Archipelago returned, the second is what Archipelago should have returned (both 65558 bytes):

[screenshot comparing the two files]

See the 0? They are NOT equal. What was happening? Did I calculate the ranges wrong? Was S3 failing?

No! While debugging like crazy I had added an error_log(var_export($something)) without the "true" argument, and var_export printed the value (a 0) straight into the streamed output!
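The fix is just the second argument (plain PHP behaviour, nothing Archipelago-specific):

```php
// Wrong: without the second argument var_export() prints straight to the
// output (here, into the middle of the streamed bytes) and returns null.
error_log(var_export($something));

// Right: with true, var_export() returns the string and nothing leaks
// into the response.
error_log(var_export($something, true));
```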

Gosh. Yeah, I know, who cares. Still funny.