digitaldogsbody / mingulay

PHP Zip file parser
GNU Affero General Public License v3.0

Investigate a better way of dealing with large requests #8

Open digitaldogsbody opened 2 years ago

digitaldogsbody commented 2 years ago

The change to using an Interface means that we now do things like request the whole CDR at once, rather than just seeking to the right point and consuming as we go (this is a much better way of doing things in the abstract, since we don't know where the data returned by a Seeker is coming from).

However, this means that with the current implementation of LocalFileSeeker, if you ask it for 2GB of data (this would be a massive CDR, but I guess it could be feasible?), it will happily read 2GB into a string in memory and return it.

I think there should be a better way of doing this, by streaming the data as it is read. Maybe we need to think about implementing a data stream in the Interface?

digitaldogsbody commented 2 years ago

Maybe this is an argument for devolving the actual functionality for finding the different parts back to the Seekers?

I like having the clean abstraction of the main ZipRangeReader being the only part that needs to know about how to find an EOCD etc, but we have lost the nuance of being able to implement the searching differently depending on where the resource is, and therefore possibly some efficiency as well. Hmmmmm.

digitaldogsbody commented 2 years ago

@DiegoPino this is another place where your thoughts would be greatly appreciated :)

DiegoPino commented 2 years ago

@digitaldogsbody I think the solution here is to have the seeker never read a single 2GB block back. Basically, the fetch should happen as with a streaming resource that returns chunks on each invocation (while there is data, get me 64KB).
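
As a rough illustration of that idea (function name, chunk size and file path below are placeholders, not part of the existing Seeker API):

```php
<?php

// Sketch only: instead of returning one huge string, a seeker could
// expose a generator that yields bounded chunks until the requested
// range is exhausted. Names and values here are illustrative.
function getChunks($handle, int $offset, int $length, int $chunkSize = 65536): \Generator
{
    fseek($handle, $offset);
    $remaining = $length;

    while ($remaining > 0 && !feof($handle)) {
        $chunk = fread($handle, min($chunkSize, $remaining));
        if ($chunk === false || $chunk === '') {
            break;
        }
        $remaining -= strlen($chunk);
        yield $chunk;
    }
}

// The caller never holds more than 64KB in memory at once.
$fh = fopen('archive.zip', 'rb');                   // placeholder path
foreach (getChunks($fh, 1024, 262144) as $chunk) {  // example CDR offset/length
    // parse $chunk...
}
fclose($fh);
```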

DiegoPino commented 2 years ago

@digitaldogsbody One way of doing this is to implement a stream-to-stream copy with a persistent pointer that advances in an object created by the seeker (or, even better, an object internal to the seeker that keeps that info around). It's a bit like a state machine, where the currently open file from the ZIP and the cursor/pointer into the resource are kept as private class properties, and which probably implements seekable methods that just encapsulate methods of the seeker (move to position X, fetch XX bytes until the original range of the file is reached).
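
Something along these lines, perhaps (a sketch only; the getData($offset, $length) call on the seeker is assumed for illustration and may not match the real interface):

```php
<?php

// Rough sketch: a small stateful reader that keeps the current position
// as private state and pulls data through the seeker in bounded pieces.
class RangeReader
{
    private $seeker;
    private int $start;      // absolute offset of the range inside the zip
    private int $length;     // total length of the range
    private int $cursor = 0; // how far into the range we have read so far

    public function __construct($seeker, int $start, int $length)
    {
        $this->seeker = $seeker;
        $this->start = $start;
        $this->length = $length;
    }

    // Read up to $bytes from the current cursor, never past the end of
    // the original range, and advance the internal pointer.
    public function read(int $bytes): string
    {
        $remaining = $this->length - $this->cursor;
        if ($remaining <= 0) {
            return '';
        }
        $data = $this->seeker->getData($this->start + $this->cursor, min($bytes, $remaining));
        $this->cursor += strlen($data);
        return $data;
    }

    // Seekable-style helper: jump to a position relative to the range start.
    public function seek(int $position): void
    {
        $this->cursor = max(0, min($position, $this->length));
    }
}
```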

DiegoPino commented 2 years ago

https://www.php.net/manual/en/function.stream-copy-to-stream.php is one option.
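
For reference, a minimal example of that function copying just a byte range between two already-open streams (path, offset and length are placeholder values):

```php
<?php

// Copy 4096 bytes starting at source offset 1024 into another stream
// without ever materialising the range as a PHP string.
$src = fopen('archive.zip', 'rb');   // placeholder path
$dst = fopen('php://output', 'wb');

// Arguments: from, to, max length to copy, offset in the source stream.
$copied = stream_copy_to_stream($src, $dst, 4096, 1024);

fclose($src);
fclose($dst);
```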

DiegoPino commented 2 years ago

In combination with fopen and maybe the memory or temp wrappers: https://www.php.net/manual/en/wrappers.php
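
A sketch of combining the two, buffering the copied range through php://temp so that anything over a chosen threshold spills to a temporary file instead of sitting in RAM (threshold, path, offset and length below are just example values):

```php
<?php

// php://temp keeps data in memory until it exceeds the maxmemory
// threshold (2MB here), then transparently spills to a temp file.
$buffer = fopen('php://temp/maxmemory:' . (2 * 1024 * 1024), 'r+b');

$src = fopen('archive.zip', 'rb');                 // placeholder path
stream_copy_to_stream($src, $buffer, 4096, 1024);  // example length/offset
fclose($src);

// Rewind and consume the buffered range as a stream, not as a string.
rewind($buffer);
while (!feof($buffer)) {
    $chunk = fread($buffer, 65536);
    // parse $chunk...
}
fclose($buffer);
```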

DiegoPino commented 2 years ago

@digitaldogsbody now if we really want to get fancy .... https://www.php.net/manual/en/stream.streamwrapper.example-1.php

This would (as a future enhancement) allow a ziprange://someURLorPATH/somefile. The catch is that you cannot chain two stream wrappers together, so someURLorPATH could be an s3://someresource or an http://anurl, maybe double encoded? or in parentheses? or between {}? That would then be expanded by the ziprange stream wrapper, which would return a handle to the internal resource with a pointer/state/seekable methods etc.
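
A very rough skeleton of what a ziprange:// wrapper could look like, following the streamWrapper example linked above (everything here is hypothetical, including how the inner URL and the byte range would be resolved):

```php
<?php

// Hypothetical skeleton for a ziprange:// stream wrapper. How the inner
// resource URL and the byte range get encoded into the path is an open
// question, so the resolution below is stubbed with placeholder values.
class ZipRangeStream
{
    public $context;
    private $inner;          // handle to the underlying resource
    private int $start = 0;  // range start inside the zip
    private int $length = 0; // range length
    private int $pos = 0;    // current position within the range

    public function stream_open(string $path, string $mode, int $options, ?string &$opened_path): bool
    {
        // e.g. ziprange://{http://example.org/big.zip}/somefile
        // Real logic would resolve the inner URL and look up the file's
        // offset/length from the central directory.
        $this->inner  = fopen('archive.zip', 'rb'); // stub
        $this->start  = 1024;                       // stub
        $this->length = 4096;                       // stub
        return $this->inner !== false;
    }

    public function stream_read(int $count): string
    {
        $remaining = $this->length - $this->pos;
        if ($remaining <= 0) {
            return '';
        }
        fseek($this->inner, $this->start + $this->pos);
        $data = fread($this->inner, min($count, $remaining));
        $this->pos += strlen($data);
        return $data;
    }

    public function stream_eof(): bool
    {
        return $this->pos >= $this->length;
    }

    public function stream_close(): void
    {
        fclose($this->inner);
    }
}

stream_wrapper_register('ziprange', ZipRangeStream::class);
```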

DiegoPino commented 2 years ago

Ok, that was too much. Sorry!

digitaldogsbody commented 2 years ago

@digitaldogsbody I think the solution here is to have the seeker never read a single 2GB block back. Basically, the fetch should happen as with a streaming resource that returns chunks on each invocation (while there is data, get me 64KB).

Ok, this makes sense. You could still ask for a subset of the data, right?

For example, when we get the Central Directory Records from the Seeker, we know the offset and total length, but then the actual parsing is of different-sized chunks:

1) Read 46 bytes (the fixed-length record) and extract the lengths of the 3 variable fields: file name, extra, comment
2) Read len(filename) bytes
3) Read len(extra) bytes
4) Read len(comment) bytes
5) Goto 1

So let's say we have 3 records for a total size of 268 bytes: if we had to do a series of fixed 64-byte chunked reads instead of 46+15+10+20 + 46+26+0+0 + 46+30+0+29, then the loop logic suddenly gets significantly more complex, because we have to deal with the boundary of a record maybe falling halfway through a chunk, or an individual field being split across chunks.

If we can do the streaming whilst reading an arbitrary number of bytes at a time, then we can stick with the loop (and the extra logic handling chunking goes only to the Seeker) and it will be perfect. So then you still call the Seeker with "hey, I want 268 bytes of data from offset X" and consume those bytes however you like.
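
Roughly, the parsing loop then stays as simple as this sketch, where $reader stands in for whatever object fronts the Seeker and returns exactly the next $n bytes of the requested range on each read($n) call (hypothetical names throughout, along the lines of the RangeReader sketch above):

```php
<?php

// Sketch of the CDR parsing loop described above. $reader is assumed to
// handle all chunking internally and to return exactly the bytes asked for.
$totalRecords = 3; // e.g. taken from the EOCD record
$records = [];

for ($i = 0; $i < $totalRecords; $i++) {
    $fixed = $reader->read(46); // fixed-length part of the CDR entry

    // File name / extra / comment lengths sit at offsets 28, 30 and 32
    // of the fixed record (little-endian unsigned shorts).
    $lens = unpack('vnameLen/vextraLen/vcommentLen', substr($fixed, 28, 6));

    $records[] = [
        'fixed'    => $fixed,
        'filename' => $reader->read($lens['nameLen']),
        'extra'    => $reader->read($lens['extraLen']),
        'comment'  => $reader->read($lens['commentLen']),
    ];
}
```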

digitaldogsbody commented 2 years ago

Ok, that was too much. Sorry!

Never too much, Diego! It's always a good learning experience for me :)