digitaldogsbody opened this issue 2 years ago
Maybe this is an argument for devolving the actual functionality for finding the different parts back to the Seekers?
I like having the clean abstraction of the main ZipRangeReader being the only part that needs to know how to find an EOCD etc., but we have lost the ability to implement the searching differently depending on where the resource is, and therefore possibly some efficiency as well. Hmmmmm.
@DiegoPino this is another place where your thoughts would be greatly appreciated :)
@digitaldogsbody I think the solution here is to use the Seeker so that we never read a single 2GB block back. Basically the fetch should happen as with a streaming resource that, on each invocation ("while there is data, get me 64KB"), returns chunks.
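As a rough illustration of that "while there is data, get me 64KB" pattern, a Seeker could expose a generator that yields chunks instead of returning one big string. This is only a sketch; the fetchChunks() name and signature are assumptions, not the existing API:

```php
<?php
// Hypothetical sketch only: a chunked fetch on a local Seeker.
// fetchChunks() is an assumed name, not part of the current interface.
class LocalFileSeeker
{
    /** @var resource */
    private $handle;

    public function __construct(string $path)
    {
        $this->handle = fopen($path, 'rb');
    }

    /** Yield the requested range in small chunks rather than one in-memory string. */
    public function fetchChunks(int $offset, int $length, int $chunkSize = 65536): \Generator
    {
        fseek($this->handle, $offset);
        $remaining = $length;
        while ($remaining > 0 && !feof($this->handle)) {
            $data = fread($this->handle, min($chunkSize, $remaining));
            if ($data === false || $data === '') {
                break;
            }
            $remaining -= strlen($data);
            yield $data;
        }
    }
}

// Usage: foreach ((new LocalFileSeeker('archive.zip'))->fetchChunks($offset, $len) as $chunk) { ... }
```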
@digitaldogsbody One way of doing this is to implement a stream-to-stream copy with a persistent pointer that advances, held either by the Seeker itself or, even better, by an internal object the Seeker creates that keeps that info around. It's a bit like a state machine where the currently open file from the ZIP and the cursor/pointer into the resource are kept as private class properties, with seekable methods that just encapsulate methods of the Seeker (move to position X, fetch XX bytes, until the end of the original range of the file is reached).
In combination with fopen and maybe the memory or temp wrappers: https://www.php.net/manual/en/wrappers.php
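A minimal sketch of that state-machine idea might look like the following; the getData(offset, length) method name on the Seeker is an assumption, not the current API:

```php
<?php
// Hypothetical sketch: keep the cursor into a byte range as private state and
// expose seek/read calls that just wrap the Seeker in bounded requests.
class SeekableRange
{
    private int $cursor = 0;

    public function __construct(
        private object $seeker,  // hypothetical Seeker instance exposing getData()
        private int $start,      // absolute offset of the range in the resource
        private int $length      // total length of the range
    ) {}

    public function seek(int $position): void
    {
        $this->cursor = min(max(0, $position), $this->length);
    }

    public function read(int $bytes): string
    {
        // Never ask the Seeker for more than what is left in the range.
        $bytes = min($bytes, $this->length - $this->cursor);
        if ($bytes <= 0) {
            return '';
        }
        $data = $this->seeker->getData($this->start + $this->cursor, $bytes);
        $this->cursor += strlen($data);
        return $data;
    }

    public function eof(): bool
    {
        return $this->cursor >= $this->length;
    }
}
```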
@digitaldogsbody Now, if we really want to get fancy... https://www.php.net/manual/en/stream.streamwrapper.example-1.php
This would (as a future enhancement) allow a ziprange://someURLorPATH/somefile URL.
The catch here is that you cannot chain two stream wrappers, so someURLorPATH could be an s3://someresource or an http://anurl, maybe double-encoded? Or in parentheses? Or between {}?
The ziprange stream wrapper would then expand it and return a handle to the internal object with its pointer/state/seekable methods/etc.
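For reference, a bare-bones and entirely hypothetical skeleton of such a wrapper could look like this; the {} convention for the inner URL and all of the method bodies are assumptions, only the registration and callback names come from PHP's streamWrapper prototype:

```php
<?php
// Hypothetical sketch of a ziprange:// stream wrapper. Resolving the inner URL
// via ZipRangeReader is left as a placeholder.
class ZipRangeStreamWrapper
{
    /** @var resource */
    public $context;
    private $inner;          // handle onto the resolved ZIP member data
    private int $position = 0;

    public function stream_open(string $path, string $mode, int $options, ?string &$opened_path): bool
    {
        // e.g. "ziprange://{s3://bucket/archive.zip}/somefile.txt"
        // -> split into inner URL + member name, resolve via ZipRangeReader,
        //    and obtain a seekable handle over the member's bytes.
        $this->inner = fopen('php://temp', 'r+b'); // placeholder for the resolved member stream
        return $this->inner !== false;
    }

    public function stream_read(int $count): string
    {
        $data = fread($this->inner, $count);
        if ($data === false) {
            return '';
        }
        $this->position += strlen($data);
        return $data;
    }

    public function stream_seek(int $offset, int $whence = SEEK_SET): bool
    {
        if (fseek($this->inner, $offset, $whence) === 0) {
            $this->position = ftell($this->inner);
            return true;
        }
        return false;
    }

    public function stream_tell(): int
    {
        return $this->position;
    }

    public function stream_eof(): bool
    {
        return feof($this->inner);
    }
}

stream_wrapper_register('ziprange', ZipRangeStreamWrapper::class);
// fopen('ziprange://{s3://bucket/archive.zip}/somefile.txt', 'rb');
```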
Ok, that was too much. Sorry!
> @digitaldogsbody I think the solution here is to use the Seeker so that we never read a single 2GB block back. Basically the fetch should happen as with a streaming resource that, on each invocation ("while there is data, get me 64KB"), returns chunks.
Ok, this makes sense. You could still ask for a subset of the data, right?
For example, when we get the Central Directory Records from the Seeker, we know the offset and total length, but then the actual parsing is of different sized chunks:
1. Read 46 bytes (the fixed-length record) and extract the lengths of the 3 variable fields: file name, extra, comment
2. Read len(filename) bytes
3. Read len(extra) bytes
4. Read len(comment) bytes
5. Goto 1
So let's say we have 3 records for a total size of 268 bytes. If we had to do fixed 64-byte chunked reads instead of reads of 46+15+10+20, 46+26+0+0 and 46+30+0+29 bytes, then the loop logic suddenly gets significantly more complex, because we have to deal with the boundary of a record maybe falling halfway through a chunk, or an individual field being split across chunks.
If we can do the streaming whilst reading an arbitrary number of bytes at a time, then we can stick with the loop (and the extra logic handling chunking goes only into the Seeker) and it will be perfect. So you still call the Seeker with "hey, I want 268 bytes of data from offset X" and consume those bytes however you like.
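To make the boundary problem concrete, here is a sketch of that parse loop. It assumes something that exposes read(int $bytes): string over the already-fetched CDR byte range (for instance the SeekableRange sketched earlier); the field offsets follow the standard ZIP central directory file header layout:

```php
<?php
// Hypothetical sketch of the parsing loop described above.
function parseCentralDirectory($range, int $recordCount): array
{
    $entries = [];
    for ($i = 0; $i < $recordCount; $i++) {
        // 1) fixed 46-byte header; name/extra/comment lengths live at offsets 28, 30 and 32
        $fixed = $range->read(46);
        $lens  = unpack('vname/vextra/vcomment', substr($fixed, 28, 6));

        // 2)-4) variable-length fields, read with exactly the sizes the header declares
        $entries[] = [
            'name'    => $lens['name']    ? $range->read($lens['name'])    : '',
            'extra'   => $lens['extra']   ? $range->read($lens['extra'])   : '',
            'comment' => $lens['comment'] ? $range->read($lens['comment']) : '',
        ];
        // 5) goto 1: the next fixed header starts right after the comment
    }
    return $entries;
}
```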
> Ok, that was too much. Sorry!
Never too much, Diego! It's always a good learning experience for me :)
The change to using an Interface means that we now do things like request the whole CDR at once, rather than just seeking to the right point and consuming as we go (this is a much better way of doing things in the abstract, since we don't know where the data returned by a Seeker is coming from).
However, this means that with the current implementation of LocalFileSeeker, if you ask it for 2GB of data (this would be a massive CDR, but I guess it could be feasible?), it will happily read 2GB into a string in memory and return it.
I think there should be a better way of doing this, by streaming the data as it is read. Maybe we need to think about adding a data stream to the Interface?
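One possible shape for that (purely a sketch; the getStream()/getData() names are assumptions, not the current Interface) would be a streaming method alongside the existing whole-range call, with LocalFileSeeker backing it with a php://temp stream so a huge CDR spills to a temp file instead of sitting in one string:

```php
<?php
// Hypothetical sketch of a streaming addition to the Seeker interface.
interface SeekerInterface
{
    /** Existing-style call: returns the whole range as a string. */
    public function getData(int $offset, int $length): string;

    /** Streaming variant: returns a seekable, read-only handle over the same range. */
    public function getStream(int $offset, int $length);
}

class LocalFileSeeker implements SeekerInterface
{
    public function __construct(private string $path) {}

    public function getData(int $offset, int $length): string
    {
        return stream_get_contents($this->getStream($offset, $length));
    }

    public function getStream(int $offset, int $length)
    {
        $source = fopen($this->path, 'rb');
        fseek($source, $offset);

        // php://temp keeps small amounts in memory and transparently spills
        // to a temporary file once the data grows, so a 2GB CDR never has to
        // live wholly in a PHP string.
        $out = fopen('php://temp', 'r+b');
        stream_copy_to_stream($source, $out, $length);
        fclose($source);
        rewind($out);
        return $out;
    }
}
```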