Closed — gammazero closed this 11 months ago
Attention: 60 lines in your changes are missing coverage. Please review.
Comparison is base (`5742fa8`) 73.94% compared to head (`0bd67a0`) 73.83%.
I think this is OK and solves the bulk of the issues at the singularity level, but unless I'm not looking at it right, you're going to end up having to do basically the same style of thing in motion. Over there, the singularity blob reader still only implements a basic `io.ReadSeeker`, and motion uses `http.ServeContent`, so it'll still end up doing multiple small reads through `io.CopyN` for each range requested. Those reads will end up here, but each of them will instantiate a new version of all of this, and we're back to lots of smaller reads and no ability to take advantage of `WriteTo`.
So what to do about that? I think you're going to have to do something persistent in motion that hangs on to a `RetrieveFile` reference and is able to do a `WriteTo` over there as well. That could be possible: `SingularityReader` is held on to for the entirety of the operation, so if it gets a `WriteTo` and can do this same juggling with a handle into singularity, then we might solve this all.
But then after all that I guess you have the complexity question: is it better to solve this through the abstractions that let us continue to use `http.ServeContent` (and many other utilities and operations that assume standard `io` interfaces), or would it be simpler to parse the ranges and pass them down, returning a single reader that does it all? :shrug:
I'm against having an actual read happening inside `Seek()`. If we really need to add this optimization, I'd recommend adding a field `seekMode` which allows the caller to choose the behavior, so that ...
I agree, as that seems like the more correct/expected behavior. I have removed that handling of a seek within a read, as well as the corresponding handling here. Now the code does not continue reading from the same stream (pipe reader) following a Seek.
Worth clarifying this point because I think it's kind of important to what's being achieved here:

> This looks inefficient, but given http.ServeContent will never seek again once Read() starts, this path will never be reached

This is unfortunately not true, because as long as we accept HTTP spec semantics of `Range` headers, we can get stacked ranges in the same query, and `http.ServeContent` will faithfully skip around the ranges as requested, seeking each time: https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/net/http/fs.go;l=325-337
I hate this about the spec and would love to know what utility it has for general use, and specifically whether it's even going to be useful to Motion users. Perhaps we can't know this ahead of time and we just have to accept it. My inclination would be to ditch the Go `Range` handling entirely and just accept a singular range, rejecting requests with commas in the range request, and then fix it later if we find a tool/user of motion that needs something more sophisticated.
But, we can also just go along with it, and I think the solution @gammazero has here gets at this problem.
Optimize retrieval so that when requested retrieval ranges do not align with singularity file ranges, only the minimal number of retrieval requests are made.
This is accomplished by creating a separate reader for each singularity file range. For reads larger than a range, multiple ranges are read until the read request is satisfied or all data is read. For reads smaller than the data remaining in the range, the range reader is kept open so that subsequent reads can continue from it.
This approach associates a reader with each Singularity file range, and not the ranges requested via the API (in HTTP range header). This avoids needing to parse the range header in order to create readers where each reads some number of Singularity ranges. Rather, as arbitrary requested ranges are read, an existing reader for the corresponding singularity range(s) is reused if the requested range falls on a singularity range from a previous read. This also means that there is only a single retrieval for each singularity range, whereas if readers were associated with requested ranges then multiple readers could overlap the same singularity range and require multiple retrievals of the same range.
Fixes https://github.com/data-preservation-programs/singularity/issues/366 Fixes https://github.com/filecoin-project/motion/issues/143
As an optimization, only one singularity range reader is maintained at a time. This works because once a new singularity range is selected by the requested range read, it is highly unlikely that a subsequent read request will fall on a singularity range that was already read from, previous to the new one.
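A toy model of that reuse (types and names here are illustrative, not the PR's actual code): at most one open reader, covering the current singularity range, with a retrieval counted only when a read moves to a different range or breaks sequential order:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// fileRange stands in for a Singularity file range: a contiguous span
// of the file that can be fetched with a single retrieval request.
type fileRange struct{ offset, length int64 }

// rangeReader caches at most one open reader, keyed by the Singularity
// range it covers. A new retrieval is made only when a read moves to a
// range other than the one currently open.
type rangeReader struct {
	ranges     []fileRange
	data       string // stand-in for the remote file content
	current    int    // index of the open range; -1 if none
	pos        int64  // absolute position of the open reader
	reader     io.Reader
	retrievals int // counts simulated retrieval requests
}

func (r *rangeReader) openRange(i int) {
	r.retrievals++ // one retrieval per newly opened Singularity range
	fr := r.ranges[i]
	r.reader = strings.NewReader(r.data[fr.offset : fr.offset+fr.length])
	r.current = i
}

// ReadAt serves an arbitrary requested range, reusing the open reader
// when the request continues sequentially within the same range.
func (r *rangeReader) ReadAt(p []byte, off int64) (int, error) {
	for i, fr := range r.ranges {
		if off < fr.offset || off >= fr.offset+fr.length {
			continue
		}
		if r.current != i || r.pos != off {
			r.openRange(i)
			// Discard up to the requested offset within the range.
			io.CopyN(io.Discard, r.reader, off-fr.offset)
			r.pos = off
		}
		n, err := r.reader.Read(p)
		r.pos += int64(n)
		return n, err
	}
	return 0, io.EOF
}

func main() {
	r := &rangeReader{
		data:    "AAAABBBBCCCC",
		ranges:  []fileRange{{0, 4}, {4, 4}, {8, 4}},
		current: -1,
	}
	buf := make([]byte, 2)
	// Two sequential 2-byte reads within the first range: one retrieval.
	r.ReadAt(buf, 0)
	r.ReadAt(buf, 2)
	// A read in the next range triggers exactly one more retrieval.
	r.ReadAt(buf, 4)
	fmt.Println("retrievals:", r.retrievals) // 2
}
```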
Additional changes:
- The `filecoinReader` implementation supports the `io.WriterTo` interface to allow direct copying to an `io.Writer`.
- The `FilecoinRetriever` interface supports the `RetrieveReader` function, which returns an `io.ReadCloser` to read data from.

Benchmark
Added a benchmark of filecoin retrievals to compare before and after optimization.
The benchmark uses a file composed of 4 sections, each 16 MiB in size. The entire file is retrieved by requesting 1 MiB chunks. The 1 MiB reads are done through `io.CopyN`, which copies the data through a 32 KiB buffer.
The non-optimized version does a retrieval for each buffer copy. The optimized version does only as many retrievals as there are independently retrievable sections of the file.
Before optimization:
After optimization: (PR https://github.com/data-preservation-programs/singularity/pull/404)