codex-storage / nim-codex

Decentralized Durability Engine
https://codex.storage
Apache License 2.0

Remove padding during retrieval via REST API #98

Closed michaelsbradleyjr closed 2 years ago

michaelsbradleyjr commented 2 years ago

The chunker pads chunks that are smaller than BlockSize.

Currently, when a CID is retrieved via the REST API, any padding is included in the returned data. This can be seen by, e.g., uploading a small text file via a node's REST API and then retrieving it.

$ echo Hello World > myplain.txt

$ hexdump -b myplain.txt 
0000000 110 145 154 154 157 040 127 157 162 154 144 012                
000000c

$ curl --data-binary @myplain.txt http://127.0.0.1:8080/api/dagger/v1/upload
zdj7WWsHUroHjMrvpmXqPnyf4gBM5DHebN1MP42ocK8UKNH51

$ curl http://127.0.0.1:8080/api/dagger/v1/download/zdj7WWsHUroHjMrvpmXqPnyf4gBM5DHebN1MP42ocK8UKNH51 --output myplain2.txt

$ hexdump -b myplain2.txt 
0000000 110 145 154 154 157 040 127 157 162 154 144 012 000 000 000 000
0000010 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
0001f00

The original size of the upload should be tracked as metadata (in the manifest?) so that any padding can be removed during retrieval.

acud commented 2 years ago

Hello :wave: I've been looking into this issue. The thing that made sense at first glance would be to change the manifest's add signature to take not only the cid but also the span, i.e. the size of the data behind that cid. The manifest would then have a span field, and we would add each block's size to it as blocks are added.

The only problem I found is that the manifest constructor takes a sequence of blocks, and this sort of change would make a mess of the constructor (it would have to take a sequence of tuples instead). Personally, my recommendation would be to drop the cid sequence from the constructor in favor of a more explicit API. Is the cid parameter of the constructor strictly needed?

dryajov commented 2 years ago

Hey @acud! Welcome!

So, there is already a blockSize field in the manifest, and with blocks being fixed size we don't have to capture it per individual block. What we need is an additional field that captures the length of the original data, something like uploadSize or the like. There are, however, a few considerations to keep in mind.

First, if you look at the original manifest definition:

type
  Manifest* = ref object of RootObj
    rootHash*: ?Cid         # root (tree) hash of the contained data set
    blockSize*: int         # size of each contained block (might not be needed if blocks are len-prefixed)
    blocks*: seq[Cid]       # block Cid
    version*: CidVersion    # Cid version
    hcodec*: MultiCodec     # Multihash codec
    codec*: MultiCodec      # Data set codec
    case protected*: bool   # Protected datasets have erasure coded info
    of true:
      K*: int               # Number of blocks to encode
      M*: int               # Number of resulting parity blocks
      originalCid*: Cid     # The original Cid of the dataset being erasure coded
      originalLen*: int     # The length of the original manifest
    else:
      discard

You'll see that if the manifest is protected, there is already an originalLen field. This is the size of the non-erasure-coded manifest: when erasure coding, we wrap the original non-erasure-coded manifest and extend it into a "protected", or erasure-coded, manifest. Since the coding is systematic, all the original blocks are also part of the erasure-coded manifest, plus the parity blocks. I bring this up because it can be confusing without the right context. So the new field needs to be placed outside of the case declaration and should capture the size of the original data - originalSize/uploadSize could be a good name.
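
For illustration, a minimal sketch of where such a field could sit (originalBytes here is just an illustrative name, and the Cid/codec fields from the definition above are elided):

type
  Manifest* = ref object of RootObj
    blockSize*: int        # size of each contained block
    originalBytes*: int    # length of the original (unpadded) uploaded data
    # ... rootHash, blocks, version, hcodec, codec as in the definition above ...
    case protected*: bool  # protected datasets have erasure coded info
    of true:
      K*: int              # number of blocks to encode
      M*: int              # number of resulting parity blocks
      originalLen*: int    # length of the original manifest, not the raw data
    else:
      discard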

Second, on upload the stream is fed to the chunker, which splits the incoming stream into equal-sized blocks. The chunker operates in two modes, with and without padding. The default is to pad, but since we want to count all the bytes as they come in, we'd need to make sure the last chunk/block is unpadded and pad it ourselves. A better alternative would be to allow the chunker to report the unpadded size it has read up to some point (or EOF) and set that on the manifest.

So to summarize:

- add a field to the manifest, outside the case declaration, that records the size of the original (unpadded) data
- have the chunker report the number of unpadded bytes it has read, and set that on the manifest during upload

acud commented 2 years ago

@dryajov personally I would try to leave the chunker out of it since ideally you could capture the value directly from the data stream of the HTTP request. The chunker may generate more data due to erasure coding, but the underlying data-layer subtree node spans remain the same (read: leaking stuff from the chunker would not actually give you any added value). It would also result in some ugly tuple return values from the chunker, which are not so nice to interact with.

When looking at api.nim it seems that you can have direct access to the actual underlying data size directly when pumping the data out of the stream:

      try:
        while not reader.atEof:
          var
            buff = newSeqUninitialized[byte](BlockSize)
            len = await reader.readOnce(addr buff[0], buff.len)

          buff.setLen(len)
          if len <= 0:
            break

          trace "Got chunk from endpoint", len = buff.len
          await stream.pushData(buff)
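
As a hedged, simplified illustration of that idea - with synchronous std/streams standing in for the async HTTP reader, and pump/originalSize as illustrative names rather than existing code - the size could be accumulated while the body is pumped:

import std/streams

const BlockSize = 4096

proc pump(reader: Stream): int =
  ## Read the request body in BlockSize chunks and return the total number
  ## of bytes actually received, i.e. the unpadded upload size.
  var originalSize = 0
  var buff = newSeq[byte](BlockSize)
  while not reader.atEnd:
    let len = reader.readData(addr buff[0], buff.len)
    if len <= 0:
      break
    originalSize += len
    # ... hand buff[0 ..< len] on to the chunker / block store here ...
  originalSize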

The problem is that it is handled too high up in the stack and the context does not have access to the actual manifest (which is created in store). I would potentially just move this into the store method in node.nim and just do all the async-await magic there. It also would provide a nice clear-cut responsibility - you can just pass a stream directly into store and it will handle it all for you and just give you back the CID.

dryajov commented 2 years ago

personally I would try to leave the chunker out of it since ideally you could capture the value directly from the data stream of the HTTP request

The HTTP api/interface is just one of many ways of invoking the underlying api exposed by the node; it is a very thin layer that should be isolated from the remainder of the stack, because it can be swapped out, for example for stdio when the node is used as a console executable. So, the HTTP layer is really not the best place to do this.

The chunker may generate more data due to erasure coding, but the underlying data-layer subtree node spans remain the same (read: leaking stuff from the chunker would not actually give you any added value).

The chunker is the only thing that sits between the incoming stream and the padded blocks it produces. The only other way to do this, without making significant changes to the architecture, is to do the padding manually and disable it on the chunker (it's just a flag). But there is really no reason to do that, and architecturally there is zero cost to adding an accessor to the chunker that reports the number of bytes consumed so far. This should be a simple method/func on the chunker called something like bytesRead.
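
A minimal sketch of that accessor, assuming a simplified synchronous chunker (the real codex Chunker is async and callback-driven, so names and signatures here are illustrative only):

import std/streams

type
  Chunker* = ref object
    chunkSize*: int
    pad*: bool         # pad the last chunk up to chunkSize?
    bytesRead*: int    # unpadded bytes consumed so far - the proposed accessor

proc getChunk*(c: Chunker, input: Stream): seq[byte] =
  ## Read the next chunk, counting the raw bytes before any padding.
  var buff = newSeq[byte](c.chunkSize)   # zero-initialized, so padding is zeros
  var read = 0
  while read < c.chunkSize and not input.atEnd:
    let n = input.readData(addr buff[read], c.chunkSize - read)
    if n <= 0:
      break
    read += n
  c.bytesRead += read                    # running total of consumed bytes
  if read == 0:
    return @[]
  if read < c.chunkSize and not c.pad:
    buff.setLen(read)                    # unpadded mode: trim to what was read
  buff

On upload, node.store could then read chunker.bytesRead once chunking finishes and record it in the manifest's new field.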

The problem is that it is handled too high up in the stack and the context does not have access to the actual manifest (which is created in store). I would potentially just move this into the store method in node.nim and just do all the async-await magic there.

There is a very good reason for not doing this: as I mentioned, it makes the entire architecture less modular and more leaky.

It also would provide a nice clear-cut responsibility - you can just pass a stream directly into store and it will handle it all for you and just give you back the CID.

Not sure I follow; from what you're describing, this is already more or less how it works. The REST layer hands a stream to the node's store method, which hands it down to the chunker, which produces the actual chunks/blocks that are recorded in the manifest. All that is required is to count the incoming bytes and then pass that count down to the manifest.

dryajov commented 2 years ago

@acud I've gone ahead and added PR #110 to reduce some of the glue code required to pump data between the http layer and the node, and especially to avoid unnecessary copying (one less buffer allocation).

Also, as per our discussion, it might make sense to create a new type of stream that augments the existing LPStream with counts of the bytes read and written; this new stream can then wrap the http stream and augment it with the required info. As you noted, putting these methods on the base type might end up being awkward, which I agree with, hence yet another (thin) wrapper might be the best way forward.
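
A sketch of such a wrapper, with std/streams standing in for LPStream (a real implementation would override LPStream's read/write methods rather than define new procs):

import std/streams

type
  CountingStream* = ref object
    inner: Stream
    bytesRead*: int
    bytesWritten*: int

proc readData*(s: CountingStream, buffer: pointer, bufLen: int): int =
  ## Pass the read through and keep a running total of the bytes read.
  result = s.inner.readData(buffer, bufLen)
  s.bytesRead += result

proc writeData*(s: CountingStream, buffer: pointer, bufLen: int) =
  ## Pass the write through and keep a running total of the bytes written.
  s.inner.writeData(buffer, bufLen)
  s.bytesWritten += bufLen

when isMainModule:
  let cs = CountingStream(inner: newStringStream("Hello World\n"))
  var buff = newSeq[byte](64)
  discard cs.readData(addr buff[0], buff.len)
  echo cs.bytesRead   # 12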

As another option, I still kind of like exposing this information on the chunker as well. Considering everything else, it is the least costly and invasive option, and it doesn't feel out of place - after all, the chunker reads data, and there is no reason not to make some stats about that available.

Bulat-Ziganshin commented 2 years ago

I looked into code. My proposal:

I think that's all the changes we need. Note that BlockStore will still store the last block padded to blockSize, and all operations on blocks will use those padded blocks (but they will be filled up with zeros rather than garbage!)

dryajov commented 2 years ago

I think this introduces too many changes for something that can be accomplished with a very small change to the chunker itself (add a consumed field that returns the amount of bytes read) and an additional field to the manifest (originalSize).

For one, StoreStream should work on the manifest's dataset, not the original file; otherwise we'd have to account for padding manually, which is going to be a pain.

The only place where the original file size matters is on download; if we store the original size in the manifest, it should be really easy to simply read up to originalSize during download.

Btw, we don't have to return LPStream(StoreStream.new(node.blockStore, manifest)).success; we can return a BufferStream and push data to it from node.retrieve. Another approach is to add a flag to StoreStream so it can read in either padded or unpadded mode.

I agree that we can remove rabin chunking (in fact, it's just a placeholder for now), but that doesn't merit getting rid of the chunker altogether; it's also useful in tests, for example.

Bulat-Ziganshin commented 2 years ago

yeah, I didn't realize that Chunker is used in other tests, not just tested by itself. So let's keep it.

StoreStream is currently used for

I vote for adding a StoreStream.new parameter, bool unpadded = false, but I hope you have a better name for it

we can return a BufferStream and push data to it from node.retrieve

yeah, the existing fetchBlocksJob scheme looks unreliable. While LPStream reads blocks sequentially, fetchBlocksJob requests them in batches. So, if all the data in the first batch is received simultaneously, it will fulfill the first block read by LPStream and then drop the remaining blocks, so LPStream will continue to read them one by one.

Funny that this may be the actual reason for the slow downloading in your current test!

add a consumed field that returns the amount of bytes read) and an additional field to the manifest (originalSize

agree

dryajov commented 2 years ago

yeah, the existing fetchBlocksJob scheme looks unreliable. While LPStream reads blocks sequentially, fetchBlocksJob requests them in batches. So, if all the data in the first batch is received simultaneously, it will fulfill the first block read by LPStream and then drop the remaining blocks, so LPStream will continue to read them one by one.

Yeah, this is definitely a bandaid solution to prefetch some blocks, but I'm not sure if it actually makes any difference... Something to try out.

dryajov commented 2 years ago

I vote for adding a StoreStream.new parameter, bool unpadded = false, but I hope you have a better name for it

Yeah, I think this is probably the best approach right now, let's do this. Let's just have a default parameter called padded = true and set it to false when we want to read only up to the original data size.
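
A simplified, self-contained sketch of that flag (not the real StoreStream, which resolves blocks from the block store asynchronously): the stream's effective size is either the padded dataset size or the manifest's original byte count.

type
  ManifestSketch = object
    blockSize: int
    originalBytes: int       # the new manifest field discussed above
    blocks: seq[seq[byte]]   # stand-in for blocks resolved via the block store

  StoreStreamSketch = ref object
    manifest: ManifestSketch
    padded: bool
    offset: int

proc newStoreStream(m: ManifestSketch, padded = true): StoreStreamSketch =
  StoreStreamSketch(manifest: m, padded: padded)

proc size(s: StoreStreamSketch): int =
  ## Padded mode exposes the full dataset; unpadded mode stops at the
  ## original upload size.
  if s.padded:
    s.manifest.blocks.len * s.manifest.blockSize
  else:
    s.manifest.originalBytes

proc readOnce(s: StoreStreamSketch, buff: var seq[byte]): int =
  ## Copy as many bytes as fit into buff, never reading past size(s).
  let remaining = s.size - s.offset
  if remaining <= 0:
    return 0
  let n = min(buff.len, remaining)
  for i in 0 ..< n:
    let pos = s.offset + i
    buff[i] = s.manifest.blocks[pos div s.manifest.blockSize][pos mod s.manifest.blockSize]
  s.offset += n
  n

node.retrieve could then construct the stream with padded = false so that the REST download no longer includes the zero padding.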

Bulat-Ziganshin commented 2 years ago

so, we have two independent tasks.

I think we agreed upon the implementation of unpadding:

- the chunker counts the bytes it consumes, and the original (unpadded) data size is stored in a new manifest field (originalSize or originalBytes)
- StoreStream.new gets a padded = true parameter; with padded = false it reads only up to that original size

And for prefetching, we should modify fetchBlocksJob to push data into BufferStream with fixed-size queue.

Also, erasureJob doesn't look reliable either - first, we need enough space to store all the decoded data, otherwise this approach will fail. Second, it will take a lot of time, so reading from LPStream may fail with a timeout.

Time to make a new issue.

dryajov commented 2 years ago

I think we agreed upon the implementation of unpadding:

Yep, sounds good and I like originalBytes better.

And for prefetching, we should modify fetchBlocksJob to push data into BufferStream with fixed-size queue.

I don't think this is what we want, at least not in this context. fetchBlocksJob should really be called prefetchBlock; it's a parallel task that attempts to download some blocks ahead of time and store them in the local store, so pushing onto the BufferStream doesn't make sense here.

Also, erasureJob doesn't look reliable either - first, we need enough space to store all the decoded data, otherwise this approach will fail. Second, it will take a lot of time, so reading from LPStream may fail with a timeout.

For erasure decoding, we probably want a more intelligent downloader that fetches in recovery groups and kicks in recovery as soon as it has any K pieces. If those are the original (systematic) pieces then we don't do anything; if they aren't, we kick in recovery while continuing to download in parallel, then we stop either the recovery or the download, whichever succeeds first.

dryajov commented 2 years ago

I should add that I'm not entirely sure the prefetch is required, but it seemed like a good thing to try out originally; we should definitely experiment with this.

Bulat-Ziganshin commented 2 years ago

and store them in the local store

but the current NetworkStore.getBlock has only this code:

    let blk = await self.engine.requestBlock(cid)
    # TODO: add block to the local store
    return blk.some.success

so implementing that TODO may be the shortest path toward real prefetching.
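
A hedged sketch of what implementing that TODO could look like, using simplified stand-in types rather than the real codex ones:

import std/tables

type
  Cid = string                       # stand-ins, not the actual codex types
  Block = object
    cid: Cid
    data: seq[byte]
  LocalStore = ref object
    blocks: Table[Cid, Block]
  Engine = ref object                # stand-in for the block exchange engine
  NetworkStoreSketch = ref object
    localStore: LocalStore
    engine: Engine

proc requestBlock(e: Engine, cid: Cid): Block =
  ## Stand-in for the (async) network request.
  Block(cid: cid)

proc getBlock(s: NetworkStoreSketch, cid: Cid): Block =
  ## Serve from the local store when possible; otherwise fetch from the
  ## network and cache the result locally - the TODO in the snippet above.
  if cid in s.localStore.blocks:
    return s.localStore.blocks[cid]
  let blk = s.engine.requestBlock(cid)
  s.localStore.blocks[cid] = blk     # add block to the local store
  blk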

For erasure decoding, we probably want a more intelligent downloader that fetches in recovery groups

there is a small detail, however. The current layout of recovery groups is that the first group covers blocks 0, X, 2*X, ..., the second group covers blocks 1, X+1, 2*X+1, and so on. Your idea assumes the "layout B" that we discussed in https://github.com/status-im/codex-research/pull/85#issuecomment-1150623235
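
For concreteness, with X groups laid out that way, group membership is just blockIdx mod X (a small sketch, not codex code):

proc recoveryGroup(blockIdx, X: int): int =
  ## First group covers blocks 0, X, 2*X, ...; second covers 1, X+1, 2*X+1, ...
  blockIdx mod X

when isMainModule:
  doAssert recoveryGroup(0, 4) == 0
  doAssert recoveryGroup(4, 4) == 0
  doAssert recoveryGroup(5, 4) == 1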

dryajov commented 2 years ago

but the current NetworkStore.getBlock has only this code:

Yes, but the engine has a reference to the blockstore and when the block eventually arrives, it will be stored here - https://github.com/status-im/nim-codex/blob/main/codex/blockexchange/engine/engine.nim#L256

Your idea assumes the "layout B" that we discussed in https://github.com/status-im/codex-research/pull/85#issuecomment-1150623235

Maybe, I haven't thought about this in some time, but it is definitely another layer that we might want to consider when making the choice for the layout.

Bulat-Ziganshin commented 2 years ago

Yes, but

you won :) I made a new issue to discuss prefetching

As for this issue, we agreed on the implementation, but I will wait until you and Michael push your PRs.