Fix 1MB file limit - Githubissues

cryptoquick commented 3 years ago

@kn0wmad mentioned "Not sure if a bug or a feature, but there is currently a 1mb limit on image upload."

Can confirm, this is an issue. Displaying images in general also takes a second, but that's a performance issue for later.

This can be used to chunk up files:

https://docs.rs/ipld-collections/0.3.0/ipld_collections/list/index.html

kn0wmad commented 3 years ago

Would it be possible as a temporary (or permanent) fix to store the file in full quality, but display only a <1MB 'preview?' (This doesn't sound simple, but if it is...)

cryptoquick commented 3 years ago

@kn0wmad No, unfortunately. It's a hard limit imposed by either sled or ipfs_embed. Craven mentioned his work on ipld-collections would be necessary, but that project doesn't work with our current version of ipfs-embed. A part of me is tempted to split the file up linearly and have a sort of index block to encode the vec, but there are a number of problems with that approach:

We'd be recreating what IPLD already does
It won't be verifiable as part of a whole unless the original file's hash was included also, which would be the case if implemented as an actual balanced MerkleDAG.

Pretty pointless to do it poorly. We'll wait for @dvc94ch to finish his work on https://github.com/ipfs-rust/ipfs-embed/pull/31 and then update ipld-collections, but even then, those are basic data structures. To get to larger files, including video and streaming video, we'll need MerkleDAG and TrickleDAG.
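For reference, the "split the file up linearly with an index block" idea described above could look roughly like the sketch below. Everything in it is hypothetical: `IndexBlock`, `split_linear`, `reassemble` and `toy_hash` are illustration names, and `toy_hash` uses the standard library's non-cryptographic `DefaultHasher` as a stand-in for a real Cid multihash. It also shows why the whole-file hash has to be stored next to the chunk hashes if the result is to be checked as one unit.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Block size limit discussed in the thread (1 MiB).
const MAX_BLOCK_SIZE: usize = 1024 * 1024;

/// Stand-in for a content hash. NOT cryptographic; a real client would
/// use a Cid with a proper multihash instead.
fn toy_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

/// Hypothetical index block: ordered chunk hashes plus the hash of the
/// whole file, so the reassembled bytes can be verified as one unit.
struct IndexBlock {
    whole_file_hash: u64,
    chunk_hashes: Vec<u64>,
}

/// Split `file` into consecutive chunks of at most `max_block_size` bytes.
/// `max_block_size` must be non-zero (`chunks` panics on zero).
fn split_linear(file: &[u8], max_block_size: usize) -> (IndexBlock, Vec<Vec<u8>>) {
    let chunks: Vec<Vec<u8>> = file.chunks(max_block_size).map(|c| c.to_vec()).collect();
    let index = IndexBlock {
        whole_file_hash: toy_hash(file),
        chunk_hashes: chunks.iter().map(|c| toy_hash(c)).collect(),
    };
    (index, chunks)
}

/// Check every chunk against the index, then check the joined bytes
/// against the whole-file hash. Returns `None` on any mismatch.
fn reassemble(index: &IndexBlock, chunks: &[Vec<u8>]) -> Option<Vec<u8>> {
    if chunks.len() != index.chunk_hashes.len() {
        return None;
    }
    let mut out = Vec::new();
    for (chunk, expected) in chunks.iter().zip(&index.chunk_hashes) {
        if toy_hash(chunk) != *expected {
            return None;
        }
        out.extend_from_slice(chunk);
    }
    if toy_hash(&out) == index.whole_file_hash {
        Some(out)
    } else {
        None
    }
}
```

Calling `split_linear(&file, MAX_BLOCK_SIZE)` yields the index plus the chunks to store; `reassemble` is the read path. This is the flat layout the comment argues against, not a balanced MerkleDAG.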

kn0wmad commented 3 years ago

Yeah, needs to be done right and not have any (unreasonable) limits. Thanks

dvc94ch commented 3 years ago

From what I just read, merkledag and trickledag are just fancy names for any DAG that uses hashes as links. The limitation is in bitswap, and it is a soft limit: you can change it by setting the MAX_BLOCK_SIZE of your StoreParams. The reason for this limit is to prevent nodes from sending an infinite stream of random data generated on the fly. It is impossible to distinguish a large block from a denial of service attack.
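A minimal sketch of the idea, using a local stand-in trait rather than the real one: the `StoreParams` trait below is hypothetical and only mirrors the `MAX_BLOCK_SIZE` constant named above (the real trait may carry more associated items). `DefaultParams`, `LargeBlockParams`, `BlockTooLarge` and `check_block` are illustration names as well, and the 4 MiB value is an arbitrary example.

```rust
/// Hypothetical stand-in for a store-parameters trait. Only the
/// MAX_BLOCK_SIZE constant from the discussion is modelled here.
trait StoreParams {
    const MAX_BLOCK_SIZE: usize;
}

/// Parameters with the 1 MiB limit discussed in this issue.
struct DefaultParams;
impl StoreParams for DefaultParams {
    const MAX_BLOCK_SIZE: usize = 1_048_576;
}

/// Parameters with a raised limit (4 MiB is an arbitrary example value).
struct LargeBlockParams;
impl StoreParams for LargeBlockParams {
    const MAX_BLOCK_SIZE: usize = 4 * 1_048_576;
}

#[derive(Debug, PartialEq)]
struct BlockTooLarge {
    len: usize,
    max: usize,
}

/// Reject any block larger than the limit of the chosen parameters.
/// A bounded size is what lets a receiver stop reading from a peer
/// that keeps sending data.
fn check_block<P: StoreParams>(data: &[u8]) -> Result<(), BlockTooLarge> {
    if data.len() > P::MAX_BLOCK_SIZE {
        Err(BlockTooLarge { len: data.len(), max: P::MAX_BLOCK_SIZE })
    } else {
        Ok(())
    }
}
```

With these definitions, a 2 MiB block fails `check_block::<DefaultParams>` and passes `check_block::<LargeBlockParams>`. Peers would presumably need to agree on the limit for a raised value to be useful.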

cryptoquick commented 3 years ago

That's very good to know, @dvc94ch !

I was making a good number of assumptions without fully understanding.

We're going to have to dive into bitswap anyway for our own custom needs, because our strategy to prevent DDoS would hinge upon payments, generally via mining crypto hashes, or micropayments to balance out the ledger.

cryptoquick commented 3 years ago

I just realized, @dvc94ch the issue goes both ways?

Proof of Work wouldn't solve the problem of a peer lying, saying, "yeah, I have that Cid", and then proceeding to blast the network with unwanted traffic... Am I correct in saying this?

dvc94ch commented 3 years ago

Yes, that is correct. Blockchains have a block size limit too; that is what drives up the transaction fees when there are a lot of transactions.

cryptoquick commented 3 years ago

In that case, an even better solution could keep the 1MB block limit for thumbnails and other discovered data whose actual size is not known in advance, and then set the limit to a specific value on a per-call basis for, say, the IpfsEmbed.get() method. The size would then just be communicated in a thumbnail.

cryptoquick commented 3 years ago

@dvc94ch Is there a way to selectively lift the limit by specifying how many bytes I'd expect a client to receive?

dvc94ch commented 3 years ago

That would be an option, which I suggested years ago, but it would require changes to the bitswap protocol. However, it is still reasonable to impose a limit. If your use case is streaming large video files, for example, the only way to ask for a range without trusting the server would be to ask for the sub tree you're interested in. Another disadvantage is that deduplication will be ineffective; depending on your application, that may or may not be relevant. If you are dealing with writable files, appending to a file would be fairly cheap and syncing the diff would happen automatically.

cryptoquick commented 3 years ago

Good point! So, let's say I have this "thumbnail content index" struct that I use to point to a finite number of hashes.

I could even support very large files by pointing to an extension record. For example, if a Blake3 hash is 32 bytes, not counting whatever the Cid multihash encoding adds, then:

1MB / 32 bytes = 32768 "Cids" per index record

32768 * 1MB = 32GB

Those are rough figures, of course. But that should be enough for now.
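The rough figures above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, "1MB" is taken to mean 1 MiB (1,048,576 bytes), and the per-entry size is a parameter because the encoded Cid is somewhat larger than the bare 32-byte hash.

```rust
/// 1 MiB block limit, as discussed in the thread.
const BLOCK_SIZE: u64 = 1024 * 1024;
/// Size of a bare Blake3 digest in bytes.
const HASH_SIZE: u64 = 32;

/// How many fixed-size entries fit into a single index record (one block).
fn entries_per_record(block_size: u64, entry_size: u64) -> u64 {
    block_size / entry_size
}

/// Largest file one index record can address when every entry points
/// to a full-size block.
fn max_file_size(block_size: u64, entry_size: u64) -> u64 {
    entries_per_record(block_size, entry_size) * block_size
}
```

`entries_per_record(BLOCK_SIZE, HASH_SIZE)` gives 32768 and `max_file_size(BLOCK_SIZE, HASH_SIZE)` gives 34,359,738,368 bytes, i.e. 32 GiB. If each encoded Cid took 36 bytes instead (an assumed figure for illustration), one record would hold 29127 entries, so the ceiling would drop to a little over 28 GiB.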

As for content addressability, pretty much any content published in this way would be viewable only by another Fuzzr client, or one that mirrored our implementation. I also like your idea about syncing IPNS-style updatable records.

Also, in looking up how large a Blake3 hash is, check this out:

https://github.com/oconnor663/bao

Hash verified streaming? Could this be a good solution?

dvc94ch commented 3 years ago

Let's move the discussion to riot, bao streaming looks very interesting

cryptoquick commented 3 years ago

Sounds good!

kn0wmad commented 3 years ago

Not sure if I follow entirely. The way I thought about it, a user should be able to publish media of "any" size and get a CID, however, nobody would want to download or pin GBs of garbage. How is the user going to DDOS someone that doesn't subscribe to their content? Sorry my knowledge of the tech here is only basic

cryptoquick commented 3 years ago

@kn0wmad We're continuing the conversation here: https://riot.im/app/#/room/#rust-ipfs:matrix.org/$NxY5LgnC6XpiCxTW6G_qWktEXT2uqwutDqEExFW6dW4

dvc94ch commented 3 years ago

@kn0wmad you're mixing up concepts. You can publish media of "any" size and get a CID by creating a tree of blocks. The CID of that media is the root of the tree. Subscribing to content is completely unrelated.
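To make the "tree of blocks" idea concrete, here is a small sketch that splits a file into blocks, hashes each block, and then hashes pairs of hashes upward until a single root remains. It is an illustration only: `toy_hash` is a non-cryptographic stand-in built on the standard library's `DefaultHasher`, `hash_pair` and `merkle_root` are made-up names, and the binary layout with odd nodes promoted unchanged is an assumption, not how IPLD or ipfs-embed actually lay out a DAG.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for a content hash. NOT cryptographic.
fn toy_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

/// Hash of an interior node: the hash of its two child hashes.
fn hash_pair(left: u64, right: u64) -> u64 {
    let mut bytes = [0u8; 16];
    bytes[..8].copy_from_slice(&left.to_be_bytes());
    bytes[8..].copy_from_slice(&right.to_be_bytes());
    toy_hash(&bytes)
}

/// Split `file` into blocks of at most `block_size` bytes and reduce
/// their hashes pairwise until one root is left. The root plays the
/// role of the media's CID. Returns `None` for an empty file.
/// `block_size` must be non-zero (`chunks` panics on zero).
fn merkle_root(file: &[u8], block_size: usize) -> Option<u64> {
    let mut level: Vec<u64> = file.chunks(block_size).map(toy_hash).collect();
    if level.is_empty() {
        return None;
    }
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [left, right] => hash_pair(*left, *right),
                [single] => *single, // odd node is promoted unchanged
                _ => unreachable!("chunks(2) yields one or two items"),
            })
            .collect();
    }
    Some(level[0])
}
```

No single block exceeds `block_size`, yet the root commits to the whole file: changing any byte changes the root.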

cryptoquick commented 3 years ago

@kn0wmad They do it by lying about the Cid hash in a way that can't be verified by the state of the digest function until after the peer stops sending data, which, if left unbounded, could be approximately never. That's the problem with "block" style hash functions that have to know all the data in advance to produce a valid result. Fortunately, Blake3 seems to support a sort of streaming verifiability; I'm assuming it's similar to a Keccak sponge function in that the digest state is encoded in the hash itself. That's why we're excited about the Bao approach. If they're sending bogus data, they'd immediately be found out, since if what they were sending deviated from the hash function at any point, it'd immediately throw an error.
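The "found out immediately" behaviour can be sketched without Bao itself. In the simplified version below the receiver already holds a trusted list of per-chunk hashes and checks every chunk as it arrives, so a lying peer is cut off at the first bad or oversized chunk. This is not Bao's actual encoding or API: `toy_hash` is a non-cryptographic stand-in, and `StreamError` and `receive_verified` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for a content hash. NOT cryptographic.
fn toy_hash(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

#[derive(Debug, PartialEq)]
enum StreamError {
    /// The peer sent more chunks than the trusted hash list describes.
    UnexpectedChunk { index: usize },
    /// A single chunk exceeded the agreed size limit.
    TooLarge { index: usize },
    /// A chunk did not match its expected hash.
    HashMismatch { index: usize },
    /// The stream ended before all expected chunks arrived.
    Incomplete { received: usize },
}

/// Consume `incoming` chunk by chunk, verifying each one before
/// accepting it. Reading stops at the first failure, so an endless
/// stream of bogus data is never buffered.
fn receive_verified<I>(
    expected: &[u64],
    max_chunk: usize,
    incoming: I,
) -> Result<Vec<u8>, StreamError>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut out = Vec::new();
    let mut received = 0usize;
    for chunk in incoming {
        let index = received;
        let want = expected
            .get(index)
            .copied()
            .ok_or(StreamError::UnexpectedChunk { index })?;
        if chunk.len() > max_chunk {
            return Err(StreamError::TooLarge { index });
        }
        if toy_hash(&chunk) != want {
            return Err(StreamError::HashMismatch { index });
        }
        out.extend_from_slice(&chunk);
        received += 1;
    }
    if received < expected.len() {
        return Err(StreamError::Incomplete { received });
    }
    Ok(out)
}
```

Feeding it `std::iter::repeat(vec![0u8; 4])`, an iterator that never ends, returns an error on the very first chunk instead of reading forever, which is the property that makes hash-verified streaming attractive here.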

FuzzrNet / Fuzzr

Fix 1MB file limit #55