Closed by vadimstruts 6 months ago
I'm not sure the block and CAR approaches are necessarily mutually exclusive. I was interested in using partial CARs (entity scope), but many gateways don't support that. So the plan was to simultaneously request the next block from several gateways and issue a CAR request (rooted at the first missing block). In this model it's kinda the block approach, but sometimes you get a number of blocks all at once.
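To make the mixed mode concrete, here is a rough sketch of racing several requests for the same missing block. Everything in it is an assumption for illustration: `race_for_block` and the zero-argument `fetchers` callables are hypothetical names, each fetcher stands in for either a raw block request (returning one block) or a CAR request (returning several), and blocks are identified by their sha2-256 digest only.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed


def race_for_block(wanted_digest, fetchers):
    """Run every fetcher concurrently and return the blocks from the first
    response that contains the wanted block, keyed by sha2-256 digest.

    Each fetcher is a hypothetical zero-argument callable returning a list of
    block byte strings: one block for a raw request, several for a CAR.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = [pool.submit(fetch) for fetch in fetchers]
        for future in as_completed(futures):
            try:
                blocks = future.result()
            except Exception:
                continue  # a failing gateway simply drops out of the race
            # Hash what actually arrived; never trust the gateway's labels.
            by_digest = {hashlib.sha256(b).digest(): b for b in blocks}
            if wanted_digest in by_digest:
                # Leaving the `with` block still waits for slower fetchers.
                return by_digest
    return {}
```

A CAR response that wins the race hands back extra verified blocks for free, which is the "sometimes getting a number of blocks all at once" part.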
Following today's meeting, I would like to summarize: we described the investigation into possible ways to implement the trustless client using third-party libraries.
`rs-car` was denied, according to Chromium restrictions).
Proposed next steps to follow:
Investigation notes (working notes, draft notes): https://docs.google.com/document/d/1MtKOoLSexGkE3lNMQpyyuDx3-wY6MowmMNQKRZNEtzo/edit?pli=1
> It must provide verification of the blocks, when block data is requested.
>
> without loading the block content to the memory
Verifying it involves hashing the content, and I'm not sure how you'd hash something that's not in memory, so combining these two points makes me think you're anticipating CAR files containing data you're not currently using. And to verify the URL matches the data you'd need a chain of blocks coming all the way from the content root... if you're not already then I'd like to encourage you to consider using a dag-scope that isn't `all`, since some websites can contain an awful lot of data.
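For reference, a small sketch of what a scoped CAR request URL could look like. The `format=car` and `dag-scope` query parameters (with values `block`, `entity`, `all`) are the ones from the trustless gateway spec; the helper name `car_request_url` and the gateway host are made up for the example.

```python
from urllib.parse import quote, urlencode


def car_request_url(gateway, cid, path="", dag_scope="entity"):
    """Build a trustless-gateway CAR request URL (hypothetical helper).

    dag_scope limits how much of the DAG under the final path segment is
    returned: just that block, the whole entity, or everything below it.
    """
    if dag_scope not in ("block", "entity", "all"):
        raise ValueError("unknown dag-scope: " + dag_scope)
    segments = [quote(s) for s in path.split("/") if s]
    url = "/".join([gateway.rstrip("/"), "ipfs", cid, *segments])
    return url + "?" + urlencode({"format": "car", "dag-scope": dag_scope})


# Example with a placeholder gateway host:
url = car_request_url(
    "https://gateway.example",
    "bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y",
    "links.html",
)
```

With `entity`, a request for a single page returns the blocks on the path plus that file, rather than the whole site.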
> define Cid of the start point to walk
I'm guessing from this that by 'walking' a CAR you don't mean iterating the blocks in the order they appear in the archive, necessarily, but pathing through the DAG (for example - is this where you would implement _redirects and symlink handling?).
> cache for retrieved CAR files
Are you planning each cache entry to be a CAR with many blocks? I'm curious what the key would be - how would you determine which CAR files might contain a block you need for your current request's DAG?
Sorry for the delay, I just wanted to check a couple of things.

> Verifying it involves hashing the content, and I'm not sure how you'd hash something that's not in memory

I want to parse the CAR file just by reading the DAG structure and indexing offsets, so that I know where every block starts and ends. I don't plan to verify hashes at that point. Later, when I find the required file in the DAG, I will know where every block is located, so I can read the data for every block of that file and verify it at the same time.
> I'd like to encourage you to consider using a dag-scope

Thanks, will do. It relates to requesting the CAR file; I will work on it a little later.
> I'm guessing from this that by 'walking' a CAR you don't mean iterating the blocks in the order they appear in the archive,

I plan to parse the CAR file first, just to read the DAG structure and learn the offset of every block. After that I can walk the CAR file however I want, or look up whatever I need.
> Are you planning each cache entry to be a CAR with many blocks? I'm curious what the key would be - how would you determine which CAR files might contain a block you need for your current request's DAG?

I think the index of the DAG structure (everything except the file content) that I mentioned above, together with the CAR files themselves, can serve as a cache in the usual sense. So yes, we can find everything there, and verify only the data we extract.
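A rough sketch of that index-first idea, assuming CARv1 framing (a varint-prefixed header followed by varint-prefixed `CID + data` sections). The names `index_car` and `read_verified` are hypothetical, the header is skipped rather than decoded, and verification only handles sha2-256; this is an illustration, not the planned implementation.

```python
import hashlib


def read_varint(buf, pos):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def parse_cid(buf, pos):
    """Return (end_offset, multihash_code, digest) for the binary CID at pos."""
    if buf[pos:pos + 2] != b"\x12\x20":  # not CIDv0, so skip version and codec
        _version, pos = read_varint(buf, pos)
        _codec, pos = read_varint(buf, pos)
    hash_code, pos = read_varint(buf, pos)
    digest_len, pos = read_varint(buf, pos)
    return pos + digest_len, hash_code, bytes(buf[pos:pos + digest_len])


def index_car(car):
    """Map CID bytes -> (data_offset, data_length) without hashing anything."""
    header_len, pos = read_varint(car, 0)
    pos += header_len  # skip the header (roots, version); not decoded here
    index = {}
    while pos < len(car):
        section_len, pos = read_varint(car, pos)
        section_end = pos + section_len
        data_start, _code, _digest = parse_cid(car, pos)
        index[bytes(car[pos:data_start])] = (data_start, section_end - data_start)
        pos = section_end
    return index


def read_verified(car, index, cid):
    """Copy one block out of the CAR and hash it only now, on extraction."""
    offset, length = index[cid]
    data = bytes(car[offset:offset + length])
    _end, hash_code, digest = parse_cid(cid, 0)
    if hash_code != 0x12:  # sha2-256 is the only hash this sketch handles
        raise ValueError("unsupported multihash code")
    if hashlib.sha256(data).digest() != digest:
        raise ValueError("block does not match its CID")
    return data
```

Keying the index by CID is what would let it double as the cache lookup: the question "which CAR holds this block" becomes a dictionary hit.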
| # | Library | Version |
|---|---|---|
| 1 | libipld | 0.16.0 |
| 2 | libipld-core | 0.16.0 |
| 3 | libipld-cbor | 0.16.0 |
| 4 | libipld-pb | 0.16.0 |
| 5 | libipld-json | 0.16.0 |
| 6 | libipld-macro | 0.16.0 |
| 7 | libipld-cbor-derive | 0.16.0 |
| 8 | blake2s_simd | 1.0.2 |
| 9 | cfg-if | 1.0.0 |
| 10 | cid | 0.10.1 |
| 11 | constant_time_eq | 0.3.0 |
| 12 | multihash | 0.18.1 |
| 13 | quick-protobuf | 0.8.1 |
I think I have to question the meaning of `Block::IsRoot()`. Being a root is not really a characteristic of the block itself but rather how the block is being treated. The root I think you'd be most interested in is the content root of the URL you have from the user (or indirectly through a subresource), that is to say the block pointed at by their URL's origin. That's the root you need to know for the sake of IPFS symlinks & _redirects. This may line up with the root you're requesting from a gateway, but it doesn't always have to.
For example, let's say you loaded `ipfs://bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y/links.html`

Perhaps you did this with a CAR request, which gave you:

- `bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y` - the content root, which tells you `links.html` is `bafkreidi53io2tethdu6fcyhxeq6ncluxjxugsppj6er3z4p4slnvstehy` and `old` is `bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy`
- `bafkreidi53io2tethdu6fcyhxeq6ncluxjxugsppj6er3z4p4slnvstehy` - which is the content of `links.html`

The roots here all line up.
Then rendering that page requests a subresource, `ipfs://bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y/old/lily.jpg`

You wouldn't want to do a CAR request rooted at `bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y`, because you already have that block so you wouldn't re-request it. You don't have the block for `old`, so you start there and request `ipfs://bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy/lily.jpg`
Now, to you, the content root is still bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y, but the only root the gateway sees for this request is bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy.
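To illustrate how the request root can drift away from the content root, here is a minimal sketch that walks the path through directory blocks already held and stops at the first one that is missing. `plan_request` and the `known_links` mapping are hypothetical; a real client would get the links by decoding UnixFS directory blocks.

```python
def plan_request(content_root, path, known_links):
    """Return (cid_to_request, remaining_path) for a URL path.

    known_links is a hypothetical map of {directory CID: {entry name: child
    CID}} covering only the directory blocks already verified locally.
    """
    cid = content_root
    segments = [s for s in path.split("/") if s]
    while segments and cid in known_links:
        child = known_links[cid].get(segments[0])
        if child is None:
            raise LookupError("no entry named " + segments[0])
        cid = child
        segments.pop(0)
    return cid, "/".join(segments)


ROOT = "bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y"
LINKS = "bafkreidi53io2tethdu6fcyhxeq6ncluxjxugsppj6er3z4p4slnvstehy"
OLD = "bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy"

known = {ROOT: {"links.html": LINKS, "old": OLD}}
request_cid, rest = plan_request(ROOT, "old/lily.jpg", known)
# request_cid is OLD and rest is "lily.jpg"; the content root is still ROOT.
```

The caller keeps `ROOT` as the content root for symlink and `_redirects` purposes, while only `request_cid` goes to the gateway. When the remaining path is empty the caller still has to check whether it already holds that final block.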
I do acknowledge it's not super likely the same block will appear as a content root in one site and not in another. It's kinda rare for a normal-ish user to have requests for sites that contain other sites and also those other sites. The closest thing to a realistic case I can think of is someone loading a preview page for an NFT that contains an image, and then opening the image itself (raw) in a separate tab (in which case your jpeg or whatever is its own root).
Tangentially, I would encourage you not to let the information about byte offset into a particular CAR file escape the CAR reader, since nothing else should want to know and it could get confusing since its offset will be different when that block appears in different CARs and there will be no offset when it doesn't come from a CAR. And it really just shouldn't be important since you're copying the bytes into a new owning structure anyhow.
Similarly, I think it could be inconvenient down the road to require that the CID appear in contiguous memory with the block content in order to decode it. The CID would be separate if you were doing a raw block request and (I would hope) if you're looking it up in cache. It wouldn't be difficult to have a convenience function for CAR parsing that parses out the CID from a contiguous array of bytes like this and then calls a decode function that takes the CID as a separate parameter.
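Roughly what I have in mind, sketched in Python rather than C++ for brevity. All names here (`Block`, `decode_block`, `decode_car_section`) are hypothetical, only sha2-256 is handled, and the resulting block owns a copy of its bytes and carries no CAR offset.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    cid: bytes   # binary CID
    data: bytes  # owned copy of the payload; no CAR offset is kept


def _varint(buf, pos):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def _multihash(buf):
    """Parse the CID at the start of buf; return (hash_code, digest, cid_len)."""
    pos = 0
    if buf[:2] != b"\x12\x20":  # not CIDv0, so skip version and codec
        _version, pos = _varint(buf, pos)
        _codec, pos = _varint(buf, pos)
    code, pos = _varint(buf, pos)
    length, pos = _varint(buf, pos)
    return code, bytes(buf[pos:pos + length]), pos + length


def decode_block(cid, data):
    """Core entry point: the CID arrives separately from the payload."""
    code, digest, _ = _multihash(cid)
    if code != 0x12 or hashlib.sha256(data).digest() != digest:
        raise ValueError("block does not verify against its CID")
    return Block(bytes(cid), bytes(data))


def decode_car_section(section):
    """CAR-only convenience: split CID from contiguous bytes, then delegate."""
    _code, _digest, cid_len = _multihash(section)
    return decode_block(section[:cid_len], section[cid_len:])
```

Raw block responses and cache hits would call `decode_block` directly; only the CAR reader needs `decode_car_section`.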
- Finished preparing a UML-like diagram for the C++ part (details may change, and it can be extended later)
- Started work on the CAR file requester; supporting the mixed mode (CAR + RAW blocks) at the same time is possible
- Working on combining and decoding blocks
- Branch: https://github.com/brave/brave-core/tree/ipfs-trustless-client
Description
When Brave is configured to use a public gateway, enforce checking IPFS hashes automatically. Currently there are several directions for the IPFS Trustless client implementation:
Additional information from the discussion, along with analysis details for each of the approaches, has been added to a separate Google doc that is open for comments: https://docs.google.com/document/d/1qkWksiSTim24sxUHraNJzHKA-_Pu8mMF-JaEvKMoIlQ/edit?usp=sharing