Brave IPFS Trustless client

vadimstruts commented 11 months ago

Description

When Brave is configured to use a public gateway, it should verify IPFS content hashes automatically. There are currently several possible directions for the IPFS Trustless client implementation:

Using a local node configured as an offline node.
Using the block approach (downloading blocks from the gateway and traversing the graph).
Using the CAR approach (downloading the whole DAG as a CAR file from the gateway and traversing the graph).
Possibly using a third-party tool to validate the CAR file.

Additional information from the discussion, along with analysis details for each approach, has been added to a separate Google doc that is open for comments: https://docs.google.com/document/d/1qkWksiSTim24sxUHraNJzHKA-_Pu8mMF-JaEvKMoIlQ/edit?usp=sharing

John-LittleBearLabs commented 11 months ago

I'm not sure block & CAR approaches are necessarily mutually exclusive. I was interested in using partial CARs (entity scope), but many gateways don't support that. So the plan was to simultaneously request the next block from several gateways, and a CAR request (rooted at the first missing block). In this model it's kinda block approach, but sometimes getting a number of blocks all at once.
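
To make the mixed plan above concrete, here is a minimal C++ sketch of the request URLs it implies. It assumes the trustless gateway query parameters `format=raw`, `format=car` and `dag-scope`; the function names, the `PlanRequests` helper and the example gateway hosts are hypothetical and not taken from the Brave branch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// dag-scope values a trustless gateway may accept for CAR responses.
enum class DagScope { kBlock, kEntity, kAll };

std::string DagScopeName(DagScope scope) {
  switch (scope) {
    case DagScope::kBlock: return "block";
    case DagScope::kEntity: return "entity";
    case DagScope::kAll: return "all";
  }
  return "all";
}

// URL for one raw block. `gateway` is a base URL without a trailing slash.
std::string RawBlockUrl(const std::string& gateway, const std::string& cid) {
  return gateway + "/ipfs/" + cid + "?format=raw";
}

// URL for a CAR rooted at `cid`; `path` is empty or starts with '/'.
std::string CarUrl(const std::string& gateway, const std::string& cid,
                   const std::string& path, DagScope scope) {
  return gateway + "/ipfs/" + cid + path +
         "?format=car&dag-scope=" + DagScopeName(scope);
}

// Hypothetical planner: ask every gateway for the first missing block and,
// in parallel, ask one gateway for an entity-scoped CAR rooted at that block.
std::vector<std::string> PlanRequests(const std::vector<std::string>& gateways,
                                      const std::string& missing_cid) {
  std::vector<std::string> urls;
  for (const auto& gateway : gateways) {
    urls.push_back(RawBlockUrl(gateway, missing_cid));
  }
  if (!gateways.empty()) {
    urls.push_back(CarUrl(gateways.front(), missing_cid, "", DagScope::kEntity));
  }
  return urls;
}
```

Whichever response arrives first supplies the block; a CAR response may simply deliver several blocks at once.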

vadimstruts commented 10 months ago

To summarize today's meeting: we described our investigation of the possible ways to implement the Trustless client using third-party libraries.

The preferred way is to use the Rust IPLD library: https://github.com/ipld/libipld/tree/master/macro
Only the required functions should be bridged to C++.
Downloading the CAR file, walking through its blocks, content verification, extraction, etc. should be implemented on the C++ side. (According to the review, using the Rust library rs-car in the first version of the trustless client implementation was rejected because of Chromium restrictions.)

Proposed next steps:

Add the Rust IPLD library (and all its dependencies) to Brave third party.
Provide a set of C++ helper functions based on the Rust IPLD library, for example interfaces for codecs.
Intercept redirections to the IPFS gateway link and implement requesting the CAR file from the IPFS gateways. We also need a cache for retrieved CAR files.
Implement a block orchestrator for a CAR file on the C++ side. It should walk the CAR without loading the block content into memory, with the ability to define the CID of the start point of the walk. It must verify blocks when block data is requested.
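
The indexing idea in the last step could be sketched roughly as follows. This is a minimal sketch, assuming a CARv1 stream (varint header length, header, then sections of varint length + CID + payload); it skips the header instead of decoding it and seeks past every payload. `BlockLocation`, `ReadVarint`, `ReadCid` and `IndexCar` are hypothetical names, and the 128-byte digest cap is an assumed sanity limit.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical index entry: where one block payload lives in the stream.
struct BlockLocation {
  std::string cid_bytes;  // binary CID
  uint64_t data_offset;   // absolute offset of the payload
  uint64_t data_size;     // payload length in bytes
};

// Reads an unsigned LEB128 varint; nullopt on EOF or overflow.
std::optional<uint64_t> ReadVarint(std::istream& in) {
  uint64_t value = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    const int byte = in.get();
    if (!in) return std::nullopt;
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) return value;
  }
  return std::nullopt;
}

// Reads one binary CID (v0 or v1) and returns its raw bytes.
std::optional<std::string> ReadCid(std::istream& in) {
  const std::streampos start = in.tellg();
  const int first = in.get();
  if (!in) return std::nullopt;
  const int second = in.peek();
  uint64_t digest_size = 0;
  if (first == 0x12 && second == 0x20) {
    in.get();  // CIDv0 is a bare sha2-256 multihash: 0x12 0x20 + 32 bytes
    digest_size = 32;
  } else {
    in.seekg(start);
    // CIDv1 fields: version, codec, hash code, digest size (kept last).
    for (int field = 0; field < 4; ++field) {
      const auto v = ReadVarint(in);
      if (!v) return std::nullopt;
      digest_size = *v;
    }
  }
  if (digest_size > 128) return std::nullopt;  // assumed sanity limit
  in.seekg(static_cast<std::streamoff>(digest_size), std::ios::cur);
  const std::streampos end = in.tellg();
  if (!in) return std::nullopt;
  std::string cid(static_cast<std::size_t>(end - start), '\0');
  in.seekg(start);
  in.read(cid.data(), static_cast<std::streamsize>(cid.size()));
  if (!in) return std::nullopt;
  return cid;
}

// Builds the index by reading only lengths and CIDs; payloads are skipped.
std::optional<std::vector<BlockLocation>> IndexCar(std::istream& in) {
  const auto header_size = ReadVarint(in);
  if (!header_size) return std::nullopt;
  in.seekg(static_cast<std::streamoff>(*header_size), std::ios::cur);
  if (!in) return std::nullopt;
  std::vector<BlockLocation> index;
  while (in.peek() != std::char_traits<char>::eof()) {
    const auto section_size = ReadVarint(in);
    if (!section_size) return std::nullopt;
    const std::streamoff section_start = in.tellg();
    auto cid = ReadCid(in);
    if (!cid || cid->size() > *section_size) return std::nullopt;
    BlockLocation loc{std::move(*cid), 0, 0};
    loc.data_offset = static_cast<uint64_t>(section_start) + loc.cid_bytes.size();
    loc.data_size = *section_size - loc.cid_bytes.size();
    in.seekg(static_cast<std::streamoff>(loc.data_offset + loc.data_size),
             std::ios::beg);
    if (!in) return std::nullopt;
    index.push_back(std::move(loc));
  }
  return index;
}
```

No hashing happens here; an index like this only says where each payload is, so verification still has to occur when the payload is actually read.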

Investigation notes (working notes, draft notes): https://docs.google.com/document/d/1MtKOoLSexGkE3lNMQpyyuDx3-wY6MowmMNQKRZNEtzo/edit?pli=1

John-LittleBearLabs commented 10 months ago

It must provide verification of the blocks, when block data is requested.

without loading the block content to the memory

Verifying it involves hashing the content, and I'm not sure how you'd hash something that's not in memory, so combining these two points makes me think you're anticipating CAR files containing data you're not currently using. And to verify the URL matches the data you'd need a chain of blocks coming all the way from the content root... if you're not already then I'd like to encourage you to consider using a dag-scope that isn't all since some websites can contain an awful lot of data.

define Cid of the start point to walk

I'm guessing from this that by 'walking' a CAR you don't mean iterating the blocks in the order they appear in the archive, necessarily, but pathing through the DAG (for example - is this where you would implement _redirects and symlink handling?).

cache for retrieved CAR files

Are you planning each cache entry to be a CAR with many blocks? I'm curious what the key would be - how would you determine which CAR files might contain a block you need for your current request's DAG?

vadimstruts commented 10 months ago

Sorry for the delay, I just wanted to check a couple of things.

Verifying it involves hashing the content, and I'm not sure how you'd hash something that's not in memory

I want to parse the CAR file by reading only the DAG structure and indexing offsets, so that I know where every block starts and ends. I don't plan to verify hashes at that stage. Later, when I find the required file in the DAG, I will know where each of its blocks is located, so I can request the data for each block of that file and verify it at the same time.
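
A minimal C++ sketch of this "verify only when the data is requested" step might look like the following. It assumes the payload offset, size and expected digest are already known from an earlier indexing pass. `ReadVerifiedBlock`, `HashFn` and `kMaxBlockSize` are hypothetical; the hash function is injected by the caller (in practice it would be the digest named by the CID's multihash, such as SHA-256), and the 2 MiB cap is an assumed sanity limit.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Caller-supplied digest function (hypothetical seam for the real hasher).
using HashFn = std::function<std::string(const std::string&)>;

// Assumed upper bound on a single block payload.
constexpr uint64_t kMaxBlockSize = 2 * 1024 * 1024;

// Loads one payload from the CAR stream only when it is requested, hashes
// it, and returns the bytes only if the digest matches the expected one.
std::optional<std::string> ReadVerifiedBlock(std::istream& car,
                                             uint64_t offset,
                                             uint64_t size,
                                             const std::string& expected_digest,
                                             const HashFn& hash) {
  if (size > kMaxBlockSize) return std::nullopt;
  std::string payload(static_cast<std::size_t>(size), '\0');
  car.clear();  // drop any EOF state left by an earlier read
  car.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
  car.read(payload.data(), static_cast<std::streamsize>(size));
  if (!car) return std::nullopt;                           // truncated CAR
  if (hash(payload) != expected_digest) return std::nullopt;  // bad content
  return payload;
}
```

Only the requested block is ever held in memory, and unverified bytes never leave the function.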

I'd like to encourage you to consider using a dag-scope

Thanks, will do. This relates to requesting the CAR file; I will work on it a little later.

I'm guessing from this that by 'walking' a CAR you don't mean iterating the blocks in the order they appear in the archive,

I plan to parse the CAR file first, just to read the DAG structure and learn the offsets of every block. After that I can walk the CAR file however I want, or find whatever I need.

Are you planning each cache entry to be a CAR with many blocks? I'm curious what the key would be - how would you determine which CAR files might contain a block you need for your current request's DAG?

I think the index of the DAG structure (everything except the file content) that I mentioned above, together with the CAR files themselves, can serve as a cache in the usual sense. That means yes, we can find everything there and verify only the data we extract.

vadimstruts commented 9 months ago

Short update:

Added the Rust IPLD library to Brave third party. Here is the list of the library and its dependencies:

#	Library	Version
1	libipld	0.16.0
2	libipld-core	0.16.0
3	libipld-cbor	0.16.0
4	libipld-pb	0.16.0
5	libipld-json	0.16.0
6	libipld-macro	0.16.0
7	libipld-cbor-derive	0.16.0
8	blake2s_simd	1.0.2
9	cfg-if	1.0.0
10	cid	0.10.1
11	constant_time_eq	0.3.0
12	multihash	0.18.1
13	quick-protobuf	0.8.1

Created several functions for extracting information from CAR file blocks.
Started designing the C++ part (request interception and block orchestration); implementation will follow.

vadimstruts commented 9 months ago

Finished preparing a UML-like diagram for the C++ part (details may change or be extended later).
Started work on the CAR file requester; support for the mixed mode (CAR + raw blocks) is possible at the same time.
Working on combining and decoding blocks.
Branch: https://github.com/brave/brave-core/tree/ipfs-trustless-client

John-LittleBearLabs commented 8 months ago

I think I have to question the meaning of Block::IsRoot(). Being a root is not really a characteristic of the block itself but rather how the block is being treated. The root I think you'd be most interested in is the content root of the URL you have from the user (or indirectly through a subresource), that is to say the block pointed at by their URL's origin. That's the root you need to know for the sake of IPFS symlinks & _redirects. This may line up with the root you're requesting from a gateway, but it doesn't always have to.

For example, let's say you loaded

ipfs://bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y/links.html

Perhaps you did this with a CAR request, which gave you

bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y - the content root, which tells you links.html is bafkreidi53io2tethdu6fcyhxeq6ncluxjxugsppj6er3z4p4slnvstehy and old is bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy
bafkreidi53io2tethdu6fcyhxeq6ncluxjxugsppj6er3z4p4slnvstehy which is the content of links.html

The roots here all line up.

Then rendering that page requests a subresource ipfs://bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y/old/lily.jpg

You wouldn't want to do a CAR request rooted at bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y, because you already have that block so you wouldn't re-request it. You don't have the block for old, so you start there and request

ipfs://bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy/lily.jpg

Now, to you, the content root is still bafybeifpdohhfv34hvepjalasz7luudx62wynzfwawxipfe4ixfxakfp7y, but the only root the gateway sees for this request is bafybeiehttj335tokpsdam3h7igzxlill7zndlhfovktf7pcq7gqg4g3uy.
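
The distinction between the content root and the root a gateway sees can be sketched in C++ as below. This is only an illustration under assumptions: `LinkTable`, `GatewayRequest` and `PlanRequest` are hypothetical names, and the table is assumed to hold links learned from blocks that were already fetched and verified.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Links learned from verified directory blocks:
// directory CID -> (entry name -> child CID).
using LinkTable = std::map<std::string, std::map<std::string, std::string>>;

struct GatewayRequest {
  std::string root_cid;        // the only root the gateway sees
  std::string remaining_path;  // path below that root, "" if fully resolved
};

// Walks the URL path from the content root through already-known links and
// stops at the first block we do not have. The content root itself does not
// change; only the request root moves down the DAG.
GatewayRequest PlanRequest(const std::string& content_root,
                           const std::vector<std::string>& path,
                           const LinkTable& known) {
  std::string current = content_root;
  std::size_t i = 0;
  for (; i < path.size(); ++i) {
    const auto dir = known.find(current);
    if (dir == known.end()) break;  // block for `current` is missing
    const auto entry = dir->second.find(path[i]);
    // A real client would treat a missing entry in a known directory as
    // not-found (or consult _redirects); this sketch simply stops here.
    if (entry == dir->second.end()) break;
    current = entry->second;
  }
  std::string rest;
  for (std::size_t j = i; j < path.size(); ++j) {
    if (!rest.empty()) rest += '/';
    rest += path[j];
  }
  return {current, rest};
}
```

With the links from the example above in the table, a request for `old/lily.jpg` under the content root is planned as a request rooted at the CID of `old` with the remaining path `lily.jpg`.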

I do acknowledge it's not super likely the same block will appear as a content root in one site and not in another. It's kinda rare for a normal-ish user to have requests for sites that contain other sites and also those other sites. The closest thing to a realistic case I can think of is someone loading a preview page for an NFT that contains an image, and then opening the image itself (raw) in a separate tab (in which case your jpeg or whatever is its own root).

Tangentially, I would encourage you not to let the information about byte offset into a particular CAR file escape the CAR reader, since nothing else should want to know and it could get confusing since its offset will be different when that block appears in different CARs and there will be no offset when it doesn't come from a CAR. And it really just shouldn't be important since you're copying the bytes into a new owning structure anyhow.
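
One way to picture this is a block cache keyed by CID alone, sketched below in C++. `BlockCache` is a hypothetical class, not something from the Brave branch; it assumes blocks are verified before they are stored and that each entry owns a copy of the bytes.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical cache of verified blocks. The key is the CID and nothing
// else: an entry does not record which CAR (if any) the block came from
// or at what byte offset it was found.
class BlockCache {
 public:
  // Stores an owning copy of the block bytes under its CID.
  void Put(std::string cid, std::string bytes) {
    blocks_.insert_or_assign(std::move(cid), std::move(bytes));
  }

  bool Has(const std::string& cid) const { return blocks_.count(cid) != 0; }

  std::optional<std::string> Get(const std::string& cid) const {
    const auto it = blocks_.find(cid);
    if (it == blocks_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, std::string> blocks_;
};
```

A CAR reader and a raw-block fetcher would both end in the same `Put` call, so a consumer cannot tell, and does not need to know, how a block arrived.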

Similarly, I think it could be inconvenient down the road to require that the CID appear in contiguous memory with the block content in order to decode it. The CID would be separate if you were doing a raw block request and (I would hope) if you're looking it up in cache. It wouldn't be difficult to have a convenience function for CAR parsing that parses out the CID from a contiguous array of bytes like this and then calls a decode function that takes the CID as a separate parameter.
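
The split suggested here could look roughly like this in C++. It is a sketch under assumptions: `DecodeBlock` takes the CID and payload as separate parameters and only classifies the codec (real decoding would go through the bridged IPLD codecs), while `DecodeCarSection` is the convenience wrapper for contiguous CAR bytes. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

struct ParsedCid {
  uint64_t codec;      // multicodec code of the content
  std::size_t length;  // total CID length in bytes
};

// Reads an unsigned varint from `bytes` at `pos`, advancing `pos`.
std::optional<uint64_t> ReadVarint(const std::string& bytes, std::size_t& pos) {
  uint64_t value = 0;
  for (int shift = 0; shift < 64 && pos < bytes.size(); shift += 7) {
    const auto byte = static_cast<unsigned char>(bytes[pos++]);
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) return value;
  }
  return std::nullopt;
}

// Parses the binary CID at the start of `bytes` (v0 or v1).
std::optional<ParsedCid> ParseCid(const std::string& bytes) {
  if (bytes.size() >= 34 && bytes[0] == 0x12 && bytes[1] == 0x20) {
    return ParsedCid{0x70, 34};  // CIDv0 implies dag-pb + sha2-256
  }
  std::size_t pos = 0;
  const auto version = ReadVarint(bytes, pos);
  const auto codec = ReadVarint(bytes, pos);
  const auto hash_code = ReadVarint(bytes, pos);
  const auto digest_size = ReadVarint(bytes, pos);
  if (!version || *version != 1 || !codec || !hash_code || !digest_size) {
    return std::nullopt;
  }
  if (*digest_size > bytes.size() - pos) return std::nullopt;
  return ParsedCid{*codec, pos + static_cast<std::size_t>(*digest_size)};
}

enum class BlockKind { kRaw, kDagPb, kDagCbor, kUnsupported };

struct DecodedBlock {
  BlockKind kind;
  std::string payload;
};

// Decode entry point: works the same for CAR, raw-block and cache sources.
std::optional<DecodedBlock> DecodeBlock(const std::string& cid,
                                        const std::string& payload) {
  const auto parsed = ParseCid(cid);
  if (!parsed || parsed->length != cid.size()) return std::nullopt;
  BlockKind kind = BlockKind::kUnsupported;
  switch (parsed->codec) {
    case 0x55: kind = BlockKind::kRaw; break;
    case 0x70: kind = BlockKind::kDagPb; break;
    case 0x71: kind = BlockKind::kDagCbor; break;
    default: break;
  }
  return DecodedBlock{kind, payload};
}

// Convenience for CAR parsing: `section` is CID bytes followed by payload.
std::optional<DecodedBlock> DecodeCarSection(const std::string& section) {
  const auto parsed = ParseCid(section);
  if (!parsed) return std::nullopt;
  return DecodeBlock(section.substr(0, parsed->length),
                     section.substr(parsed->length));
}
```

Only the CAR wrapper knows that the CID and payload happened to be adjacent in memory.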


brave / brave-browser

Brave IPFS Trustless client #34840
