Open mitchmindtree opened 1 year ago
I think it would be worth adding a section on how you will verify the integrity of files. There would need to be some way to know that the hashes of the files (it seems IPFS hash != content hash) are correct before you fetch them. Then you would need to verify that each file matches its content hash before running any code, and delete it if it doesn't match.
There would need to be some way to know hashes of the files (It seems IPFS hash != content hash) are correct before you fetch the files
As far as I understand, an IPFS content address is a hash of the content and IPFS does verify the integrity of the fetched content by checking the hash.
I guess it would depend on the client you use?
True, it'll be good to keep in mind that whatever approach we take, we should ensure that the integrity of fetched content is actually being checked. If it turns out our client of choice does not, I don't imagine it should be too tricky to add an extra step that uses ipfs to check the content address ourselves.
I would like to hope all clients would do this by default considering the P2P nature of IPFS, but you're right we shouldn't assume :smiling_face_with_tear:
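The extra verification step mentioned above could be sketched as follows; a minimal model where the actual digest function is passed in as a closure, since the real hash depends on the chosen IPFS client and CID version (the toy digest below is illustrative only, not a real hash):

```rust
/// Verify fetched content against its expected digest before use,
/// discarding it on mismatch. The `digest` closure stands in for
/// whatever hash the chosen IPFS client / CID version actually uses.
fn verify_or_discard(
    content: Vec<u8>,
    expected: &[u8],
    digest: impl Fn(&[u8]) -> Vec<u8>,
) -> Option<Vec<u8>> {
    if digest(&content) == expected {
        Some(content)
    } else {
        None // caller deletes the fetched bytes
    }
}

fn main() {
    // Toy digest for demonstration only: a wrapping byte sum, NOT a real hash.
    let toy = |b: &[u8]| vec![b.iter().fold(0u8, |a, x| a.wrapping_add(*x))];
    let data = b"package bytes".to_vec();
    let expected = toy(&data);
    assert!(verify_or_discard(data.clone(), &expected, toy).is_some());
    assert!(verify_or_discard(data, &[0xFF], toy).is_none());
}
```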
First of all the proposal looks incredibly cool, very very excited for this 🎩
Generally, NFT-based name services also allow the transfer of ownership over names, and it’s not clear how we’d handle the case where the original publisher’s name has been transferred in a manner that doesn't hurt forc's reproducibility.
I am curious about whether the transferability of names would create an attack vector in this case. In a scenario where there are ultra-popular libraries published by publisher_a
and the name gets transferred to a malicious 3rd party, they can inject any code they want and publish a patch. So a malicious dependency can be introduced into lots of packages. Since removing packages is not possible, in the case of malicious activity it would require some kind of social coordination so that any contract that depends on the malicious patch is not deployed. I wonder if this is something to consider, and if so, what we should do about it 🤔
I am curious about whether the transferability of names would create an attack vector in this case.
It's worth keeping in mind we already face this issue with cargo today! E.g. cargo allows for transferring ownership of a crate, and it's really up to downstream maintainers (or auditing tools) to keep track of changes between updates.
As you've pointed out, the key difference is that the crates.io team can and do occasionally remove crates once identified as malicious, whereas we would be unable to remove malicious packages. That said, it's unlikely the crates.io team are able to keep track of all of the malicious packages that are out there, and their approach to removing malicious packages is more of a best-effort attempt.
We can possibly improve the experience in `forc` by highlighting transfers of ownership during updates to the set of pinned packages (i.e. when the `Forc.lock` is updated during `forc update`). Rather than only showing the version change for each dependency, perhaps we could also show the number of new authors that have gained publish rights since the previous versions? This might help to highlight when close attention should be paid to a version change.
In the long run we could also consider hosting some security advisory automation, similar to rustsec, that tracks known vulnerabilities or issues in the ecosystem. We could potentially integrate something like this into `forc` itself, where warnings are raised during the pinning of known-bad packages.
Ultimately, I think it's up to Sway devs and publishers to be responsible for vetting their own dependencies and dependency updates before deployment, but we should do all we can to make this easier :+1:
Maybe not needed, but this issue reminded me of https://ipld.io/ (a hash-based cross-protocol data format from the same team as IPFS). Lots to digest here, but I love the overall idea. I need to consider the whole NFT idea a bit more carefully than my initial read-through, but I like that it leverages blockchain-native tech.
Couple of initial thoughts on the implementation:
I created the initial design diagram attached here; a summary follows below.
While publishing:
While fetching:
The block size N used for partitioning can be tuned depending on the overhead. If the overhead turns out to be significant, we can increase it, which would effectively mean publishing most files as a single block, since they are probably bytes rather than kilobytes in size.
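The partitioning step above can be sketched as follows; a minimal example where `partition` and the fixed block size are illustrative stand-ins for whatever chunking the IPFS layer actually performs:

```rust
/// Split a file's bytes into fixed-size blocks, each of which would be
/// published to IPFS separately and referenced by its own CID.
/// `block_size` stands in for the tunable partition size discussed above.
fn partition(bytes: &[u8], block_size: usize) -> Vec<Vec<u8>> {
    bytes
        .chunks(block_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // A small source file fits in a single block...
    let small = vec![0u8; 300];
    assert_eq!(partition(&small, 1024).len(), 1);

    // ...while a larger one is split into N blocks (ids 0..N in the metadata).
    let large = vec![0u8; 2500];
    let blocks = partition(&large, 1024);
    assert_eq!(blocks.len(), 3);
    assert_eq!(blocks[2].len(), 452);
    println!("blocks: {}", blocks.len());
}
```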
And an example metadata file can be like:
```json
{
  "name": "package_name",
  "type": "folder",
  "child": [
    {
      "name": "src",
      "type": "folder",
      "child": [
        { "name": "main.sw", "type": "file", "id": 0, "CID": "XXX" },
        { "name": "main.sw", "type": "file", "id": 1, "CID": "XXX" },
        { "name": "main.sw", "type": "file", "id": "N", "CID": "XXX" }
      ]
    },
    { "name": "forc.toml", "type": "file", "id": 0, "CID": "XXX" },
    { "name": "forc.toml", "type": "file", "id": 1, "CID": "XXX" },
    { "name": "forc.toml", "type": "file", "id": "N", "CID": "XXX" },
    { "name": "forc.lock", "type": "file", "id": 0, "CID": "XXX" },
    { "name": "forc.lock", "type": "file", "id": 1, "CID": "XXX" },
    { "name": "forc.lock", "type": "file", "id": "N", "CID": "XXX" }
  ]
}
```
In the last meeting we had about this design, we mentioned it might be good to be compatible with the go-ipfs implementation. I played around with `ipfs-embed` today, trying to get it to fetch a single file from a `go-ipfs` node with its `compat` feature; the related issue is here. I haven't succeeded yet. From what I understand, the main design philosophy of `ipfs-embed` is a little different from its Go counterpart: `ipfs-embed` seems more geared towards networks of devices with a specific purpose rather than a general network like the Go implementation. This does not conflict with our design, as it is entirely possible for our package hosting network to be a separate network. I am still learning the details of IPFS, but from what I understand so far, we might need a bootstrap node: a node that every node can visit at startup to learn about possible peers in the network. If we go this way, we may need to host the bootstrap node ourselves. If we can get `ipfs-embed` to work nicely with the go-ipfs implementation, we could use the default bootstrap nodes they ship with.
I also looked around the space to see if there is an out-of-the-box compatible implementation that we can use as a drop-in replacement. As we also discussed in the previous meeting, iroh was a contender, but they decided to take a different path and break compatibility with the Go implementation. Before starting work on their new implementation, they left the old version as a renamed library, beetle, which they state is compatible with the go-ipfs implementation to a degree.
I think we first need to be able to fetch and publish a single file in a basic setup, and on top of that we should settle on the folder structure. We could push to use unix-fs etc. as well. I started the implementation by designing that, but it looks like I need to get the basics done first and actually achieve publishing a single piece of data and getting it back on another instance 😄
Leaving a note here that the registry design could be made more stateless using a log-based approach if we more heavily leverage indexing.
The following is a proposed design around decentralised package hosting and registration for `forc`.

Goal
The aim is to provide a user experience akin to Cargo’s package registry in a manner that avoids some of the issues associated with its centralised approach to hosting, registering and indexing packages.
Rust’s Approach
In Rust, users can easily publish packages with `cargo publish`, and depend on packages with simple semver declarations, e.g. `serde = "1.0"`.

By default, packages are published to and fetched from crates.io. Crates.io acts as a canonical registry (the official record of all packages), host (src code and manifest for each version) and index (allows searching / exploring packages and discoverability).
The caveat with this design is that when crates.io is down for whatever reason, projects cannot be published or fetched by Rust users. It also means that crates.io is responsible for the content that it hosts, and must respond to licensing challenges, moderation requests, name-sitters, and so on. Cargo users must trust whoever happens to have control over crates.io and their ability to secure it from malicious actors.
Proposed Approach
This design aims to improve upon Rust’s approach by avoiding the use of a central authority to handle package registration and hosting. Instead, we opt for delegating the registry and hosting roles to suitable decentralised tooling.
At a high level, the intent is to:

- host package content (metadata and source) on IPFS, and
- use contracts on the Fuel network as the package registry.
Package indexing could be built upon these two components, but in order to narrow the scope of this work, implementation details around package indexing (a website for searchability, discoverability, etc.) are omitted.
Package Hosting - IPFS
The core appeal of using IPFS is that hosted content is content addressed. This means that, only knowing the content address of a package (i.e. the SHA256 of the package contents), we can reliably fetch exactly that package, as long as that package is hosted by an IPFS node.
To ensure that at least one node always hosts all packages, we should host an IPFS node under Fuel Labs. This node would simply query/monitor the registry for all known packages, and ensure that all are pinned and available from at least one location on the network. This ensures that, at a minimum, we always offer at least the level of persistence offered by crates.io.
With IPFS as a foundation, there are many options for reducing the bus factor of running one node. E.g. we could operate a second node on a separate platform, use Filecoin to incentivise 3rd-party hosting, clearly document how users can run their own nodes and pin all packages to support the network, and so on. Organisations with a large investment in Fuel would be naturally incentivised to run their own package-pinning node(s) to help improve the robustness of the ecosystem.
What about Arweave?
Arweave similarly offers content-addressed hosting with greater guarantees around permanent content availability. However, using Arweave would require a little extra indirection for package publishers, namely that publishing would require having an Arweave wallet and some AR to pay for the hosting. The IPFS network, on the other hand, is free to participate in and “publish” to, allowing us to offer a smoother publishing experience, with the trade-off that we’ll need to guarantee content availability on our own, as mentioned above.
The registry design below supports different kinds of content addressing in case we decide to take advantage of an alternative approach to content addressed hosting like Arweave in the future.
Package Registry - Fuel Contracts
The role of the package registry is to act as a source of truth w.r.t. what packages are available, what versions are available, who owns packages, and who is allowed to publish new versions. The Fuel network’s immutability and persistence make it well suited to hosting this kind of info.
On-Chain Package Data
Data published to the Fuel network for a single version of a single package should roughly include the following:
- `[u32; 3]`
- `u32`
- `b256`
- `b256`
The idea behind separating the metadata and source is to allow indexers to trivially fetch high-level metadata for a package without needing to fetch the entire source code. We can impose conservative upper limits on the size of the metadata so that package indexers can maintain predictable performance, and safely ignore potentially malicious packages with sizes that exceed this limit.
Our Package Metadata standard should roughly include the package name, semver, license, source CA, description, homepage and repository. Package Source should include the manifest, lock file and all source code.
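As a rough illustration of that standard (field names and values here are assumptions, not a settled spec), a metadata record might look like:

```json
{
  "name": "foo",
  "version": "1.2.3",
  "license": "Apache-2.0",
  "description": "An example package.",
  "homepage": "https://example.com/foo",
  "repository": "https://example.com/foo.git",
  "source": "<content address of the package source>"
}
```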
Packages as NFTs
In order to allow for transfer of ownership and the approval of multiple authors for a single package, we can adopt the NFT standard.
Specifically, we could provide an `abi Package: NFT { ... }`, where “publishing” the first version of a new package is akin to implementing `Package` for a contract, deploying it and minting the first version. The owner (and approved authors) can “mint” subsequent versions of the package as unique tokens. The `Package` abi would abstract over minting to associate the on-chain package data mentioned above with each version, and to impose guarantees such as ensuring newly published versions are greater than old versions. The `Package` abi would expose a method for retrieving the package name, as well as a set of “yanked” versions so that authors can signal when certain versions were mistakes or should be ignored.

Adopting the NFT standard allows us to take advantage of NFT-aware wallet functionality, so that Sway devs can view and transfer their packages using a familiar interface without the need for us to develop a custom solution for these actions. For better or worse, it also enables packages to show up in NFT marketplaces, directly exposing packages and their owners to the markets.
Deploying a unique contract per package also reflects a parallelism-friendly design for the VM, as users would be able to publish and read different packages simultaneously without interference (which would not be possible using a single contract to manage all versions of all packages).
Package Namespacing?
Cargo’s flat package namespace is simple, inspires creative package names, and allows for succinct dependency declarations (e.g. `foo = "1"`). However, it opens up crates.io to the problem of name-squatting.

Crates.io’s policy on name squatting is that crates will never be removed or transferred due to squatting. However, they reserve the right to remove packages for legal reasons, or if a package violates the Rust code of conduct.
In our case, we have no ability to remove or transfer names, even if we would like to. To avoid issues associated with a flat namespace altogether, we can consider requiring that registry dependency declarations also specify the original publisher. However, when we only know the original publisher's pubkey, declarations can appear unwieldy, e.g.
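One illustrative shape for such a declaration (the `publisher` field name and the address are made up for illustration, not a settled format):

```toml
foo = { version = "1", publisher = "0x3f9a…" }
```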
This could be improved with support for an ENS-like name service for Fuel (like fuel nomen). E.g.
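With a registered name standing in for the raw pubkey, the same declaration could read (again purely illustrative, reusing the `publisher_a` name from the discussion above):

```toml
foo = { version = "1", publisher = "publisher_a" }
```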
However, registering a name adds an extra barrier to entry for publishers. Generally, NFT-based name services also allow the transfer of ownership over names, and it’s not clear how we’d handle the case where the original publisher’s name has been transferred in a manner that doesn't hurt forc's reproducibility. The result is also still more verbose than the original declaration that can be achieved when omitting namespacing:
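For comparison, the un-namespaced, flat-namespace form is simply:

```toml
foo = "1"
```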
Seeing as deploying and registering package contracts will cost gas, this at least partially disincentivises the arbitrary harvesting and squatting of names compared to crates.io. However, we may also want to consider some approach to “rate-limiting” package registration to further disincentivise mass name harvesting in a single block.
Whether we go for a flat or publisher-namespaced approach is left as an open question; however, for simplicity the remainder of this design assumes a flat namespace.
Deterministic Package Contracts
Ideally, it would be possible to deterministically generate the package contract solely from the SHA256 of the package’s name. If we could deterministically produce the contract, then we could theoretically also deterministically produce the contract ID.
Being able to consistently determine the package contract ID solely from the package name would enable `forc publish` to trivially determine whether or not a contract has already been deployed for a package, or if it needs to deploy one before publishing the current version.

The generated package contract might look something like the following:
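A rough Sway-flavoured sketch, where the method names and types are assumptions rather than a settled abi:

```sway
contract;

// Hypothetical sketch: one generated contract per package, exposing
// publish/lookup for versioned content addresses.
abi Package {
    #[storage(read, write)]
    fn publish(version: [u32; 3], metadata: b256, source: b256);

    #[storage(read)]
    fn name() -> str[32];

    #[storage(read)]
    fn yanked(version: [u32; 3]) -> bool;
}
```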
The major caveat here is that the bytecode generated by `forc` for this same contract may change between `forc` versions, in the case that some optimisation is enabled, certain fields are re-ordered, etc. This would result in a different contract ID, meaning we cannot safely rely on this approach when publishing packages or fetching dependencies.

Instead, we’ll likely need a single, central registry contract solely to act as a map from name to contract ID.
The Forc Package Registry Contract
While individual packages would be published under unique package contracts, we may still benefit from a single, central registry contract that records the mapping from name to contract ID for all known packages. This may also remove the need to index the entirety of the fuel blockchain to know what packages are available.
This registry contract would maintain a mapping to assist directly looking up packages:
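For illustration, a Sway-flavoured sketch of such a mapping (names and key derivation are assumptions):

```sway
storage {
    // sha256(package_name) -> the package's deployed contract ID.
    packages: StorageMap<b256, ContractId> = StorageMap {},
}
```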
Maintaining this contract requires that when the first version of a new package is published and its package contract is deployed, its contract ID is registered within this registry contract.
An alternative to maintaining a central registry contract might be to rely on fuel blockchain indexers to track the existence of all forc packages. This would mean that during publishing and fetching, forc would require access to one of these indexers, kind of defeating our decentralised approach. Forc itself could maintain an index, but this could be expensive, particularly on first use as forc would need to fetch or build the index from scratch.
Publishing a Package
Upon `forc publish`, we first query the registry contract to determine whether or not a contract exists for our package. If not, we deploy and register the package contract.

While querying whether or not the package contract is deployed, we host the package’s metadata and source on IPFS. Once the package contract is deployed and registered, we publish the version and content addresses to the package contract.
Forc should log each of these steps as they begin, providing useful feedback in the case that any steps fail. Clean CLI wallet integration will be essential for enabling a smooth UX.
Publishing requirements would match those of cargo, namely that all required metadata is present, and all dependencies are also registry dependencies. Notably, cargo does allow for git dependencies pinned to specific commits. We could potentially offer similar support by implicitly publishing these dependencies somewhere under the package contract, but this can be left for future design work.
Fetching a Package
Upon the first build (or `forc update`) for a registry dependency `foo = "1.2"`, `forc` first queries the registry contract for `foo`'s package contract ID. Next, `foo`'s package contract is queried with the given semver for the latest semver-adhering version of `foo`, along with the content addresses for its metadata and source. Finally, we use the content address to fetch the source, cache it locally and pin `foo` to the full version (e.g. `1.2.3`).

Subsequent calls to `forc build` would be near instantaneous, as `foo` would be pinned and its source cached.
would be pinned and its source cached.If necessary, we could speed up the registry dependency fetching by using a centralized package index as future work. We can fall back to the slower, multi-step node client interaction in the case that the index is unreachable.
Side Benefits
Implementation
Here is a rough overview of the anticipated steps involved:
- `forc-wallet` can show balance, UTXOs: https://github.com/FuelLabs/forc-wallet/issues/68.
- Registry dependency declarations in `forc` (as an alternative to `git` or `path`). E.g. 3926.
- A `Package` abi adhering to the design above, with associated package metadata and source specs.
- A `Package` contract implementation template that can be used to generate package contracts.
- `PackageContract` + IPFS workflow.
- The `RegistryContract`, integrated into the `publish` command flow, removing the need for specifying the contract ID as it would now be fetched from the `RegistryContract`.
.Questions
- `forc` should default to the latest network, but allow individual dependencies to be fetched from different networks?