AuHau / dpip

Distributed PIP - installing Python packages in a distributed manner
MIT License

Initial discussion #1

Open AuHau opened 5 years ago

AuHau commented 5 years ago

This issue serves as a hub for initial discussion. I am presenting here my thoughts on how the implementation could be carried out. The project shares similarities with, and will take inspiration from, the npm-on-ipfs project.

Goal of dpip

The goal of dpip (distributed pip) is to bring IPFS into the Python Package Index (pypi.org) ecosystem. It should serve as a functional tool, but also as a demonstrator for further discussion regarding native adoption into the PyPA ecosystem.

High-level architecture

Currently, I see three main components that should be part of the implementation:

  1. the index, which translates package names into IPFS hashes,
  2. the dpip client, a wrapper around pip,
  3. the mirror daemon, which publishes PyPI's packages into IPFS.

Data flow

  1. While the mirror daemon runs, it publishes the projects' files to IPFS and publishes the root hash via IPNS, along with the updated index.
  2. When a client wants to install a package, it will refresh the index using the IPNS address. It will then fetch the package and serve it to pip, which handles the rest of the installation (see the sketch below).
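
A minimal sketch of the client side of this flow, shelling out to the go-ipfs CLI from Python; the IPNS address and MFS path are placeholders, not real values:

```python
import subprocess

# Hypothetical values: dpip would ship a real default IPNS address.
INDEX_IPNS = "/ipns/QmExampleIndexAddress"
MFS_PREFIX = "/dpip-index"  # MFS path the index gets mounted under

def ipfs(*args):
    """Run an ipfs CLI command against the local daemon and return stdout."""
    return subprocess.run(
        ["ipfs", *args], check=True, capture_output=True, text=True
    ).stdout.strip()

def refresh_index():
    """Resolve the IPNS name and mount the current root into MFS."""
    root = ipfs("name", "resolve", INDEX_IPNS)  # -> "/ipfs/Qm..."
    # Drop the stale copy if present; ignore the error when there is none.
    subprocess.run(["ipfs", "files", "rm", "-r", MFS_PREFIX], capture_output=True)
    ipfs("files", "cp", root, MFS_PREFIX)

def fetch_package(normalized_name, target_dir):
    """Download a package's distribution files for pip to install from."""
    pkg_hash = ipfs("files", "stat", "--hash", f"{MFS_PREFIX}/{normalized_name}")
    ipfs("get", "-o", target_dir, f"/ipfs/{pkg_hash}")
```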

Index

There needs to be a mechanism for translating a package name into an IPFS hash. The most natural approach for this might be using MFS, which is the approach npm-on-ipfs takes. The root IPFS hash of the whole PyPI namespace is mounted under a prefixed MFS path and should be refreshed regularly.

As IPNS resolution, together with mounting the result into MFS, can take quite some time, the index refresh could run as a detached process in the background. The refresh process will require safety checks for correct behavior, for example ensuring that multiple refresh processes are never spawned at once (see the sketch below).
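
One simple way to enforce the "single refresh process" rule is an exclusive lock on a well-known file. A Unix-only sketch using fcntl; the lock path is an arbitrary choice:

```python
import fcntl
import sys

LOCK_PATH = "/tmp/dpip-refresh.lock"  # hypothetical location

def run_refresh_exclusively(refresh):
    """Run refresh() only if no other refresh process holds the lock."""
    lock_file = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock: raises immediately if another
        # refresh process already holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("a refresh process is already running")
    try:
        refresh()
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```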

Lookup of a package then follows the same structure as in PyPI, where the path is constructed from the package name, like: /<normalized_package_name>/<wheels or sdist tarballs> (see the snippet below).
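
Name normalization is specified by PEP 503, so the lookup path can be derived as follows (the MFS prefix is a placeholder):

```python
import re

def normalize(name):
    """PEP 503 name normalization, taken verbatim from the PEP."""
    return re.sub(r"[-_.]+", "-", name).lower()

def package_path(name, prefix="/dpip-index"):
    """MFS path under which the package's wheels and sdists live."""
    return f"{prefix}/{normalize(name)}"

# package_path("Django_Rest.Framework") -> "/dpip-index/django-rest-framework"
```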

dpip / pip

dpip will serve as a wrapper around pip, proxying most of the calls except those that interact directly with IPFS. For now, I have identified these:

pip will be driven mainly through the --index / --index-url parameter, which overrides the default lookup on pypi.org. The --extra-index-url parameter can be used as a fallback to pypi.org.

dpip will ship with a default IPNS index address provided by the authors of this tool, but it will offer an option to specify a different IPNS address to use for the index. In the future, there should be a command that allows verifying that a package in the IPNS index is the same as in the PyPI index.

It is an open question how dpip should be implemented. I see two approaches:

  1. A standalone binary that mimics pip's CLI interface and proxies the calls by spawning a new pip process with the --index-url parameter set. This is the approach that npm-on-ipfs takes. No direct dependency on pip or any specific version of it is needed, but it would require implementing an HTTP server that follows PEP 503 (see the sketch after this list).
  2. Depend on pip as a package and invoke the proper functions based on the CLI arguments and options. This would most probably require depending on specific versions of pip to ensure compatibility, but it should still aim to work with as wide a range of pip versions as possible.
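
A rough Python sketch of the proxying part of approach 1, assuming a local PEP 503 server is already listening (the address is made up for illustration):

```python
import subprocess
import sys

LOCAL_INDEX = "http://127.0.0.1:8403/simple/"  # hypothetical local PEP 503 server

def main():
    args = sys.argv[1:]
    if args and args[0] == "install":
        # Point installs at the IPFS-backed index; keep pypi.org as fallback.
        args += ["--index-url", LOCAL_INDEX,
                 "--extra-index-url", "https://pypi.org/simple/"]
    # Proxy everything to the real pip in a child process.
    sys.exit(subprocess.call([sys.executable, "-m", "pip", *args]))

if __name__ == "__main__":
    main()
```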

Pinning

It would be beneficial to allow users of dpip to pin installed packages. This could be done using pip's cache, where wheels and sdists are present (see the sketch below).
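
A sketch of what pinning could look like, assuming the wheels and sdists can be found under pip's default cache directory on Linux (the exact cache layout differs between pip versions, so the globbing here is an assumption); note that ipfs add pins the added content by default:

```python
import pathlib
import subprocess

PIP_CACHE = pathlib.Path.home() / ".cache" / "pip"  # default Linux location

def pin_cached_distributions():
    """Add (and thereby pin) every wheel and sdist found in pip's cache."""
    for pattern in ("*.whl", "*.tar.gz"):
        for dist in PIP_CACHE.rglob(pattern):
            cid = subprocess.run(
                ["ipfs", "add", "-Q", str(dist)],
                check=True, capture_output=True, text=True,
            ).stdout.strip()
            print(f"pinned {dist.name} as {cid}")
```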

Questions:

Mirror daemon

The mirror daemon should be bound to a specific IPNS address where the mirror will be placed, so people can produce their own indexes if they desire. IPFS's deduplication mechanism should work to our benefit here.

A project to use, or at least take inspiration from, is https://github.com/pypa/bandersnatch, which provides a full PyPI mirror and a PEP 503 compliant server (see the sketch below).
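
The publish side of the daemon could then be a loop around bandersnatch's output directory; bandersnatch mirror is its real CLI entry point, while the directory and interval below are assumptions:

```python
import subprocess
import time

MIRROR_DIR = "/srv/pypi-mirror/web"  # wherever bandersnatch is configured to write
INTERVAL = 30 * 60  # re-publish every 30 minutes (arbitrary choice)

def ipfs(*args):
    return subprocess.run(
        ["ipfs", *args], check=True, capture_output=True, text=True
    ).stdout.strip()

def publish_once():
    # Sync the mirror with PyPI, then push the whole tree to IPFS.
    subprocess.run(["bandersnatch", "mirror"], check=True)
    root = ipfs("add", "-r", "-Q", MIRROR_DIR)  # -Q prints only the root hash
    ipfs("name", "publish", f"/ipfs/{root}")  # update the IPNS pointer

if __name__ == "__main__":
    while True:
        publish_once()
        time.sleep(INTERVAL)
```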

Questions:

AuHau commented 5 years ago

Regarding which way the implementation should take: given #2, it would make sense to take the second approach, i.e. depend on pip as a package and call the proper functions directly.

Jorropo commented 5 years ago

I think a good first question is: "what language to use?". Go? That may be too powerful (with the complexity it brings) for this project. Python? Logical, but the current Python IPFS libraries only work over the HTTP API; a native IPFS daemon in Python is planned but far from release. JavaScript? I think this is the best choice: it is simple, we have libraries like OrbitDB that can simplify development, and I'm pretty sure about the possibilities of linking with pip.

And what about an embedded daemon? I personally think we can let users choose, using either the HTTP or the full-stack implementation on demand. (I personally prefer a bundled daemon because it actually speeds up adoption a little bit: a ready-to-launch package, with no need to configure a connection to a daemon.)

AuHau commented 5 years ago

Yeah, I automatically started with Python, but during the analysis I was also wondering a bit whether it is the way to go. I spent this morning thinking about it again, and I think that Python is the way to go.

The main reason for me is the requirements that other languages would imply.

JavaScript would require Node.js to be installed, which is definitely something we should not require for running a Python package manager. Yes, there are ways to compile JS into binaries, but I am not sure how well that works and how well js-ipfs would work with it (js-ipfs being the main reason for picking JS over Python).

Go in itself does not have such a hard dependency as JS, but as you mentioned, it might be a bit of an overkill.

Lastly, a big plus for Python is that it could directly tap into pip and expand its capabilities (see #2); moreover, quite some code can be reused from other solutions, like the already mentioned bandersnatch package.

Regarding the integrated daemon, it would definitely be a big plus and would lower the adoption barrier. That said, I think we should start with basic functionality and then expand on top of it. Python allows platform-specific binary packages, so in the future we could ship a version of the package with a bundled go-ipfs binary.

hsanjuan commented 5 years ago

Hi,

the high-level overview seems OK, similar to npm-on-ipfs. It should work, except for IPNS, which might be too slow. npm-on-ipfs uses js-ipfs. Since you seem to want to rely on the go-ipfs daemon, it's a good opportunity to test the experimental ipns-pubsub feature and see how that goes. Otherwise you will need to publish your current root in a non-IPNS way (via DNS or via an HTTP endpoint).

One important thing to explore is what format PyPI packages use (I guess some kind of zip), and why (why was this format chosen, what are its special characteristics). One thing that would be amazing is to have custom importers for that format, so that the IPFS DAG reflects not a chunked binary blob but the tree structure and the packaged files. Unless they are simply TARs, this will require significant work, but the reward is much better deduplication possibilities. At the very least, you should try to import packages using both the Rabin chunker and the normal (fixed-size) one, and see whether repository size grows as fast as new versions of packages come in (see the sketch below).
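
A quick way to run the suggested comparison, assuming two consecutive releases of some package are available as local files (the file names are placeholders): add both with each chunker and count how many blocks the two versions share. ipfs add --chunker and ipfs refs are existing go-ipfs commands:

```python
import subprocess

def add(path, chunker):
    """Add a file with the given chunker and return its root CID."""
    return subprocess.run(
        ["ipfs", "add", "-Q", "--chunker", chunker, path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

def blocks(cid):
    """All block CIDs reachable from the root, including the root itself."""
    out = subprocess.run(
        ["ipfs", "refs", "-r", "-u", cid],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    return set(out) | {cid}

for chunker in ("size-262144", "rabin"):  # fixed-size default vs Rabin
    v1 = blocks(add("pkg-1.0.tar.gz", chunker))
    v2 = blocks(add("pkg-1.1.tar.gz", chunker))
    print(f"{chunker}: {len(v1 & v2)} shared blocks of {len(v1 | v2)} total")
```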

AuHau commented 5 years ago

@hsanjuan thanks for the feedback!

I am a bit reluctant to go the HTTP endpoint way like npm-on-ipfs, because I think it somewhat defeats the point of decentralization (actually, when I was studying npm-on-ipfs, the endpoint was down...). I am aware of the performance issue with IPNS; that is why I plan to run the "refresh" process as a background task, so that only the first run would take long.

That is a very good point regarding the chunking! I hadn't thought of that. Python has two main distribution formats: source distributions (i.e. tarballs with Python source code) and binary distributions (i.e. wheels, a custom zip-based format that can contain compiled sources such as C extensions and hence can be platform-specific). I will dig a bit more into how this could be handled.

Could you please provide pointers to where I could study a bit more about importers? https://github.com/ipfs/go-ipfs-chunker ? Also, if I were to implement a custom importer, is there a way to plug it into go-ipfs externally via configuration? (I haven't found such an option.) Or would I have to build a custom go-ipfs binary with the importer?

hsanjuan commented 5 years ago

because I think it somewhat defeats the point of decentralization

It does, but you also want something usable in the real world, so sometimes compromises are necessary, even if temporary :).

Could you please provide pointers to where I could study a bit more about importers? https://github.com/ipfs/go-ipfs-chunker ? Also, if I were to implement a custom importer, is there a way to plug it into go-ipfs externally via configuration? (I haven't found such an option.) Or would I have to build a custom go-ipfs binary with the importer?

There is no way to side-load custom importers, but the go-ipfs folks probably would not mind including new importers. Actually, importers are two things: the chunker (go-ipfs-chunker does that) and the DAG builder (ipfs supports balanced and trickle now: https://github.com/ipfs/go-unixfs/tree/master/importer). There is an example of a custom TAR importer at https://github.com/ipfs/go-ipfs/blob/master/tar/format.go, which is used by the ipfs tar command; however, it is very rough and enforces Rabin chunking. In any case, it's a good pointer for how a new importer for go-ipfs would work.