Contributions? - Githubissues

cgross commented 10 years ago

Hi - I have some time over the next 2 or 3 weeks. I was wondering if the team is looking for contributions to the registry work. If so, and if there was something that could be sliced off, I'd be happy to devote some time.

sindresorhus commented 10 years ago

@wibblymat @svnlto

cgross commented 10 years ago

Hey guys. Just a reminder that I'm willing to help. I really want to see Bower succeed and am willing to devote cycles to it.

FWIW, I've given a bit of thought about the easiest way to get the registry up and going. Using the bit of knowledge I have from working with the bower cli code, here's some thoughts. If we assume that package archives should be stored on s3 (or a similar service), then I think this could be accomplished in short order:

(s3) create a bower bucket. Set all contents as public readable but writeable only by the registry itself.
(registry) write a new publish route to take an archive and upload it to the bower bucket so the url looks like http://bower.s3.amazonaws.com/jquery/v1.2.3.tar.gz. Record the url for the package as http://bower.s3.amazonaws.com/jquery in the existing registry db.
(bower cli) write a new s3Resolver that works against urls starting with http://bower.s3.amazonaws.com. Resolvers generally need to do two things 1.) get the versions available and 2.) download an individual version. 1 can be accomplished with just a simple directory listing if we name the archives with the ver as suggested. 2 is obviously easy if the bucket is publicly readable.
(bower cli) write a new publish command to create the archive and upload it to the new service.

While that doesn't do everything that is likely desired, it would get an asset storage mechanism working with a pretty small and reasonable amount of code. And as an extra bonus, the server and bandwidth resources necessary wouldn't significantly change. The cli is still only asking for a url for a package and if its an s3 url, the cli is downloading directly from s3 and not going through the registry.

Anyway, just throwing this out there. I could get that work done two weeks or so, since its relatively minimal.

wibblymat commented 10 years ago

Thanks for thinking about this! S3 is something I hadn't considered so far. Up until now I'd assumed we would follow the npm route and use CouchDB. The major advantage of CouchDB is that replication/mirroring is trivial to set up, a feature I'm sure we want to get eventually.

We could do with properly exploring our data storage options. We need database storage for package information, users, etc. and binary storage for package archives. Need to look at cost, reliability, replication, all of that sort of thing.

As for useful tasks to look into... I'll get back to you later today. Ping this thread with an @wibblymat if I don't!

wibblymat commented 10 years ago

@cgross I should say that I am in principle in favor or your plan! Thinking about CLI changes is something that has been ignored up until now.

wibblymat commented 10 years ago

In fact, that's it. Would you like to look into a 'publish' command?

bower publish [path]

path is optional, the base path of the package. By default we look for a bower.json in the current directory or any parent directory and choose the first one as the package path.

Either way, the path needs to contain a bower.json.

Use the bower.json information to create a package tar.gz, <name>-<version>.tgz, containing files and subdirectories within the path. It should not contain any files in the ignore list if there is one.

Hopefully by the time you've got that we'll have a better idea of where you will be sending it.

Sound good?

cgross commented 10 years ago

Definitely. I will start to work on publish. Regarding s3, I think I'd read somewhere that the npm team is actually moving the binaries out of couchdb. Basically keeping all the metadata in the db but moving the rest out to some sort of file store. Thats what got me thinking s3.

I'll get started on publish and submit a pull to the cli project when its done. Excited to be helping out!

benschwarz commented 10 years ago

Storing package data in CouchDB is exactly why NPM costs so much time & money to keep running — Storing in S3 (with a CNAME) means that we have cheap, redundant storage.

The major issue to overcome from there is how we allow total replication… but thats probably not a discussion for here.

cgross commented 10 years ago

Yup. Storing binaries in databases is never a great idea. A CNAME would be a nice way to hide the s3 url from downstream code.

svnlto commented 10 years ago

Here's how npm does it. http://blog.npmjs.org/post/75707294465/new-npm-registry-architecture

Storing package data in CouchDB is exactly why NPM costs so much time & money to keep running

Really?

paulirish commented 10 years ago

@cgross awesome to have you step up and get involved! that's the best way. :)

Things look good now, but Holler if you get stuck and need feedback to keep moving. I want to make sure you're unblocked and kickin ass.

cgross commented 10 years ago

@paulirish Thanks! Appreciate the help. Glad to get rolling :)

marcooliveira commented 10 years ago

Been a while since I've been here, but giving my 2 cents, I agree that having an external storage is likely better than storing on couchdb.

On a related matter, are you guys considering/looking for additional storage and bandwidth capabilities?

janl commented 10 years ago

Friendly CouchDB person here :)

These are a few random notes on the topic, I hope you don’t mind me jumping in here.

The new npm architecture might seem a bit convoluted, but they were trying to make sure to not break backwards compatibility as best as possible, so there were a few extra hoops to jump through.

Generally:

Not having to operate anything is better than operating something (S3 beat CouchDB or other).
Having control over all pieces of the infrastructure beats giving up control (CouchDB or other beat S3).
HTTP is your friend.

There are a few trade-offs that you all need to be deciding on (that the npm architecture IMHO covers neatly)

you want to avoid a single point of failure for two reasons:
1. high availability: it’s not fun if the registry is down. For binaries, a CDN helps there, for other data & write requests you want another solution.
2. latency: if the main sever or cluster lives in the US, people in Australia have a bad time using the service. Again for binaries, the CDN helps, for other data, you want another solution. Things like Travis CI also work best with local mirrors.
3. (intentional off-by-one) privacy: some folks might want to use the bower registry infrastructure, but for private packages. Or they don’t want to leave public traces of what package they want to use.

CouchDB makes it extreeeeemely trivial to get all of the above through its built-in replication mechanism. If you are lacking a mental modal, think of git remotes, you can push and pull data at will between all sorts of (semi-)connected locations and end up with a consistent set of data every time.

you want to avoid storing too many binaries in CouchDB for a couple of reasons:
1. Each database is stored in a single file on disk. This has tremendous advantages for reliability, speed, consistency etc. but it also means that you may end up with really large database files in your filesystem. As is, this isn’t a problem, both your operating system and CouchDB can handle this fine. When it comes to backup/restore and other maintenance things, this can get a little hairy.
2. For data durability reasons, CouchDB never throws away any data that has been committed to the database file (this is good). As such, it relies on a procedure called “compaction” to clean up old data (Postgres and SQLite call this “vacuum”, it is the same thing). Compaction costs i/o and the larger the database, the more expensive the i/o tax is.
you want to avoid having to manage consistency between the core data and the binaries manually. npm does this cleverly in the new architecture by “atomically” moving the binaries out of the core database and into the binary store / CDN system. E.g. this is doable, but you need to be careful designing this. Keeping binaries in CouchDB just takes care of all that.

Storing binaries in databases is never a great idea. — @cgross

Unless you have a database that is designed to do this, like CouchDB. It’s just that in the npm, and then likely bower case (assuming they are shaped similarly) that binaries outweigh the rest of the data by a lot. CouchDB was more designed to handle things like email, where some records might have some or more attachments.

You could diverge from the npm model by using a single database per package instead of a database for all packages. This would mitigate all the operational issues around large databases. CouchDB handles tons of databases just fine (including some tricks that make stour average file system happy with 100s of 1000s files). I haven’t though this through fully, but it might just work for you.

The trade-off there is that aggregation of data isn’t as simple as with having it in a single database. E.g. a “find all packages by author X” is harder to realise than with a single database. However, all of this could conveniently be outsourced to an instance of ElasticSearch which is a brilliant piece of software designed to help with this.

The nice thing about the new npm architecture (IMHO) is that it serves end-user traffic via a CDN (outbound) and a small, tight, fast, no-binary-data-only CouchDB (inbound) while maintaining a full regular CouchDB (fullfatdb) that interested parties can replicate down to their location just fine. This big CouchDB does not get the traffic from end-users, and as such all the problematic things from above are not as big a problem, and potentially a viable trade-off to make for the people who end up maintaining this.

CouchDB roadmap items relevant to this discussion:

In the past months some of the issues that npm had we managed to fix in the CouchDB code base. I believe all fixes we have are in the upcoming 1.6.0 release. I currently, off-hand, don’t know if we got them all, but things are a lot better than in 1.5.0.

On top of that: CouchDB is ultimately a clustered database (the “C” even stands for “Cluster”), but the 1.x series is a de-facto single-server database, that you can cluster with some extra work. There is a fork of CouchDB called BigCouch that adds the full C back into CouchDB and that fork is currently being integrated back into Apache CouchDB proper. The nice thing here is that the code has been battle tested in the Cloudant platform with tons of data and users and neither npm nor bower would be the biggest tenant supported by that code.

This brings a couple of advantages:

Clustering: you can use more than one physical server to host all your data. All consistency and replication guarantees remain. That also means each server has only a part of all the data, so all the problems with big databases are reduced. In addition, you can already cluster on a single machine and while you won‘t get any high-availability or speed improvements, you still get a set of smaller databases on the single server, further mitigating the large-database trade-offs.
Cheaper compaction: the BigCouch merge will introduce a smarter compactor that uses less i/o and produces smaller database files, further mitigating the mentioned issues.

The catch: this isn’t shipping code yet, but it is actively being worked on right now. We don’t have a timeline, it’s open source after all, but Cloudant, the BigCouch sponsors, are committed to see this through and the CouchDB developer community supports this equally. Given that npm went fine for a couple of years on just a single instance CouchDB, this might work out, as I don’t think the merge will take us another year (but again, no promises :).

Finally, I’d like to offer any support that Team Bower needs on behalf of Team CouchDB. We are happy to help, just ask any questions you might have:

user@couchdb.apache.org
irc.freenode.net/#couchdb
http://twitter.com/couchdb

Or ping me: jan@apache.org / @janl.

Finally, finally: I’m a big believer in using the right tool for the job and if you/we find out that CouchDB isn’t the right thing to back bower’s registry, then I am very happy that we don’t have an unhappy user here :)

wibblymat commented 10 years ago

Thanks @janl for the detailed insight!

FWIW, I have quite a lot of experience running and using CouchDB and certainly agree that for the package metadata it is absolutely perfect. I also lean towards using it for the binaries, with binaries kept in one or more databases separate to the metadata to mitigate scale issues. I disagree with @benschwarz and @cgross that it is a fundamentally bad idea, but there may well be better ones.

Our usage does not seem appropriate for a relational database because the data we will store (bower.json files) is non-relational.

I'm still open to discussing other non-relational databases and other options for storing binaries

cgross commented 10 years ago

Yea thanks for all the info!

Based on the NPM article, we know that they store binaries in manta (similar to s3) AND in couchdb. The npm cli pulls directly from manta but falls back to the couchdb instance if necessary. But the primary reason for storing the binaries in couch is that it makes replication easy.

It sounds like you suggest this approach as well. Sounds reasonable to me. And we'd put a CDN in front of the s3/file storage solution as well.

cgross commented 10 years ago

@wibblymat The publish command is creating archives now. Ready to keep rolling. Would it make sense for me to create a placeholder route for publish and complete the code to upload the archive to that route? Leaving all the db code for later. I could also do some quick code to drop the archive in s3 as well. Or whatever else you think would be a good next step.

dch commented 10 years ago

One other thing worth mentioning is that CouchDB has built-in support / awareness for etags. This means that even a normal couchdb instance, with a decent varnish-like layer in front, hosted or local, can push surprisingly heavy traffic. CouchDB is HTTP, from the ground up, in a way that no other DB is. Good luck either way!

sindresorhus commented 10 years ago

I also lean towards using it for the binaries, with binaries kept in one or more databases separate to the metadata to mitigate scale issues. I disagree with @benschwarz and @cgross that it is a fundamentally bad idea, but there may well be better ones.

Well, seeing as npm says it was such a mistake, I don't see the logic in committing the same mistake, even if separated from the metadata. Storing files in a DB is a fundamentally bad idea. The binaries should be on a CDN.

benschwarz commented 10 years ago

While CouchDB http replication is super-cool and easy to setup, I'm actually more concerned that we'd be unable to change schema / structure of the database, without breaking a million clients / libraries / services (backwards compatibility, too).

"Because NPM does it…" is not a convincing argument for designing system architecture.

cgross commented 10 years ago

Also, the primary benefit of putting binaries in the db is to aid in replication... but I wonder why there's focus on that. How many people really need to replicate the bower repository (or npm's either)? There are certainly use-cases but its gotta be a small sliver of people. Is it a big deal if that small sliver of people have to run a script to download all the binaries off a file store?

janl commented 10 years ago

(note, I just want to make sure that you have all the information to make a good decision, not “sell” you on CouchDB)

Well, seeing as npm says it was such a mistake

As explained above, the problem is not with binary data in the database per-se, but the rate of binaries to regular data as it pertains to how CouchDB does storage today. I assume the bower registry being not as big as npm (please correct me, if I’m wrong) and as such the BigCouch developments could very well be in place by the time you need them.

Storing files in a DB is a fundamentally bad idea.

…unless the database is designed to do that. If you were going with the database-per-package route, things would not be bad at all. The BigCouch architecture would even solve this for the one big database for when you get there.

CouchDB sends binaries from the filesystem straight over HTTP to the clients (not using sendfile()), but it is still a straight shot through the kernel). This is considerably different from databases that connect via a binary protocol to a middleware layer that then is connected to a web server (where you cross kernel-land and user-land two or three times).

The latter scenario is where the conventional wisdom (“…is a bad idea”) comes from.

The binaries should be on a CDN.

No argument here. The CDN just needs a source and CouchDB is as good as any for this :)

I'm actually more concerned that we'd be unable to change schema / structure of the database

Excellent point, npm had to put smarts to avoid issues here into all the client releases. Sometimes this didn’t work out nicely, but afaik it hasn’t caused major issues (yet :). With a little foresight, the clients could be made aware of potential schema changes.

If you have a middleware layer, that migration logic can all live under your control, which is definitely a nicer situation better than having to worry about clients that are not in your control. Either way though, you have to handle the migrations somewhere. I’m not sure how likely a DB-schema change would be without a corresponding public API change (say a field rename) which would put clients and other 3rd party consumers under BC risk again. You could still use CouchDB and a HTTP proxy that could do record migration on the fly depending which schema version the clients request. This is not trivial and may break in interesting cases, but it’ll still allow for seamless mirroring.

Personally, I’d trade the advantages of being able to clone the registry easily for having some extra care put into migration work, but that’s entirely up to you, of course :)

How many people really need to replicate the bower repository (or npm's either)?

npm already seems some of this (mirrors on every continent, lots of in-house mirrors for enterprises), but if you look at other package distribution systems (apt-get, all Apache software, etc), the idea of mirrors is baked into the very fabric of them all. — Some folks argue that in an age of affordable CDNs this is all overhead, but there are at least a few valid use-cases. Again, not that you have to care about them, just making sure you are aware which doors you might be closing.

Is it a big deal if that small sliver of people have to run a script to download all the binaries off a file store?

Getting a consistent snapshot of dynamic metadata plus binaries is the tricky bit. While not impossible (hello rsync), it is not trivial :)

rayshan commented 10 years ago

Hi all, I really enjoyed reading your input here. Since this fizzled out a bit, I submitted a new issue #73 to move the rewrite forward. Would love your input here.

sheerun commented 8 years ago

Nowadays it's clear that bower won't create another registry with ability to upload binaries.

Npm's registry already provides great way to upload binaries and common name space. The only way we could improve the situation is to make easier to download package from npm registry with bower.

That's for another issue though

bower / registry

Contributions? #52