Crazy Idea: git backend registry

josh commented 10 years ago

@maccman and I had a pretty crazy, but I think genius, idea to just have the registry in a git repo.

It addresses many of the concerns in #73 and requires very little maintenance and upkeep from the bower team. Making the infrastructure more complex instead of less seems like a setup for failure given our current maintainership and budget issues. I think we can turn our limited resources into an advantage here.

/cc @paulirish @sheerun @rayshan @benschwarz

benschwarz commented 10 years ago

Looks fun, @josh

The impractical side of birdsnest not running any code is that it won't record usage statistics.

Otherwise, manually registering packages via pull request would prove to be more work than what we already have (and cause delays in package registration).

These seem like blockers to me right now, sorry to rain on your parade… I think the idea is pretty cool, but I'd be interested to know your response to these concerns. Any plans?

josh commented 10 years ago

> The impractical side of birdsnest not running any code is that it won't record usage statistics.

I'd argue the registry is a poor place to track downloads since it isn't even offering them. Git clones and release downloads are recorded in the GitHub traffic graphs, which even include downloads that don't go through name resolution.

In terms of stats, do you have numbers on how much people care about this hit data? I imagine most people neither know it exists nor care. I think we should value decentralization and mirroring over any centralized analytics system for an open registry.

(Footnote: No one looks at their DNS queries to tell how much traffic their sites get)

> Otherwise, manually registering packages via pull request would prove to be more work than what we already have (and cause delays in package registration).

Automatic merging of new packages is trivial to set up with a bot. The situation today for disputing package names and removals is a complete mess and pretty much nothing gets resolved: 1) the process is manual, requiring us maintainers to SSH into the db and remove packages, and 2) there's no structured process for having a public discussion about removals. Things like delisting your own packages can be automated by a merge bot.

A full author history would give us an audit log to track down the original person that submitted a package or proposed a rename. Today the database is completely anonymous and completely fails us here.

benschwarz commented 10 years ago

> I'd argue the registry is a poor place to track downloads since it isn't even offering them. Git clones and release downloads are recorded in the GitHub traffic graphs, which even include downloads that don't go through name resolution.

> In terms of stats, do you have numbers on how much people care about this hit data? I imagine most people neither know it exists nor care. I think we should value decentralization and mirroring over any centralized analytics system for an open registry.

I don't have usage analytics for http://bower.io/stats/, @rayshan does though.

> Automatic merging of new packages is trivial to set up with a bot. The situation today for disputing package names and removals is a complete mess and pretty much nothing gets resolved: 1) the process is manual, requiring us maintainers to SSH into the db and remove packages, and 2) there's no structured process for having a public discussion about removals. Things like delisting your own packages can be automated by a merge bot.

:+1: SGTM.

> A full author history would give us an audit log to track down the original person that submitted a package or proposed a rename. Today the database is completely anonymous and completely fails us here.

YES!

josh commented 10 years ago

So I still think there would be a place for separate "bower package index services". http://customelements.io would be a good example here. Community sites can handle niche curation, nicely displayed metadata, and more extensive full-text search over package readmes.

Potentially "install" hits could still be managed by a standalone service that would be optional to participate in. Keeping it separate from the core registry solves many of the security and service availability issues.

Definitely a good question. I do feel a little biased because I'd personally like to be opted out of download tracking and choose privacy instead.

sheerun commented 10 years ago

Pros:

Companies could more easily create their own repositories.
We would have a clear history of edits to the registry.
One could "pin" the registry to a given version and not worry about continuous changes.

Cons:

Need to change the old registry so it syncs with the GitHub repo.
Need to write a bot for accepting new components (also causes a lot of e-mail noise).
What to do with the bower register command for old bower? Release a patch that disables it?
Problems with file limits on some systems? (about 20 000 files in one directory)
Long clone delay as the number of edits and new components grows.

I propose a middle-ground solution: use a GitHub repository as storage for the registry, instead of Redis.

Advantages:

It wouldn't break the old bower register and the new bower unregister commands.
No need for a bot accepting new components (users can send PRs only for edits).
No e-mail spam from new component merges (only edit requests).
No long clones and no need to worry about file limits (the registry is still used through HTTP).

patrickkettner commented 10 years ago

Couldn't bower register be a command that opens a PR with the proper information?

Also, what OS's filesystem has a limit anywhere near that low?

sheerun commented 10 years ago

I meant the already implemented bower register command and all the bower users who didn't upgrade.

Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

patrickkettner commented 10 years ago

> I meant the already implemented bower register command and all the bower users who didn't upgrade.

Isn't that the case for anyone who doesn't update it now? I would think the best thing to do would be to create a new error code from the server that results in an 'update your client' message on the client (to help with any similar issue in the future), and then keep a thin server up that just returns that error until the hits get below an acceptable threshold.
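A minimal sketch of that thin server idea, for illustration only. The HTTP status (410), the `ECLIENTOUTDATED` code, the message text, and the hit counter are all assumptions made up for this example; nothing in the thread or in bower defines them.

```javascript
// Sketch only: status code, error code and message are hypothetical,
// not anything the bower registry or client actually defines.
const OUTDATED_ERROR = {
  code: 'ECLIENTOUTDATED',
  message: 'This registry endpoint has been retired. Please update your bower client.'
};

// Count requests so maintainers can see when traffic from outdated
// clients drops below whatever threshold they consider acceptable.
let hits = 0;

// Answer every request with the same "update your client" error.
function handleLegacyRequest(req, res) {
  hits += 1;
  res.statusCode = 410;
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify({ error: OUTDATED_ERROR }));
}

function getHits() {
  return hits;
}
```

The handler has the usual Node request/response signature, so it could be passed to `require('http').createServer(handleLegacyRequest)` to serve it.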

> Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

'course it is. https://developer.github.com/v3/pulls/#create-a-pull-request How else would github clients work?

sheerun commented 10 years ago

You need to fork the registry, clone it, commit the change, push it, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

Other issues are the many e-mail notifications and having to write an auto-accepting bot.

I think it's easier and cleaner to write a script that commits a new entry instead of accepting a PR. Of course one could send a new entry via PR, but it wouldn't be necessary given that the bower register command still works.

josh commented 10 years ago

> Need to change the old registry so it syncs with the GitHub repo.

The existing registry can be a proxy to the new API to preserve compatibility with old clients.

> Need to write a bot for accepting new components (also causes a lot of e-mail noise).

I'd probably suggest no one directly watch the "repo" but set up a team for people to mention when it requires direct review. It's a pretty typical GitHub workflow for repos with high-volume issue trackers.

> Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

Well, luckily I'm just the person to make that happen.

> Problems with file limits on some systems? (about 20 000 files in one directory)

If we cared about FAT32, it's 65,534. I'd say we don't, and every other modern FS supports millions or more. Most length limits are on the file name itself.

This is more of a format issue, but you can also shard off the first letter. packages/j/jquery
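As a rough sketch of the sharding idea, assuming the `packages/<first letter>/<name>` layout from the example above. The function name and the choice to lower-case the shard letter are assumptions for illustration.

```javascript
// Map a package name to a sharded path such as "packages/j/jquery".
// The "packages" root and one-letter shard follow the example in the
// thread; lower-casing the shard letter is an assumption.
function shardedPath(name) {
  if (typeof name !== 'string' || name.length === 0) {
    throw new Error('package name must be a non-empty string');
  }
  const shard = name[0].toLowerCase();
  return ['packages', shard, name].join('/');
}
```

With roughly 20 000 packages split over first letters, each directory would hold far fewer entries than a single flat `packages/` directory.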

> Long clone delay as the number of edits and new components grows.

Pretty unlikely. The current registry is about 4MB. Even with git history, gzip is amazing at compressing this stuff.

To put performance in perspective, a cold bower install jquery would have to fetch a little 4MB repo with names (just once) and then clone down an entire 22MB jquery repo. Bower's perf issues are in package repo fetches and updates, not this registry.

patrickkettner commented 10 years ago

On Mon, Oct 13, 2014 at 3:39 PM, Adam Stankiewicz notifications@github.com wrote:

> You need to fork the registry, clone it, commit the change, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

The proposed search was implemented by cloning the same repo, so if that plan were followed, the only hurdle would be a one-time authorization.

As far as emails go, why would anyone subscribe to it? Even if someone did for some reason, they could easily opt out.

josh commented 10 years ago

> You need to fork the registry, clone it, commit the change, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

Web flow baby.

Something like https://github.com/josh/birdsnest/new/gh-pages/packages/foo?body=https://github.com/josh/foo.git could prefill the filename and body, and you would just have to submit it.
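A sketch of how a client might build such a link, following the URL shape in the example above. The function name and parameters are hypothetical, the `body` query parameter is taken from the comment rather than from GitHub documentation, and unlike the example the value is percent-encoded here so that it survives as a single query value.

```javascript
// Build a GitHub "new file" web-flow URL that pre-fills the file name and body.
// Mirrors the URL shape from the comment above; whether GitHub honors the
// `body` parameter is not verified here.
function prefilledNewFileUrl(repo, branch, name, gitUrl) {
  const filePath = ['packages', name].map(encodeURIComponent).join('/');
  return 'https://github.com/' + repo + '/new/' + branch + '/' + filePath +
    '?body=' + encodeURIComponent(gitUrl);
}
```

A `bower register` implementation could print or open this URL instead of performing the fork, clone, commit and push steps itself.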

sheerun commented 10 years ago

> > Need to change the old registry so it syncs with the GitHub repo.

> The existing registry can be a proxy to the new API to preserve compatibility with old clients.

Here are the routes:

app.get('/packages', routes.packages.list);
app.get('/packages/:name', routes.packages.fetch);
app.get('/packages/search/:name', routes.packages.search);
app.post('/packages', routes.packages.create);
app.del('/packages/:name', routes.packages.remove);

Only packages/:name could be easily proxied.

It's hard to proxy GET /packages. In bower it's used by bower search without an argument. And who knows who else uses it (e.g. your update.sh script). Maybe some clone-and-build-every-minute trick would work.

/packages/search/:name is used by bower search :name. It would be nice if it used http://bower.io/search/ instead, but for now the same registry endpoint has to serve both search and package lookup (see docs on registry.search config option).

POST /packages and DEL /packages/:name could automatically commit to repository?

Or should we drop support for some endpoints and make someone angry? :)
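To make the "clone & build" idea for GET /packages concrete, here is a rough sketch that regenerates the list from a local checkout. It assumes one file per package whose name is the package name and whose content is the git URL (as in the prefill example earlier in the thread), and it assumes the legacy endpoint returns an array of `{ name, url }` objects; both points are assumptions for illustration.

```javascript
const fs = require('fs');
const path = require('path');

// Build the payload for a legacy GET /packages response from a checkout
// of the registry repo. Assumes one plain-text file per package under
// `packagesDir`, named after the package and containing its git URL.
function buildPackageList(packagesDir) {
  return fs.readdirSync(packagesDir)
    .filter((name) => fs.statSync(path.join(packagesDir, name)).isFile())
    .sort()
    .map((name) => ({
      name: name,
      url: fs.readFileSync(path.join(packagesDir, name), 'utf8').trim()
    }));
}
```

A periodic job could run `git pull`, call `buildPackageList`, and cache the resulting JSON, so the proxy never has to touch the repository on a request.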

> > Long clone delay as the number of edits and new components grows.

> Pretty unlikely. The current registry is about 4MB. Even with git history, gzip is amazing at compressing this stuff.

Still, currently if you want to install only jquery, there's one ~1KB request to /packages/jquery.

> To put performance in perspective, a cold bower install jquery would have to fetch a little 4MB repo with names (just once) and then clone down an entire 22MB jquery repo. Bower's perf issues are in repo fetches and updates, not this registry.

Bower usually downloads packaged .zip tags. For jquery it's 750KB, just one file.

sheerun commented 10 years ago

Just FYI I'm not against this. I just want to list possible issues.

sheerun commented 10 years ago

Bower's search-server uses GET /packages as well.

josh commented 10 years ago

> Just FYI I'm not against this. I just want to list possible issues.

Haha for sure.

There's room for some debate just around the storage format alone. Originally I had just a single .txt file. But I think separate files avoid merge conflicts, take file sorting out of the question, and lead to slightly better git object compression across changes.

File values could potentially be a full JSON object with other metadata, but I really don't know what else would need to be a core concern. Keywords and dependencies are described in the package's metadata, which works better since the author can just change it.

There's also an interesting issue about package name validation in regards to the FS. I mostly blame @maccman for the original poor validation. For example, he removed a package from the registry that managed to register its name as an empty string "". It also seems like stuff like component/foo is valid today, but that seems very problematic for the bower install phase. There's no way to install both component and component/foo.
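A sketch of the kind of check a merge bot could run before accepting a new name. Only the first two rules come from the cases mentioned above (the empty string and `component/foo`); the backslash and `.`/`..` rules are extra assumptions for a file-per-package layout, and the function name is hypothetical.

```javascript
// Return null when the name is acceptable, or a short reason when it is not.
function validatePackageName(name) {
  if (typeof name !== 'string' || name.trim() === '') {
    return 'name must not be empty';
  }
  // "component/foo" would collide with a package called "component"
  // once names are used as file or directory names.
  if (name.includes('/') || name.includes('\\')) {
    return 'name must not contain a path separator';
  }
  // Assumption: also refuse names that are special path segments.
  if (name === '.' || name === '..') {
    return 'name must not be a relative path segment';
  }
  return null;
}
```

For example, `validatePackageName('jquery')` returns `null`, while `validatePackageName('component/foo')` returns a reason string.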

sheerun commented 10 years ago

The file format could be JSON:

{
  "url": "git://github.com/jquery/jquery"
}

It allows for tricks like the following, even with the current bower version:

bower install jquery --config.registry=https://sheerun.github.com/birdsnest

Theoretically one could host their own bower repo even now, just by forking birdsnest.
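To illustrate why a static host can stand in for the registry here: the client only needs to fetch `<registry>/packages/<name>` (the GET /packages/:name route listed earlier) and read `url` from the JSON it gets back. The helper names below are hypothetical and the sketch is not bower's actual implementation.

```javascript
// URL a client would request for a package, given a registry base URL.
function entryUrl(registry, name) {
  return registry.replace(/\/+$/, '') + '/packages/' + encodeURIComponent(name);
}

// Extract the git URL from a registry entry in the JSON format proposed above.
function parseEntry(text) {
  const entry = JSON.parse(text);
  if (!entry || typeof entry.url !== 'string' || entry.url === '') {
    throw new Error('registry entry must have a non-empty "url" string');
  }
  return entry.url;
}
```

With the registry from the command above, `entryUrl('https://sheerun.github.com/birdsnest', 'jquery')` points at the static file `packages/jquery` in the fork.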

sheerun commented 10 years ago

What needs to be done:

[ ] Set up a staging registry for testing
[ ] Implement proxying /packages/:name to REPO_URL/packages/:name
[ ] Implement periodic fetching and generating /packages from REPO_URL
[ ] Proxy /packages/search/:name to http://bower.io (or deploy http://bower.io under http://bower.herokuapp.com)
[ ] Let POST /packages/ create a new commit (or error with a pre-filled PR URL)
[ ] Let DELETE /packages/:name create a new commit (or error with a pre-filled PR URL)
[ ] Check that everything works properly and mirror the bower registry in a GitHub repository
[ ] Immediately deploy the staging registry under the production endpoint
[ ] Change the documentation about editing the registry

Did I miss something? Is it really worth it?

josh commented 10 years ago

> Did I miss something? Is it really worth it?

I think it needs more buy in from other involved peeps.

patrickkettner commented 9 years ago

why the closure?

paulirish commented 9 years ago

I believe it's because there's not enough engineering interest currently. So better to keep it closed, as it's unlikely to happen.

bower / registry

Crazy Idea: git backend registry #97