Open dominicbarnes opened 10 years ago
Setting up a webhook using a publish command sounds awesome. Zero friction.
Github could rate-limit the server if it’s using the API on every push.
One thing Component got right, even accidentally, was that the wiki was really easy to search through. Searching via keywords doesn't really work that well and is the main reason npm sucks for finding things. Using the keywords to group by category would be the best way to do it. This allows people to narrow it down themselves and then search from there. That's basically how the app store works: you need people to discover things that way rather than by searching, because they don't know what to search for.
Might be worth having a fixed set of keywords that group them on the main page, and then all other packages that don’t fall into that can just be searched for or there could just be a giant list.
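The fixed-keyword grouping could be sketched roughly like this (the category names and the package shape here are made up for illustration, not an existing duo API):

```javascript
// Hypothetical sketch of grouping registry entries under a fixed set of
// category keywords; everything that doesn't match falls into "other".
// The category names and the package shape are assumptions for illustration.
var CATEGORIES = ['ui-element', 'utility', 'async', 'dom'];

function groupByCategory(packages) {
  var groups = { other: [] };
  CATEGORIES.forEach(function (category) {
    groups[category] = [];
  });

  packages.forEach(function (pkg) {
    var matched = (pkg.keywords || []).filter(function (keyword) {
      return CATEGORIES.indexOf(keyword) !== -1;
    });
    if (!matched.length) {
      groups.other.push(pkg.repo);
    } else {
      matched.forEach(function (keyword) {
        groups[keyword].push(pkg.repo);
      });
    }
  });

  return groups;
}
```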
Basically, discoverability is really important :)
@anthonyshort I totally agree, I never used component.io, just pulled up the wiki and searched w/ my browser. The categories were indispensable for me too, which is why I want to put emphasis on them.
@dominicbarnes thanks for putting this together! I definitely think this approach makes sense.
The only thing I'd like to pull in is the discussion @ianstormtaylor and i had here: https://github.com/segmentio/khaos/issues/41
I think if we go down this route, the registry should be namespaced to allow future projects to use it. In fact, I don't think this should even be a Duo-specific registry.
One thing Component got right, even accidentally, was that the wiki was really easy to search through.
Agreed, that's definitely what helped get the project off the ground and if you remember Node did that same thing back in the day, when there were only a handful of modules.
I figure that each webhook call would trigger a "scrape" of the repository. (depending on what information comes in the payload of course)
@dominicbarnes would love to get the description and perhaps readme and code scraped. Github search is nice cause it also combs through code / comments. That'd definitely increase the scope of the project.
I'm fine with whatever on the application architecture. We could even go simpler than that and use https://github.com/bigeasy/locket and maybe a pure JS search/indexer. It definitely wouldn't scale, but it might just be enough for now, as long as the registry only handles search.
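For scale, the "pure JS search/indexer" idea could be as small as an in-memory inverted index; a sketch (this is not locket's API, every name here is made up for illustration):

```javascript
// Very rough sketch of the "pure JS search/indexer" idea: an in-memory
// inverted index over whatever text we scrape (name, description, etc).
// This is not locket's API; every name here is made up for illustration.
function createIndex() {
  var index = {}; // token -> set of repo names

  return {
    add: function (repo, text) {
      String(text).toLowerCase().split(/\W+/).forEach(function (token) {
        if (!token) return;
        index[token] = index[token] || {};
        index[token][repo] = true;
      });
    },
    search: function (term) {
      return Object.keys(index[term.toLowerCase()] || {});
    }
  };
}
```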
How would we transition existing repos over to this new registry?
@MatthewMueller I think we can write a script that can take care of everything in component/*. (eg: iterate repos, clone, check for component.json, duo publish)
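That script might look something like this sketch, with the repo listing, manifest check, and publish step injected so the flow is visible (all the function names are made up; a real version would hit the GitHub API for the listing, clone each repo, and shell out to `duo publish`):

```javascript
// Hypothetical shape of the component/* migration script. The repo list,
// the manifest check, and the publish step are injected here so the flow
// is visible; a real version would hit the GitHub API for the listing,
// clone each repo, and shell out to `duo publish`.
function migrate(repos, hasComponentJson, publish) {
  var published = [];
  repos.forEach(function (repo) {
    if (hasComponentJson(repo)) {
      publish(repo);
      published.push(repo);
    }
  });
  return published;
}
```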
I think the amount of work required of devs is pretty minimal, it's arguably even simpler than editing a wiki, so I think we make sure to broadcast that out and let the devs take care of it themselves. Perhaps we can use a bot to traverse everything in the component wiki and open an issue asking them to consider adding to the new duo registry. (which they can close if they choose not to)
Yeah we can even just index everything in the registry automatically for them? Cuz I think we won't need any permissions or anything?
@ianstormtaylor I think we can do a 1-time pass for indexing things. However, those will involve API calls, so we may need to batch them or something. (since we'll have a single server that's making the API calls) This also means we probably won't be automatically updating them, which is where we need repo authors to add webhooks.
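The batching itself is straightforward; a sketch (the chunk size below is an arbitrary guess at what fits under the rate limit, not a measured number):

```javascript
// Sketch of batching the one-time indexing pass so a single server stays
// under GitHub's rate limit: split the repo list into chunks, scrape one
// chunk, wait out the rate-limit window, repeat. The chunk size is an
// arbitrary guess, not a measured number.
function batch(items, size) {
  var batches = [];
  for (var i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage idea:
// batch(allRepos, 100).forEach(function (chunk) {
//   // scrape every repo in `chunk`, then sleep until the window resets
// });
```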
Ah true, I forgot about that requirement. Sounds good to me.
Well, in actuality any scrape we do is going to involve API calls (to retrieve the component.json, package.json, etc) since a "push" webhook won't have all of those details in the payload. Perhaps the "scrape" on every "push" isn't a solution that will scale.
Maybe we would need to scrape more sparingly. For example, on the initial publish as well as on "create" (webhook for tag/branch creation) instead of on "push". Depending on what commit information is shown in a "push" event, maybe we can inspect for changes to the manifest and conditionally scrape if we think something has changed there.
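Push payloads do list each commit's added/modified/removed paths, so the conditional-scrape check could be roughly (the function name is made up):

```javascript
// Sketch of the conditional scrape: GitHub "push" payloads list each
// commit's added/modified/removed paths, so we can skip the scrape when
// no commit touched a manifest. The function name is made up.
var MANIFESTS = ['component.json', 'package.json'];

function manifestChanged(payload) {
  return (payload.commits || []).some(function (commit) {
    return ['added', 'modified', 'removed'].some(function (kind) {
      return (commit[kind] || []).some(function (path) {
        return MANIFESTS.indexOf(path) !== -1;
      });
    });
  });
}
```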
Copying my thoughts from the dupe I created here:
While I love that there's no central place for aggregating packages and that everyone just hosts their own, that definitely makes discovery a lot more painful than, say, npm. That being said, I think this problem should be easy to solve, and I hope duo will solve it to be more appealing for everyone. I propose adding three commands:
The first should just look at the git config of the CWD and register the GitHub repo in the duo registry server. The second should unregister it.
The third should query the duo registry to discover matching packages.
Now, as far as the discussion goes here, I do not think there is a need for tracking pushes to GitHub. All the registry needs is a pointer to the GitHub repo; version lookups etc. can be done live using the GitHub API. While it may be a little challenging to do well in the CLI, it should be just fine in the webapp version, since additional details of the search results can be progressively added as they are fetched from the GitHub API. This also eliminates a whole class of out-of-sync issues that may occur if I delete tags or force-push, etc.
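For example, version lookup could stay live by mapping the tags endpoint response (`GET /repos/:owner/:repo/tags` returns objects with a `name` field) to versions at request time; a sketch (the fetch itself is omitted, and the function name is made up):

```javascript
// Sketch of the "query GitHub live" idea: the registry stores only an
// "owner/repo" pointer, and versions are derived at request time from the
// tags endpoint (GET /repos/:owner/:repo/tags returns objects with a
// `name` field). The fetch itself is omitted; the function name is made up.
function tagsToVersions(tags) {
  return tags
    .map(function (tag) {
      return tag.name.replace(/^v/, '');
    })
    .filter(function (name) {
      return /^\d+\.\d+\.\d+/.test(name);
    });
}
```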
I think it would also be nice to generate the description from the readme files in the repo rather than having to specify one during registration.
`duo find`
`duo search`
!!
Since it will support all Bower packages, why not base it off Bower, making it easier for developers to adopt? https://github.com/bower/registry
```
$ bower
Usage:

    bower <command> [<args>] [<options>]

Commands:

    cache           Manage bower cache
    help            Display help information about Bower
    home            Opens a package homepage into your favorite browser
    info            Info of a particular package
    init            Interactively create a bower.json file
    install         Install a package locally
    link            Symlink a package folder
    list            List local packages
    lookup          Look up a package URL by name
    prune           Removes local extraneous packages
    register        Register a package
    search          Search for a package by name
    update          Update a local package
    uninstall       Remove a local package
    version         Bump a package version

Options:

    -f, --force     Makes various commands more forceful
    -j, --json      Output consumable JSON
    -l, --log-level What level of logs to report
    -o, --offline   Do not hit the network
    -q, --quiet     Only output important information
    -s, --silent    Do not output anything, besides errors
    -V, --verbose   Makes output more verbose
    --allow-root    Allows running commands as root
    --version       Output Bower version

See 'bower help <command>' for more information on a specific command.
```
:+1: for duo search
I think @dominicbarnes's idea for having it hooked into GitHub pushes is key for keeping the registry up to date easily though. If we're going to want to provide more than just a URL in the search results we'll want descriptions and readmes and stuff, so it would be nice if they just stayed up to date automatically? Would be curious to know what Bower does for that or what their search even looks like.
+1 to:

```
duo register
duo unregister
duo search
```
Well the fact that you'll have to sync opens opportunity to get out of sync. What I am suggesting to just query github API live to get most up to date info during search. That way you can't get out of sync :)
@Gozala using GitHub Web Hooks is the way to keep in sync, we're not talking about something like `npm publish`. GitHub will trigger updates to the Registry for many different events that developers will take on their repo. Thus, `duo publish` will basically open up a stream of updates, whereas `duo unpublish` will shut it off.
I know how GitHub web hooks work. All I'm trying to say is that I think this introduces complexity, will use more space, and has an opportunity to get out of sync (server is down, or bugs, or whatever). Querying GitHub live is free of all these constraints, although in practice it may end up a little too slow; if it isn't, I think it's a lot simpler option.
That being said I don't have anything against what's already proposed.
Doing it all live would definitely be ideal, but I don't think it'd be possible to get the most up-to-date information from GitHub in a short enough time for things like `$ duo search event`, which would return 10–100s of results? Search should be sub-second to be useful, I think.
I created a super basic duo-search to save me a little time, but it just uses GitHub's search API to find JavaScript repos matching a keyword: https://github.com/johntron/duo-search. I'll happily transfer ownership.
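For reference, a duo-search-style lookup essentially boils down to building a query against GitHub's repository search endpoint; something like this sketch (the `language:` qualifier and endpoint are real GitHub search API syntax, but the function name is made up):

```javascript
// Rough sketch of what a duo-search-style lookup does under the hood:
// build a repository search query for JavaScript repos matching a keyword.
// The endpoint and the `language:` qualifier are real GitHub search API
// syntax; the function name is made up.
function searchUrl(keyword) {
  var query = keyword + ' language:javascript';
  return 'https://api.github.com/search/repositories?q=' +
    encodeURIComponent(query) + '&sort=stars';
}
```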
@dominicbarnes I'd love to help setup the registry - have you already started?
@johntron I never really got anywhere with it, haven't taken the time. I would love to see `duo-search` become a JS API that can be consumed by multiple tools. (eg: a CLI, a web app, etc)
@dominicbarnes done: https://github.com/johntron/duo-search#usage-api
So, I've been thinking about this a lot over the last couple of days. I wanted to start a discussion by sharing my thoughts on this, before I started writing any real code.
I think the best place to start is to have the registry only responsible for searching. I don't think it's necessary to add tarballs or even deal with versioning at all.
With that being said, I think the registry would only need a few bits of data. (the rest lives in the source repository, so just a link should suffice here)
Now, onto the topic of how data is added to the registry. After considering several alternatives, I think the best way to approach this is to use webhooks. We can make this extremely easy for duo users by having `duo publish` use the GitHub API to add the webhook for them. (then once the hook is in, no more work will be needed by the developer) I'm thinking the "push" event is probably more than enough, although we can easily add more.

We will need some sort of custom server ready to handle webhooks from various services. GitHub is obvious, but BitBucket support is in the works too, so we should probably have this server pluggable. (or at least easily extendable) It should probably support a "manual" API, allowing people to manually add their repo, without the continuous stream of webhooks.

I figure that each webhook call would trigger a "scrape" of the repository. (depending on what information comes in the payload of course) The `component.json` would be checked first, falling back to a `package.json`, and lastly to the repository meta itself. (all using the GitHub API I presume)

Lastly, I think that "keywords" can be used to group components in the search interface. I'm thinking duo can have a few special cases that likely match the structure of the wiki now. (eg: "ui-element", "utility", "async", etc) Beyond that level of structure, I think the rest of the fields are just searchable as plaintext.
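The fallback order described above could be sketched as a pure function over the already-fetched sources (the shape and function name are illustrative assumptions):

```javascript
// Sketch of the scrape fallback order: prefer component.json, then
// package.json, then the repository metadata itself. The three sources
// arrive here already fetched (via the GitHub API); the shape and the
// function name are illustrative assumptions.
function pickMetadata(componentJson, packageJson, repoMeta) {
  var manifest = componentJson || packageJson || {};
  return {
    name: manifest.name || repoMeta.name,
    description: manifest.description || repoMeta.description || '',
    keywords: manifest.keywords || []
  };
}
```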
As far as the technical implementation details, there are lots of possibilities. My first thoughts would be Heroku for the app, CouchDB for persistence and Elasticsearch for the actual searching/indexing. But depending on who would like to collaborate, how we would want to deploy, etc we can always work with other tools.
Anyways... sorry about the huge blob of text, as you can tell I've put a lot of thought into this lol
tl;dr