Crawl any domain

astro commented 12 years ago

A few more servers are popping up across the Internet. To begin with, the crawler could find them by, for example, watching which JIDs post items on known channels.

For now, the ability to query multiple domains in one crawler instance would be a good start.

Did you implement service discovery at all? There are two strong reasons for this:

1. A channel server instance does not need to be defined. Knowing a domain name is enough.
2. Authority of content: channels.evil.com should not be the source for /user/*@trusted.org/*

Just ping me if you need any help. Discovery is really just two XEP-0030 queries.
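
To illustrate the two XEP-0030 queries and the authority rule described above, here is a minimal Python sketch. It only builds and parses stanzas with the standard library; sending them over an XMPP connection is left out. The function names, the `discovered` mapping, and the `category='pubsub'`/`type='channels'` identity check are assumptions made for illustration, not taken from the crawler's code.

```python
import re
import xml.etree.ElementTree as ET

DISCO_ITEMS = "http://jabber.org/protocol/disco#items"
DISCO_INFO = "http://jabber.org/protocol/disco#info"


def disco_query(to_jid, namespace, stanza_id):
    """Build an XEP-0030 <iq type='get'> stanza and return it as a string."""
    iq = ET.Element("iq", {"type": "get", "to": to_jid, "id": stanza_id})
    ET.SubElement(iq, "query", {"xmlns": namespace})
    return ET.tostring(iq, encoding="unicode")


def item_jids(items_result_xml):
    """Query 1 (disco#items on the domain): list the JIDs of its services."""
    root = ET.fromstring(items_result_xml)
    return [item.get("jid") for item in root.iter("{%s}item" % DISCO_ITEMS)]


def is_channel_server(info_result_xml):
    """Query 2 (disco#info on each item): look for a channel server identity.

    Assumption: the server advertises category='pubsub', type='channels'.
    """
    root = ET.fromstring(info_result_xml)
    for identity in root.iter("{%s}identity" % DISCO_INFO):
        if identity.get("category") == "pubsub" and identity.get("type") == "channels":
            return True
    return False


# Matches nodes such as /user/alice@trusted.org/posts and captures the domain.
USER_NODE = re.compile(r"^/user/([^/@]+)@([^/]+)/")


def is_authoritative(server_jid, node, discovered):
    """Accept a node only from the server discovered for the node's own domain.

    `discovered` is a hypothetical mapping of domain -> channel server JID,
    filled in from the two queries above.
    """
    match = USER_NODE.match(node)
    return match is not None and discovered.get(match.group(2)) == server_jid
```

With `discovered = {"trusted.org": "channels.trusted.org"}`, content for `/user/alice@trusted.org/posts` offered by `channels.evil.com` would be rejected, while the same node from `channels.trusted.org` would be accepted.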

abmargb commented 12 years ago

Service discovery is already implemented with XEP-0030. The crawler actually tries to be as generic as possible with regard to PubSub. Discovering new servers should already work: the crawler finds them by checking the server part of the JID of new posts and users. I'll check the database at crater to see whether this is working.
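
The "server part of the JID" approach described above could be sketched roughly as follows. This is an illustrative Python example, not the crawler's actual code; `jid_domain` and `new_domains` are hypothetical names, and the sketch assumes JIDs of the form `[local@]domain[/resource]`.

```python
def jid_domain(jid):
    """Return the lower-cased domain part of a JID like local@domain/resource."""
    bare = jid.split("/", 1)[0]           # drop the resource, if any
    return bare.split("@", 1)[-1].lower()  # drop the local part, if any


def new_domains(poster_jids, known_domains):
    """Return domains seen in poster JIDs that are not yet known to the crawler."""
    return {jid_domain(jid) for jid in poster_jids} - set(known_domains)
```

For example, `new_domains(["alice@buddycloud.org/web", "bob@Example.com"], {"buddycloud.org"})` yields `{"example.com"}`, which would then be a candidate for service discovery.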

astro commented 12 years ago

Our domain topics.buddycloud.org may be configured manually. Topic nodes have no associated user and therefore never post anywhere themselves :)

abmargb commented 12 years ago

Sure :) We can define a static set of domains to be crawled first.
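
One way to picture a static seed set feeding the crawl is a simple breadth-first frontier. This is a rough Python sketch under assumptions: `crawl` and the `discover_domains` callback are hypothetical, and the callback stands in for whatever the crawler does to find further domains from a crawled one.

```python
from collections import deque


def crawl(seed_domains, discover_domains):
    """Visit seed domains first, then any domains discovered along the way.

    `discover_domains(domain)` is a hypothetical callback returning the
    domains found while crawling `domain` (e.g. from poster JIDs).
    Returns the domains in the order they were visited.
    """
    queue = deque(seed_domains)
    seen = set(seed_domains)
    visited = []
    while queue:
        domain = queue.popleft()
        visited.append(domain)
        for found in discover_domains(domain):
            if found not in seen:  # never enqueue a domain twice
                seen.add(found)
                queue.append(found)
    return visited
```

A domain such as topics.buddycloud.org, which never shows up as a poster, would still be visited because it is listed among the seeds.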

imaginator commented 12 years ago

one root to rule them all? ;)

abmargb commented 12 years ago

yep!

buddycloud / channel-directory

Crawl any domain #22