OpenHospitalityNetwork / openHospitality.network

Website, community issues tracker and wiki
https://openHospitality.network
GNU Affero General Public License v3.0

Federated search #3

Open nicksellen opened 3 years ago

nicksellen commented 3 years ago

A tricky topic is federated search (of hosting offers). I had a little dig around, and can imagine three general approaches:

Possibly/partly related links:

aschrijver commented 3 years ago

Shall we give these options distinctive names? E.g.:

  1. External Search Services
  2. Integrated Full Indexer
  3. Federated Queries

The 'trickiness' depends in part on what you want to search for, i.e. whether it's just metadata search (host name, location, etc.) or also full-text search of description fields (e.g. host profile descriptions, discussion threads).

For metadata-only search, Option 2 might be viable.

Federated Queries are quite interesting. If we investigate that option further, it has broader, more general applications, and we could cooperate with e.g. Inventaire and others to create a standardized way to do this.

aschrijver commented 3 years ago

There's a topic on SocialHub, Querying ActivityPub collections, where I mentioned the m-ld protocol, which looks interesting for Option 3.

I don't know yet how that would fit in, and created https://github.com/m-ld/m-ld-spec/issues/64 to ask.

mariha commented 3 years ago

There is one more option, let's call it 0.

For 0. (distributed index as an external service): Couchbase can be deployed with an index as a service external to the data storage (on a single node, or distributed across a few nodes). The index uses a distributed hash table as its data structure. I think (this needs to be double-checked) secondary indexes in Couchbase are built with prefix hash trees, based on tries. We planned to implement something similar for a distributed index in IBM Cloud Object Storage.
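To make option 0 a bit more concrete, here's a minimal sketch of the hashing idea behind such a distributed external index: each indexed key is routed to one of a few index nodes. The node names and the FNV-1a hash are placeholders for illustration, not how Couchbase actually partitions its indexes:

```typescript
// Minimal sketch of option 0: an index service spread over a few nodes
// via hashing. Node names and hash choice are illustrative only.

// 32-bit FNV-1a hash of a string key.
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// The few nodes the external index is distributed over (hypothetical).
const indexNodes = ["index-a.example", "index-b.example", "index-c.example"];

// Route an indexed key (e.g. a location cell or a host-name prefix)
// to the node responsible for storing and serving it.
function nodeFor(key: string): string {
  return indexNodes[fnv1a(key) % indexNodes.length];
}
```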

For 3. (distributed queries): I think this is how MongoDB does distributed search.
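Roughly, such a distributed query is a scatter-gather: the query is fanned out to every instance (or shard) in parallel and the replies are merged. A minimal sketch, with made-up instance URLs and a hypothetical /search endpoint:

```typescript
// Minimal sketch of option 3 (distributed query / scatter-gather).
// The /search endpoint and result shape are assumptions for illustration.

interface Offer {
  id: string;
  lat: number;
  lon: number;
}

async function federatedSearch(instances: string[], query: string): Promise<Offer[]> {
  // Scatter: send the query to every instance in parallel,
  // tolerating instances that are down or slow to answer.
  const replies = await Promise.allSettled(
    instances.map(async (base) => {
      const res = await fetch(`${base}/search?q=${encodeURIComponent(query)}`);
      if (!res.ok) throw new Error(`search failed on ${base}`);
      return (await res.json()) as Offer[];
    })
  );
  // Gather: merge whatever came back successfully into one result list.
  return replies.flatMap((r) => (r.status === "fulfilled" ? r.value : []));
}
```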

chmac commented 2 years ago

@mariha How does option 0 differ from option 1? I see them as the same. It's an independent index, hosted externally somewhere. Introduces all the problems associated with centralisation... 🤔

mariha commented 2 years ago

There is technical centralization of infrastructure (with potential issues of contention, overloading, etc.) and human centralization of governance (with power issues, for example).

So I guess I meant that, from a technical perspective, an external index can be decentralized (0.) or not (1.). At the beginning a simpler architecture might be enough; then, when (and if) needed, it could be redesigned for greater scalability.

At the beginning we have a few bigger platforms, and there are not that many users altogether (70k on TR, 160k on BW, only CS claims 12m...). Over time, when (and if) there were smaller groups or single-user instances for which it was too much to store info about all hosts in the network (2.) or to serve search requests from all of the users (3.), there might (potentially) be an external index to offload them.

mariha commented 2 years ago

Answering @chmac's comment from elsewhere...

If we have 100 networks, each containing 1k members, and they all interoperate (à la Matrix) with all other servers, then we have a network of 100k users.

How do we manage search?

  • Query forwarding: Whenever somebody searches, their "home" server sends their query to all other servers who all reply
  • User forwarding: Each server sends a copy of some basic user info to all other servers

In the Query Forwarding model, if a server is offline, its users are not visible. Each server has to handle "load" for every user of every site. So to participate in the network by hosting a small server with, say, me and 10 friends, I need to handle the search traffic of 100k users.

In the User Forwarding model each server needs to store all the users of the entire fediverse. In this example that means 100k user profiles. Then each server needs to receive and process a constant stream of updates from each other server. Every time a user changes some aspect of their profile this event is propagated to every single other server.

For our scenario of storing hosting offers, I presume profile updates are much less frequent than searches, so without doing any estimations, user forwarding (2.) seems to create less traffic than query forwarding (3.), and most requests would be served locally. We could forward as little as a user id and their geographical location (longitude, latitude). Based on that we can display a pin on a map and, when it is selected, maybe request more details from the user's home instance, located by their id (either directly with a user@host scheme, or with some address resolution service with uuids). With that request we can also choose how much detail to reveal.
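A minimal sketch of that flow, assuming a user@host id scheme and a hypothetical /profile endpoint on the home instance:

```typescript
// Minimal sketch of option 2 (user forwarding): each instance keeps a tiny
// local replica of every remote user's id and location, answers searches
// locally, and only contacts the home instance when a pin is selected.

interface Pin {
  id: string; // e.g. "alice@bewelcome.example" (assumed id scheme)
  lat: number;
  lon: number;
}

const replica = new Map<string, Pin>();

// Apply a forwarded update from another instance (create, move or delete).
function applyUpdate(pin: Pin, deleted = false): void {
  if (deleted) replica.delete(pin.id);
  else replica.set(pin.id, pin);
}

// Local search: all pins inside a map's bounding box, no network traffic.
function searchBox(minLat: number, maxLat: number, minLon: number, maxLon: number): Pin[] {
  return [...replica.values()].filter(
    (p) => p.lat >= minLat && p.lat <= maxLat && p.lon >= minLon && p.lon <= maxLon
  );
}

// Only when a pin is selected do we ask the home instance for details,
// which also lets that instance decide how much to reveal to us.
async function fetchDetails(pin: Pin): Promise<unknown> {
  const host = pin.id.split("@")[1];
  const res = await fetch(`https://${host}/profile/${encodeURIComponent(pin.id)}`);
  return res.json();
}
```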

Query forwarding is what @chagai95 implemented for the demo, and with only a few big instances, maybe with some caching and batching, I can imagine it could work quite well for a while (and we may never get to this point of connectedness anyway). It also allows us to restrict who can ever see a host.

If at some point we wanted to implement user (traveler) tracking, which would require updating their location (and an index) frequently, query forwarding might be better for that.

Also, some other option/design might be better for publishing (broadcasting) hosting requests. Unlike hosting offers, they are time-constrained. Maybe also different in some other ways…


I also did some estimations, more as an exercise than anything else. For user (hosting offer) forwarding (2.):

Let me know if I made any mistakes (quite probable!) or wrong assumptions.

chmac commented 2 years ago

@mariha Very cool. Bringing numbers into the discussion makes a lot of sense. Personally, I'm of the opinion that any network of >1m people will fail for political/spam reasons. So the calculation of having a max of 1m users with lat/long and some profile identifier at <50MB sounds totally reasonable to me.

I don't see any obvious issues with your calculations. Maybe there's extra data along with the identifier, some cache of the profile, or at least a name, trust score, etc. But that's also still small data, and only text. So it seems to me from these numbers that doing local search, and forwarding updates, makes a lot of sense.

There's also a privacy enhancement to searching your own data set locally. Either as a single node operator, or a network, I don't necessarily want to tell all other nodes which areas I'm interested in travelling to, how many visitors I have on my site, etc.

Based on HaS being the largest network currently, with 170k users, an index of profile IDs and locations would be at most 5.7MiB (based on your 34B per id + lat + lon). That's plenty small enough to be in memory on any system.

mariha commented 2 years ago

(there is actually an error 😎! latitude and longitude do not fit in 1B each; we may need an 8B double float to store each of them (or 4B with some precision loss, maybe acceptable) - that makes an index with all HaS user ids and locations up to 7.8MiB, which I think is also fine...)
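For concreteness, a quick re-run of the numbers (assuming the id takes the remaining 32B of the 34B record):

```typescript
// Back-of-the-envelope index sizes for all 170k HaS users,
// assuming a 32B user id plus two coordinates.
const users = 170_000;
const MiB = 1024 * 1024;

const oneBytePerCoord = users * (32 + 1 + 1); // the original (erroneous) 34B record
const doublePerCoord = users * (32 + 8 + 8);  // 8B double floats per coordinate
const floatPerCoord = users * (32 + 4 + 4);   // 4B floats, some precision loss

console.log((oneBytePerCoord / MiB).toFixed(1)); // ~5.5 MiB
console.log((doublePerCoord / MiB).toFixed(1));  // ~7.8 MiB
console.log((floatPerCoord / MiB).toFixed(1));   // ~6.5 MiB
```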