OpenHospitalityNetwork / openHospitality.network

Website, community issues tracker and wiki
https://openHospitality.network
GNU Affero General Public License v3.0

Federated search #3

Open nicksellen opened 3 years ago

nicksellen commented 3 years ago

A tricky topic is federated search (of hosting offers). I had a little dig around, and can imagine three general approaches:

Possibly/partly related links:

aschrijver commented 3 years ago

Shall we give these options distinctive names? E.g.:

  1. External Search Services
  2. Integrated Full Indexer
  3. Federated Queries

The 'trickiness' depends in part on what you want to search for, i.e. whether it's just metadata search (host name, location, etc.) or also full-text search of description fields (e.g. host profile descriptions, discussion threads).

For metadata-only search, Option 2 might be viable.

Federated Queries are quite interesting. If we investigate that option further, it has broader, more general applications, and we could cooperate with e.g. Inventaire and others to create a standardized way to do this.

aschrijver commented 3 years ago

There's a topic on SocialHub, Querying ActivityPub collections, where I mentioned the m-ld protocol, which looks interesting for Option 3.

I don't know yet how that would fit in, and created https://github.com/m-ld/m-ld-spec/issues/64 to ask.

mariha commented 3 years ago

There is one more option, let's call it 0.

For 0. (distributed index as an external service): Couchbase can be deployed with an index as a service external to the data storage (on a single node, or distributed across a few nodes). The index uses a distributed hash table as its data structure. I think (this needs to be double-checked) secondary indexes in Couchbase are built with prefix hash trees, based on tries. We planned to implement something similar for a distributed index in IBM Cloud Object Storage.
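To make option 0 a bit more concrete, here's a minimal sketch of the hashing idea behind such a distributed external index: each indexed key is routed to one of a few index nodes. The node names and the FNV-1a hash are placeholders for illustration, not how Couchbase actually partitions its indexes:

```typescript
// Minimal sketch of option 0: an index service spread over a few nodes
// via hashing. Node names and hash choice are illustrative only.

// 32-bit FNV-1a hash of a string key.
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// The few nodes the external index is distributed over (hypothetical).
const indexNodes = ["index-a.example", "index-b.example", "index-c.example"];

// Route an indexed key (e.g. a location cell or a host-name prefix)
// to the node responsible for storing and serving it.
function nodeFor(key: string): string {
  return indexNodes[fnv1a(key) % indexNodes.length];
}
```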

For 3. (distributed queries): I think this is how MongoDB does distributed search.
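Roughly, such a distributed query is a scatter-gather: the query is fanned out to every instance (or shard) in parallel and the replies are merged. A minimal sketch, with made-up instance URLs and a hypothetical /search endpoint:

```typescript
// Minimal sketch of option 3 (distributed query / scatter-gather).
// The /search endpoint and result shape are assumptions for illustration.

interface Offer {
  id: string;
  lat: number;
  lon: number;
}

async function federatedSearch(instances: string[], query: string): Promise<Offer[]> {
  // Scatter: send the query to every instance in parallel,
  // tolerating instances that are down or slow to answer.
  const replies = await Promise.allSettled(
    instances.map(async (base) => {
      const res = await fetch(`${base}/search?q=${encodeURIComponent(query)}`);
      if (!res.ok) throw new Error(`search failed on ${base}`);
      return (await res.json()) as Offer[];
    })
  );
  // Gather: merge whatever came back successfully into one result list.
  return replies.flatMap((r) => (r.status === "fulfilled" ? r.value : []));
}
```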

chmac commented 2 years ago

@mariha How does option 0 differ from option 1? I see them as the same. It's an independent index, hosted externally somewhere. Introduces all the problems associated with centralisation... 🤔

mariha commented 2 years ago

There is technical centralization of infrastructure (with potential issues of contention, overloading, etc.) and human centralization of governance (with power issues, for example).

So I guess I meant that, from a technical perspective, an external index can be decentralized (0.) or not (1.). At the beginning a simpler architecture might be enough; then, when (and if) needed, it could be redesigned for greater scalability.

At the beginning we have a few bigger platforms, and there are not that many users altogether (70k on TR, 160k on BW, only CS claims 12m...). Over time, when (and if) there were smaller groups or single-user instances for which it was too much to store info about all hosts in the network (2.) or to serve search requests from all of the users (3.), there might (potentially) be an external index to offload them.

mariha commented 2 years ago

Answering @chmac's comment from elsewhere...

If we have 100 networks, each containing 1k members, and they all interoperate (à la Matrix) with all other servers, then we have a network of 100k users.

How do we manage search?

  • Query forwarding: Whenever somebody searches, their "home" server sends their query to all other servers who all reply
  • User forwarding: Each server sends a copy of some basic user info to all other servers

In the Query Forwarding model, if a server is offline, its users are not visible. Each server has to handle "load" for every user of every site. So to participate in the network by hosting a small server with, say, me and 10 friends, I need to handle the search traffic of 100k users.

In the User Forwarding model each server needs to store all the users of the entire fediverse. In this example that means 100k user profiles. Then each server needs to receive and process a constant stream of updates from each other server. Every time a user changes some aspect of their profile this event is propagated to every single other server.

For our scenario of storing hosting offers, I presume profile updates are much less frequent than searches, so without doing any estimations, user forwarding (2.) seems to create less traffic than query forwarding (3.), and most requests would be served locally. We could forward as little as a user id and their geographical location (longitude, latitude). Based on that we can display a pin on a map and, when it is selected, maybe request more details from the user's home instance, located by their id (either directly with a user@host scheme, or with some address resolution service with uuids). With that request we can also choose how much detail to reveal.
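A minimal sketch of that flow, assuming a user@host id scheme and a hypothetical /profile endpoint on the home instance:

```typescript
// Minimal sketch of option 2 (user forwarding): each instance keeps a tiny
// local replica of every remote user's id and location, answers searches
// locally, and only contacts the home instance when a pin is selected.

interface Pin {
  id: string; // e.g. "alice@bewelcome.example" (assumed id scheme)
  lat: number;
  lon: number;
}

const replica = new Map<string, Pin>();

// Apply a forwarded update from another instance (create, move or delete).
function applyUpdate(pin: Pin, deleted = false): void {
  if (deleted) replica.delete(pin.id);
  else replica.set(pin.id, pin);
}

// Local search: all pins inside a map's bounding box, no network traffic.
function searchBox(minLat: number, maxLat: number, minLon: number, maxLon: number): Pin[] {
  return [...replica.values()].filter(
    (p) => p.lat >= minLat && p.lat <= maxLat && p.lon >= minLon && p.lon <= maxLon
  );
}

// Only when a pin is selected do we ask the home instance for details,
// which also lets that instance decide how much to reveal to us.
async function fetchDetails(pin: Pin): Promise<unknown> {
  const host = pin.id.split("@")[1];
  const res = await fetch(`https://${host}/profile/${encodeURIComponent(pin.id)}`);
  return res.json();
}
```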

Query forwarding is what @chagai95 implemented for the demo, and with only a few big instances, maybe with some caching and batching, I can imagine it could work quite well for a while (and we may never get to this point of connectedness anyway). It also allows us to restrict who can ever see a host.

If at some point we wanted to implement user (traveler) tracking, which would require updating their location (and an index) frequently, query forwarding might be better for that.

Also, some other option/design might be better for publishing (broadcasting) hosting requests. Unlike hosting offers, they are time-constrained. Maybe also different in some other ways…


I also did some estimations, more as an exercise than anything else. For user (hosting offer) forwarding (2.):

Let me know if I made any mistakes (quite probable!) or wrong assumptions.

chmac commented 2 years ago

@mariha Very cool. Bringing numbers into the discussion makes a lot of sense. Personally, I'm of the opinion that any network of >1m people will fail for political/spam reasons. So the calculation of having a max of 1m users with lat/long and some profile identifier at <50MB sounds totally reasonable to me.

I don't see any obvious issues with your calculations. Maybe there's extra data along with the identifier, some cache of the profile, or at least a name, trust score, etc. But that's also still small data, and only text. So it seems to me from these numbers that doing local search, and forwarding updates, makes a lot of sense.

There's also a privacy enhancement to searching your own data set locally. Either as a single node operator, or a network, I don't necessarily want to tell all other nodes which areas I'm interested in travelling to, how many visitors I have on my site, etc.

Based on HaS being the largest network currently, with 170k users, an index of profile IDs and locations would be at most 5.7MiB (based on your 34B per id + lat + lon). That's plenty small enough to be in memory on any system.

mariha commented 2 years ago

(there is actually an error 😎! latitude and longitude do not fit in 1B each; we may need an 8B double float to store each of them (or 4B with some precision loss, maybe acceptable) - that makes an index with all HaS user ids and locations up to 7.8MiB, which I think is also fine...)
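For concreteness, a quick re-run of the numbers (assuming the id takes the remaining 32B of the 34B record):

```typescript
// Back-of-the-envelope index sizes for all 170k HaS users,
// assuming a 32B user id plus two coordinates.
const users = 170_000;
const MiB = 1024 * 1024;

const oneBytePerCoord = users * (32 + 1 + 1); // the original (erroneous) 34B record
const doublePerCoord = users * (32 + 8 + 8);  // 8B double floats per coordinate
const floatPerCoord = users * (32 + 4 + 4);   // 4B floats, some precision loss

console.log((oneBytePerCoord / MiB).toFixed(1)); // ~5.5 MiB
console.log((doublePerCoord / MiB).toFixed(1));  // ~7.8 MiB
console.log((floatPerCoord / MiB).toFixed(1));   // ~6.5 MiB
```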