SpaceApi / directory

The Space API directory, modification through Pull Request

https://spaceapi.io/directory/

Decentralize directory #144

Open gidsi opened 4 years ago

gidsi commented 4 years ago

I would like to propose an addition to the directory to maybe make the repository obsolete in the future and get us out of the picture.

I would leave the directory.json as it is for now, but would like to enable spaces to provide the endpoint file in a standardized place (e.g. https://spacewebsite.com/spaceapi.json, comparable to robots.txt).

I would also give them the opportunity to override that default, either in a meta tag or as an HTTP header on their main site, e.g.

As part of the HTML

<html>
  <head>
    <meta name="spaceapi" content="https://example.com/myspaceapi/status.json">
  </head>
    ...
</html>

As an HTTP header

< HTTP/2 200
< content-type: text/html
< server: nginx
< x-spaceapi-endpoint: https://example.com/myspaceapi/status.json
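
The lookup described above could be sketched roughly like this in Python. The precedence (header first, then meta tag, then the default path) and the helper name `resolve_endpoint` are assumptions for illustration; the proposal does not fix either.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _SpaceApiMetaParser(HTMLParser):
    """Collects the content of <meta name="spaceapi" content="..."> if present."""

    def __init__(self):
        super().__init__()
        self.endpoint = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.endpoint is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "spaceapi":
            self.endpoint = attributes.get("content")


def resolve_endpoint(site_url, headers, html):
    """Return the endpoint URL for a space website (hypothetical helper).

    Assumed precedence: x-spaceapi-endpoint header, then meta tag,
    then the default location /spaceapi.json on the site.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    if lowered.get("x-spaceapi-endpoint"):
        return lowered["x-spaceapi-endpoint"]

    parser = _SpaceApiMetaParser()
    parser.feed(html)
    if parser.endpoint:
        return parser.endpoint

    return urljoin(site_url, "/spaceapi.json")
```

The function takes the already-fetched headers and HTML as arguments, so it can be tried without any network access.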

It would also be necessary to introduce a new field in the schema, something like this:

{
  "connectedTo": [
    "https://chaospott.de",
    "https://example.com",
    "https://chaosdorf.de",
    "https://c3l.lu"
  ]
}

This way we could crawl through the different endpoints, and spaces wouldn't need to create a pull request (I would like to automatically create pull requests for the ones we find, so that they end up in the static file too).
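
A minimal sketch of such a crawl, assuming each endpoint file carries the proposed "connectedTo" list of website URLs. The `fetch_space` callback is hypothetical and stands in for whatever HTTP and validation logic a real crawler would use.

```python
from collections import deque


def crawl(seed_urls, fetch_space):
    """Breadth-first walk over spaces linked through "connectedTo".

    fetch_space is a caller-supplied (hypothetical) function that takes a
    website URL and returns the parsed endpoint JSON as a dict, or None
    if the site is unreachable or serves no valid file.
    """
    seeds = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    queue = deque(seeds)
    seen = set(seeds)
    reachable = []

    while queue:
        url = queue.popleft()
        data = fetch_space(url)
        if data is None:
            continue  # dead link or invalid endpoint: skip, do not follow
        reachable.append(url)
        links = data.get("connectedTo", [])
        if not isinstance(links, list):
            continue
        for linked in links:
            if isinstance(linked, str) and linked not in seen:
                seen.add(linked)
                queue.append(linked)

    return reachable
```

The `seen` set keeps the walk from looping when two spaces link to each other.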

I would link to the website and rely on the default location, so spaces don't need to know the endpoint URL of the space they want to link to (it changes sometimes, and then they would have stale data that needs updating). That would also mean you could list spaces you know that don't have an endpoint yet, and we could automatically add them the moment they provide one.

The static file could become irrelevant and we would have a network instead (the data itself would also be quite interesting). If this doesn't work, everything would pretty much stay the same, so the risk is low in my opinion.

Also something like a "meta organization" would be possible in some way too.

It might be easier for people to join, since you can contact a space that will add you; you wouldn't need a GitHub account anymore, and you wouldn't need to know how to work with git.

What do you guys think?

rnestler commented 4 years ago

I'm all for decentralizing and it sounds interesting (I remember we also talked about it at a core meeting, no?). But one will always need a central list to bootstrap, no? Otherwise we may end up with incomplete directories. Also, for an app it is nice to have a centralized list. But this service could just crawl all the spaces in the background.

dbrgn commented 4 years ago

Ok, so the idea is to have a network of interlinking between space endpoints, similar to the "blogrolls" back when blogging was en vogue?

While the idea is interesting, I suspect that we'd have even more dead links or invalid endpoints than with a central directory. With the central directory, when you create a new space, you just update the JSON file and are done. With the decentralized directory, you first have to contact a space (you might not know the people there personally yet), find out who the sysadmin is, then ask that person to list/endorse you even if that person does not even know you. I don't think this is a simplification. I also think that maintenance of those links will not be done often, we already have trouble getting people to keep their own spaceapi endpoint up to date.

Also, for an application like MyHackerspace, crawling all spaces on every app startup is not feasible either, so it would still depend on a central directory, but that central directory would now be dependent on an ever-changing network of potentially outdated websites. If you have two different such caching services, they would likely contain different URLs.

Also, you could not get rid of the directory entirely, because you still have the bootstrapping issue. (See DHT based torrent clients that usually have a list of seed nodes hardcoded.)

If we want to decentralize the directory, I would rather tend towards a federated approach with multiple directories (but not without them). For example, in Switzerland the CCC-CH could host a directory with all member associations. In Finland, the Hacklab-People could host an endpoint for their members. Same thing for the CCC in Germany. Also, it doesn't need to be a formal organization, anyone could set up a subdirectory for a certain list of spaces.

This is quite similar to the approach you suggested, but it still differentiates between space endpoints and directories. We could also ask subdirectories to link back to the main directory, which would allow applications to use any subdirectory as an "entry point" into the federated network. If the main directory is not maintained anymore, then subdirectories would simply drop that from their list and applications pick a former subdirectory as the new entrypoint.

Of course this does not mean that we should not standardize the endpoint URL, however then it should use the .well-known directory. That way, just the plain domain of the directory server would be sufficient.
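
As a rough illustration of "just the plain domain would be sufficient", the lookup URL could be derived like this. The file name `spaceapi.json` under `.well-known` is a placeholder chosen for this sketch; the thread does not settle on a name, and nothing here implies a registered well-known URI.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_url(domain_or_url, name="spaceapi.json"):
    """Build https://<host>/.well-known/<name> from a plain domain or a URL.

    The default file name is a placeholder, not an agreed-upon value.
    """
    # Accept both "example.com" and "https://example.com/some/page".
    candidate = domain_or_url if "://" in domain_or_url else "https://" + domain_or_url
    parts = urlsplit(candidate)
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/" + name, "", ""))
```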

This would mean we'd still have some kind of centralization for initial discovery (which helps mobile applications a lot), but managing the members of the directory itself would be distributed. (Note that this can also mean that we'll have certain federated subdirectories that might be much less well maintained than the central one and could contain dead URLs etc).

gidsi commented 4 years ago

> Ok, so the idea is to have a network of interlinking between space endpoints, similar to the "blogrolls" back when blogging was en vogue?

Yeah, kinda, of course the idea is not new :)

> While the idea is interesting, I suspect that we'd have even more dead links or invalid endpoints than with a central directory. With the central directory, when you create a new space, you just update the JSON file and are done. With the decentralized directory, you first have to contact a space (you might not know the people there personally yet), find out who the sysadmin is, then ask that person to list/endorse you even if that person does not even know you. I don't think this is a simplification. I also think that maintenance of those links will not be done often, we already have trouble getting people to keep their own spaceapi endpoint up to date.

> Also, for an application like MyHackerspace, crawling all spaces on every app startup is not feasible either, so it would still depend on a central directory, but that central directory would now be dependent on an ever-changing network of potentially outdated websites. If you have two different such caching services, they would likely contain different URLs.

I would keep the list. I wouldn't make it the "truth" anymore, but I would still keep it as a service. What would be the problem if there are different caching services containing different URLs?

Yeah, we might have more invalid or dead links, but since we've shifted the truth over to the network we can remove and add them more easily without having to discuss under which circumstances we're changing the truth. If a space is invalid, just remove it; if it's not reachable, remove it.

We can also keep scraping events and just scrape new endpoints that got added.

> Also, you could not get rid of the directory entirely, because you still have the bootstrapping issue. (See DHT based torrent clients that usually have a list of seed nodes hardcoded.)

But wouldn't this make it already better? Right now we're basically the only "seed node".

> If we want to decentralize the directory, I would rather tend towards a federated approach with multiple directories (but not without them). For example, in Switzerland the CCC-CH could host a directory with all member associations. In Finland, the Hacklab-People could host an endpoint for their members. Same thing for the CCC in Germany. Also, it doesn't need to be a formal organization, anyone could set up a subdirectory for a certain list of spaces.

> This is quite similar to the approach you suggested, but it still differentiates between space endpoints and directories. We could also ask subdirectories to link back to the main directory, which would allow applications to use any subdirectory as an "entry point" into the federated network. If the main directory is not maintained anymore, then subdirectories would simply drop that from their list and applications pick a former subdirectory as the new entrypoint.

I don't like the "hierarchy" of it. I know it's not meant like this, but I'm pretty sure people will use it that way sooner or later.

> Of course this does not mean that we should not standardize the endpoint URL, however then it should use the .well-known directory. That way, just the plain domain of the directory server would be sufficient.

Sounds good! I would like to do that too. I would still use the URL the space provided as the website, but you could also have a 308 redirect from there to the correct URL. Do you know what we need for a registration?

dbrgn commented 4 years ago

> What would be the problem if there are different caching services containing different URLs?

It means that different apps showing "spaces from the SpaceAPI" will show different spaces. It generates confusion.

> I would keep the list. I wouldn't make it the "truth" anymore, but I would still keep it as a service.

Would we still encourage spaces to enter their data into our directory? If yes, I don't really see what we would win from the decentralization, because everyone will simply keep using our directory for querying spaces.

> I don't like the "hierarchy" of it. I know it's not meant like this, but I'm pretty sure people will use it that way sooner or later.

I see why you saw a hierarchy in there, but that's not how it was meant. The federated directories would be on the same level, you could use any one of them as an entry point (as long as they are properly interlinked).

(I do see a problem with the multi-directory approach though. How should conflicts be resolved if the same space was entered into multiple directories with differing endpoints?)
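
Detecting such conflicts, at least, is mechanical. A minimal sketch, assuming each directory is a JSON object mapping space names to endpoint URLs (that shape, the directory labels and the `find_conflicts` name are assumptions for illustration):

```python
def find_conflicts(directories):
    """Group endpoints by space name and report names with more than one.

    directories maps a directory label to a {space name: endpoint URL}
    dict; that shape is an assumption made for this sketch.
    """
    endpoints_by_space = {}
    for label, entries in directories.items():
        for space, endpoint in entries.items():
            # Remember which directories vouch for which endpoint.
            endpoints_by_space.setdefault(space, {}).setdefault(endpoint, []).append(label)

    return {
        space: sources
        for space, sources in endpoints_by_space.items()
        if len(sources) > 1
    }
```

The result only lists which directories disagree about which space; how to resolve a reported conflict remains the open question.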

> Do you know what we need for a registration?

We would need a formal IETF-style specification essentially describing the SpaceAPI. I'm not sure "check out this JSON schema" would suffice. However, we don't need a registration (it's a nice thing anyways), we can just use that directory as a convention.

dbrgn commented 4 years ago

Thinking about it some more, I think we need to answer the fundamental question whether the SpaceAPI should be a clearly defined spec + directory, or whether it should all be a decentralized thing where truth is found by aggregating the information from the network and where "different truths" can coexist.

If there is no single truth about the endpoints anymore, what about the spec itself? We have taken steps to make it more open (by saying that all keys can be used freely), but we still have a clear spec.

I'm not sure it makes sense to have a completely decentralized, multi-truth network of endpoints while having a centrally developed specification. I see two ways how the project could work:

Variant A: Centralized but open

The directory stays in a single place (the way it's handled right now). This is a big advantage for developers because they don't need elaborate crawling and conflict resolution tools, they can simply consume the directory. It's also an advantage for data quality, because we can easily remove old entries and do some general maintenance.

The spec also stays in a single place. The core is clearly defined by the SpaceAPI maintainers, but all spec changes are openly discussed. Everybody can propose changes and you can use non-standard fields if you wish. This has the advantage that there's a "single source of truth", but it's also slightly more hierarchical (although as we've seen the project can be forked if it dies).

Variant B: Decentralized database of keys and values

I'd summarize this as "OpenStreetMap" style. The OSM project is not primarily a map. OSM is a database of loosely specified key-value mappings tied to geographical objects. There is no single-source-of-truth specification that says how streets or houses should be tagged. There is a wiki, where people add proposals. Sometimes multiple tagging schemas are used at the same time. Renderers have to integrate all of them, or pick one. Sometimes people add data in a certain way because certain renderers render them in a certain way.

This is looser than what we do at the SpaceAPI right now. There is no OSM spec besides the wiki. The SpaceAPI spec could also be abolished and replaced with a continuously developed wiki of suggestions for key-value pairs. Consumers of the data may assume nothing about the fields; fields are consumed in an opportunistic style (use it if it's there and in the right format, ignore it otherwise).
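
Opportunistic consumption could look roughly like this on the client side. The helper `read_field` is hypothetical, and the keys in the usage note are only examples.

```python
def read_field(data, path, expected_type):
    """Return the value at a nested key path if present and well-typed, else None."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    # bool is a subclass of int in Python, so reject it unless asked for.
    if isinstance(current, bool) and expected_type is not bool:
        return None
    return current if isinstance(current, expected_type) else None
```

A client would call, for example, `read_field(endpoint, ["state", "open"], bool)` and simply not render the open/closed badge when the result is `None`.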

To develop the schema, people from any space could propose RFCs with "data schemas" like the OSM tagging schemas. People could list such RFCs in the wiki. There is no process for RFCs being accepted or rejected, but people can adopt them if they like them.

Similarly, the directory would be established not by a central list, but by making your endpoint available at a certain location (the .well-known directory). Space discovery is done through interlinking or through unofficial lists.

This would be a completely open thing not controlled by any single party in any way. However, it would make it significantly harder to both provide an endpoint (because there is no "doing it right" because there's no single source of truth) and to consume the data (because you cannot assume anything about the data format). It might result in more robust clients that won't crash when invalid data is fed to them (although that hasn't been an issue so far).

Summary

So far that was mostly just brainstorming. But I think we should make a conscious choice on what we want to be. Decentralization is not a feature in itself, it's a way to achieve certain goals (like not having someone control the project and making the network more resilient). However, I'm not sure we have a use case that benefits from this type of decentralization. I tend to favor the "clearly defined, centralized, but transparent, open and forkable" model.

gidsi commented 4 years ago

> It means that different apps showing "spaces from the SpaceAPI" will show different spaces. It generates confusion.

I don't think the differences will be so big. I'm pretty sure that we will end up in a single big network, not in small different ones.

> Would we still encourage spaces to enter their data into our directory? If yes, I don't really see what we would win from the decentralization, because everyone will simply keep using our directory for querying spaces.

In the beginning yes, up until the point where we have a big cluster. Then we could think about getting rid of it and just provide the dynamic directory / API.

> I see why you saw a hierarchy in there, but that's not how it was meant. The federated directories would be on the same level, you could use any one of them as an entry point (as long as they are properly interlinked).

Can you provide an example of what it would look like? Are there multiple directories pointing to different endpoints and pointing to different directories?

> (I do see a problem with the multi-directory approach though. How should conflicts be resolved if the same space was entered into multiple directories with differing endpoints?)

I think that's solvable by pointing to the website / .well-known. You could use them internally to filter the spaces (either on the URL we're loading the SpaceAPI file from, or on the website; I was also thinking about having a unique ID per space, but so far I'm not sure about it) and call it good.

gidsi commented 4 years ago

> Thinking about it some more, I think we need to answer the fundamental question whether the SpaceAPI should be a clearly defined spec + directory, or whether it should all be a decentralized thing where truth is found by aggregating the information from the network and where "different truths" can coexist. [snip]

I would go with a different approach: right now I would decentralize the directory but keep the schema.

I would still keep the directory for now and just use the connectedTo field as a way to add more spaces since we can just check the files and add them automatically to the directory.json.

Once we've figured out that there is a solid network we might be able to ditch it, but I think that's a discussion for another time.

I think we shouldn't mix up the discussion about the directory and the schema.

The single source of truth would be there, the network is the source. Everything else wouldn't be the truth, more like a caching service.

I would not like to force people to crawl all endpoints, nor to implement in every single app a system against a key/value JSON file that is not standardized in any way. So my approach would still be the same: decentralize the endpoints, standardize the endpoints with the schema files, and provide the caching service with all spaces providing a valid SpaceAPI file.

dbrgn commented 4 years ago

With your proposal, on our website, would we ask people to add themselves to the SpaceAPI by creating a PR against our directory, or would we suggest to ask a few other spaces they may know to list them? If both, do you have a suggested wording that could be used to explain why that should be done?

When updating our directory by crawling, how would we deal with duplicate URLs for the same space? Let's say that Space A links to an old website of space Z while Space B links to the new (and correct) website of space Z. The crawler would then encounter conflicting information about the same space. Which URL would the directory crawler pick and how would that decision be made?

Would the connectedTo field contain the space names as well, or only the URL? Doing the latter could reduce problems with conflicting information. Also, if the names were contained, space renames could be difficult because the old name could get re-added.

> The single source of truth would be there, the network is the source.

You can still have conflicting information in the network :) That may not be a problem if we have good strategies on how to deal with those conflicts, but distributed/decentralized systems are fundamentally more complex than centralized ones so we need to think of all the ways it could go wrong / result in conflicts and think about how to handle them.

gidsi commented 4 years ago

> With your proposal, on our website, would we ask people to add themselves to the SpaceAPI by creating a PR against our directory, or would we suggest to ask a few other spaces they may know to list them? If both, do you have a suggested wording that could be used to explain why that should be done?

We would still ask them to create a PR until the network is implemented and big enough. When we feel ready, we would switch the communication. Until then, the connectedTo field would be more like advertisement for other places, and data for showing a network.

> When updating our directory by crawling, how would we deal with duplicate URLs for the same space? Let's say that Space A links to an old website of space Z while Space B links to the new (and correct) website of space Z. The crawler would then encounter conflicting information about the same space. Which URL would the directory crawler pick and how would that decision be made?

I would add both. The spaces should remove their old files, and files that are not reachable I would remove from the directory/caching service.
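
That policy (keep every candidate, drop whatever does not answer) is simple enough to sketch. The `is_reachable` callback is hypothetical; in practice it might be an HTTP GET that also validates the returned JSON.

```python
def refresh_cache(candidate_urls, is_reachable):
    """Keep every candidate that currently answers, drop the rest.

    is_reachable is a caller-supplied (hypothetical) check that returns
    True when the URL serves a usable endpoint file.
    """
    kept = []
    for url in dict.fromkeys(candidate_urls):  # de-duplicate, keep order
        if is_reachable(url):
            kept.append(url)
    return kept
```

Note that if both the old and the new website of a space still answer, both stay in the cache; the sketch relies on spaces taking their old files offline.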

> Would the connectedTo field contain the space names as well, or only the URL? Doing the latter could reduce problems with conflicting information. Also, if the names were contained, space renames could be difficult because the old name could get re-added.

URLs only, providing the name wouldn't make a lot of sense to me since they could be provided dynamically if needed.

> You can still have conflicting information in the network :) That may not be a problem if we have good strategies on how to deal with those conflicts, but distributed/decentralized systems are fundamentally more complex than centralized ones so we need to think of all the ways it could go wrong / result in conflicts and think about how to handle them.

Even if it's conflicting it would still be the truth; how your service interprets it is still an interpretation :) But yeah, which strategies we use there might change the outcome, but that would be something I would talk about once we have to deal with inconsistencies. But if you want to, we could also make an effort to find possible inconsistencies first and provide recommended strategies to deal with them.

dbrgn commented 4 years ago

The consensus of today's meeting was to spec a linked_to key as a schema extension. The key would also include optional semantics on what the relation means.

DougInAMug commented 1 year ago

https://murmurations.network may be of interest to people in this thread!