DACCS-Climate / DACCS-executive-committee

Activities of the Data Analytics for Canadian Climate Services (DACCS) executive committee

:raising_hand: Enable STAC catalog search across the network #23

Open mishaschwartz opened 4 months ago

mishaschwartz commented 4 months ago

Topic summary

I have some thoughts about the issue of making STAC catalogs aware of all other catalogs in the network.

We could make all catalogs copies of each other, but there are some issues with this:

We could centralize the catalog but that goes against the mission of the project and doesn’t actually solve most of the resource/search problems.

Let’s go back to the reason that we want the catalogs to be aware of each other: we want to be able to search across the whole network.

I think that we could achieve this on the client side instead. We have two main clients that we need to deal with: pystac and the STAC browser.

Pystac:

STAC browser:

By modifying the client side and leaving the stac catalogs themselves alone we can:

I don't know if there will be any interest in modifying the STAC browser, but we could make the case for it.

Also, if we're already planning on building a new search interface for STAC (in order to integrate the NLP search component), we could just plan to create our own STAC browser that supports multi-node search anyway.

fmigneault commented 4 months ago

I don't think STAC catalogs should be copies of each other. It is fine to refer to another data hosting location if some STAC entries seem relevant for a given node (e.g. shared use of common data for different studies), but duplicating them entirely is not useful. I think it is actually more useful to have catalogs specialize in different aspects, so search results are less noisy.

In other words, if UofT's STAC catalog points to data hosted on CRIM's STAC, that is fine. The STAC Collection/Item on UofT can be a copy (except the self and alternate links, maybe) of the one on CRIM, and CRIM manages the data hosting of the Assets they reference. The data does not need to be duplicated, nor do the permissions. If access to the data is protected, the STAC definition should employ https://github.com/stac-extensions/authentication with references for the data hosting location (i.e. provide the Magpie login endpoints and auth methods). Whether we help users sync between nodes (aka Magpie Network mode) is another discussion, but the authorization is still performed against the node hosting the data, regardless of where the STAC metadata is catalogued. That auth strategy could be facilitated using https://github.com/stac-utils/stac-asset.
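For illustration, here is roughly what a copied Item could look like with the authentication extension pointing back to the hosting node. The URLs, scheme name, and extension version are placeholders, and the field names follow my reading of the extension spec, so they should be double-checked against the published schema:

```python
# Hypothetical STAC Item (as a Python dict) catalogued on UofT but whose Asset stays
# hosted and protected on CRIM. Field names per the stac-extensions/authentication
# spec as I understand it; URLs and scheme details are made up for the example.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        # extension version to be confirmed against the published releases
        "https://stac-extensions.github.io/authentication/v1.1.0/schema.json"
    ],
    "id": "example-item",
    "geometry": None,
    "properties": {
        "datetime": "2020-01-01T00:00:00Z",
        "auth:schemes": {
            "magpie": {  # hypothetical scheme describing the hosting node's login
                "type": "openIdConnect",
                "openIdConnectUrl": "https://crim-node.example.org/auth/.well-known/openid-configuration",
            }
        },
    },
    "links": [],
    "assets": {
        "data": {
            "href": "https://crim-node.example.org/data/example.nc",
            "auth:refs": ["magpie"],  # this Asset requires the scheme defined above
        }
    },
}
```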

When doing a search with pystac_client, it yields the pystac.Item objects matching the search criteria. One can already combine the search results across multiple catalogs by passing the same query to distinct pystac_client instances targeting each desired STAC node. Which nodes are of interest for a search is somewhat a user preference. Maybe adding a tutorial that retrieves the STAC URLs for the various nodes from the node registry and pipes them into distinct pystac_client instances for a "network search" could be sufficient?
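Something along these lines could be enough for such a tutorial; the node URLs below are placeholders that would normally be retrieved from the node registry:

```python
# Minimal sketch of a "network search": dispatch the same query to several STAC APIs
# with pystac_client and combine the results. Node URLs are hypothetical.
from pystac_client import Client

NODE_STAC_URLS = [
    "https://node-a.example.org/stac",
    "https://node-b.example.org/stac",
]

def network_search(**search_kwargs):
    """Run the same search against every node and yield all matching pystac.Item objects."""
    for url in NODE_STAC_URLS:
        client = Client.open(url)
        yield from client.search(**search_kwargs).items()

# Example: the same bbox/datetime query sent to every node.
results = list(network_search(bbox=[-80, 40, -70, 50], datetime="2020-01-01/2020-12-31"))
```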

For the STAC browser aspect, I think it would be much easier to copy the STAC Collection/Item definitions from other catalogs instead of trying to support multi-catalog searches, even though the data itself would not be hosted on the node running that STAC browser. Integration of the STAC Item/Collection definitions could be a simple cron job running https://github.com/stac-utils/pgstac or another of the utilities in https://github.com/stac-utils for managing STAC databases.
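As a rough sketch of what such a job could do (pgstac/pypgstac could load the definitions into the database directly; here I go through the STAC Transactions endpoints instead, and all URLs are placeholders):

```python
# Periodic sync (e.g. triggered by cron): copy Collection/Item definitions from remote
# catalogs into the local STAC API. The Assets keep pointing at the remote hosting node,
# so only metadata is duplicated. A real job would also handle updates/conflicts (the
# plain POSTs below would fail on already-existing entries).
import requests
from pystac_client import Client

LOCAL_STAC_API = "https://local-node.example.org/stac"        # placeholder
REMOTE_STAC_APIS = ["https://remote-node.example.org/stac"]   # placeholder

def sync_remote_catalogs():
    for remote_url in REMOTE_STAC_APIS:
        remote = Client.open(remote_url)
        for collection in remote.get_collections():
            requests.post(f"{LOCAL_STAC_API}/collections", json=collection.to_dict(), timeout=30)
            for item in remote.search(collections=[collection.id]).items():
                requests.post(
                    f"{LOCAL_STAC_API}/collections/{collection.id}/items",
                    json=item.to_dict(),
                    timeout=30,
                )

if __name__ == "__main__":
    sync_remote_catalogs()
```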

mishaschwartz commented 4 months ago

@fmigneault

I think that we agree:

  1. STAC catalogs should not be copies of each other
  2. We can search with the pystac client over multiple catalogs

> For the STAC browser aspect, I think it would be much easier to copy the STAC Collection/Item definitions from other catalogs

This is the issue that I think we need to think about. Setting up a cron job is easy; that's not the problem. The problem is that copying Items and Collections from other catalogs is resource intensive (lots of network IO to access the STAC Item definitions hosted on all other nodes in the network), and then each node needs to store the STAC Items from every other node in its database.

> I think it is actually more useful to have catalogs specialize in different aspects, so search results are less noisy.

That makes sense. But we also want to allow users to search the entire network if they choose to. A user may not initially know what kind of data is hosted on which node, and we want to allow them to explore.

I'm also fine if an individual node's STAC browser only displays data from its own node. But we had discussed allowing users to search the entire network for data. We can do that with pystac, but I think that we should have some GUI somewhere that is a bit more user-friendly for non-technical users.

fmigneault commented 3 months ago

> This is the issue that I think we need to think about. Setting up a cron job is easy; that's not the problem. The problem is that copying Items and Collections from other catalogs is resource intensive (lots of network IO to access the STAC Item definitions hosted on all other nodes in the network), and then each node needs to store the STAC Items from every other node in its database.

How about a request hook that works the other way around? When a Collection/Item is posted on a given STAC API, it would send the information to another "global" STAC API that aggregates them all. That obviously requires each instance to implement it, but it is less resource intensive. We could use pre/post-request hooks like Weaver does, which are triggered for specific Magpie/Twitcher services: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/components/weaver/config/magpie/config.yml.template
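Roughly, the hook could look like the sketch below. The function signature, the way it gets wired into the Magpie/Twitcher hook configuration, and the aggregator URL are all assumptions for illustration:

```python
# Sketch of a post-request hook: after a successful write on the local STAC API,
# mirror the new Collection/Item to a hypothetical "global" aggregating STAC API.
import json
import requests

GLOBAL_STAC_API = "https://global-stac.example.org"  # hypothetical aggregator

def forward_to_global_stac(response):
    """Assumed to be called with the response of a POST/PUT on the local STAC API."""
    if response.status_code not in (200, 201):
        return response  # only mirror successful writes
    body = json.loads(response.text)
    if body.get("type") == "Collection":
        requests.post(f"{GLOBAL_STAC_API}/collections", json=body, timeout=30)
    elif body.get("type") == "Feature":  # a STAC Item
        collection_id = body.get("collection")
        requests.post(
            f"{GLOBAL_STAC_API}/collections/{collection_id}/items",
            json=body,
            timeout=30,
        )
    return response
```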

mishaschwartz commented 3 months ago

> When a Collection/Item is posted on a given STAC API, it would send the information to another "global" STAC API that aggregates them all. That obviously requires each instance to implement it, but it is less resource intensive.

We could do that. My worries with this are:

  1. this doesn't eliminate the need to search multiple STAC APIs; a user would still search the local one (for their node) and a global one.
  2. if we start implementing a global STAC API, then we create a single point of failure for the network.
  3. I'm not sure how we'd easily implement access controls for the global STAC API.

fmigneault commented 3 months ago

> this doesn't eliminate the need to search multiple STAC APIs; a user would still search the local one (for their node) and a global one.

On the contrary. If definitions were POST'd to a centralized STAC API/browser, there is no need to search multiple ones anymore. Once a Collection/Item of interest is found on the global STAC, the Assets/Links it defines are accessed directly on the STAC instance that hosts the data. It is not necessary to search again.

The downside is that we must rely either on the other instances to send the request, or on this centralized STAC crawling the other APIs periodically. Either approach has advantages and disadvantages, depending on whether we want more or less request traffic.

> if we start implementing a global STAC API, then we create a single point of failure for the network.

That is not true. A global STAC could be load-balanced with multiple replicas and instances. There are many ways to work around that.

We cannot rely on the nodes in the Marble federated network (in its current state) in the sense of "providing replicas" of the same data/services or "fallback endpoints", because each node fundamentally provides different services and datasets. If a node goes down, the rest of the network cannot compensate, since that node's unique data is not accessible anyway. There are only multiple points of failure at the moment.

At this time, I believe the concern is about finding ways to aggregate all the available information somehow (especially the information that differs between nodes). For example, ESGF uses Metagrid to accomplish this. For Marble, nothing is defined yet. This is not to be confused with node replication, which is a whole different concern, and which each individual Marble node could address using its own array of subnodes for reliability/replication.

> I'm not sure how we'd easily implement access controls for the global STAC API.

Very complicated indeed. However, I believe this lies more in the hands of a federated auth service than STAC. I would also limit how much the STAC endpoints should actually be protected. In the end, it is only the metadata that it provides. In a way, a global STAC does not even need to use the user's auth. It could have its own auth to look up data offered by other nodes. The important things to protect in STAC are actually the Assets, not the Collections/Items returned by the API. The Assets can have many more auth considerations than only the API/URI, such as interacting with an S3 bucket or other protected providers' auth (e.g. Copernicus). This is not even a concern for Magpie alone, global STAC or not.

mishaschwartz commented 3 months ago

Yes, something like Metagrid, which would allow us to search across the network, would be great!

I mentioned before:

> Also, if we're already planning on building a new search interface for STAC (in order to integrate the NLP search component), we could just plan to create our own STAC browser that supports multi-node search anyway.

so implementing some search interface like Metagrid, which we could add the NLP interface into later on, would be perfect.

I still think that having a centralized STAC API doesn't make sense for this project:

> A global STAC could be load-balanced with multiple replicas and instances. There are many ways to work around that.

Sure, we can always make replicas, but who is going to maintain them?

We don't have any centralized architecture because we don't want a critical part of the network to deteriorate once we run out of funding.

The network needs to be self-sustaining and that means that everything should work with just the nodes in place.

> I would also limit how much the STAC endpoints should actually be protected. In the end, it is only the metadata that it provides.

Yes, but there is a use case for protecting metadata as well, and I think it is likely that some nodes will want to be able to do so.

fmigneault commented 3 months ago

Yes, we could have custom strategies for NLP search, but STAC also has the q=... query parameter that is very common for text-based search on many websites. What runs behind the scenes for that query is left up to the implementer. I was figuring the NLP would plug nicely in there, so it doesn't strictly need to involve a custom/global interface/instance.
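To make that concrete, the q parameter is just an extra query string on the search endpoint (the URL below is a placeholder, and whatever implements the matching, plain text or NLP, is up to the node):

```python
# Minimal free-text search against a single STAC API using the `q` parameter.
# The endpoint URL is hypothetical.
import requests

response = requests.get(
    "https://node-a.example.org/stac/search",
    params={"q": "sea surface temperature", "limit": 10},
    timeout=30,
)
for item in response.json().get("features", []):
    print(item["id"])
```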

I would like a solution that doesn't rely on a global instance as well. However, this is the only option I can see as a tradeoff to dispatching search requests to all nodes each time. That doesn't mean the 2nd approach is bad, or that the 1st is better for that matter; I'm just listing potential solutions. It seems that whether we use a global STAC replicating remote Items/Collections, or some custom interface that queries all the STAC nodes, we have some kind of "central portal" no matter what. If a custom portal is planned for implementation regardless, then this could be the best choice. If not (or, as mentioned, to avoid a critical/central architecture), then the "quick" workaround is a global STAC that simply duplicates Items/Collections, since the "source" Items/Collections remain available on the respective nodes even if the central one goes down, and it doesn't involve a custom UI implementation to aggregate searches.

For either solution, I guess it would be the same organization maintaining it, whether it is a global STAC or a central Metagrid-like interface. More load-balanced instances could be added and maintained by many organizations, but I personally don't think we are at that point yet. The important aspect I want to highlight is that we must distinguish "network nodes" from "instance replicas". For the time being, I consider Hirondelle, PAVICS and RedOak to be "network nodes" (as it should be), but by no means replicas.

> Yes, but there is a use case for protecting metadata as well, and I think it is likely that some nodes will want to be able to do so.

I agree, but other nodes probably don't need to know about it if it is not public, unless there is already something in place to provide federated logic, or the user already has a login for that node.