NERC-CEH / dri_gridded_data

GNU General Public License v3.0

API access #24

Open mattjbr123 opened 2 days ago

mattjbr123 commented 2 days ago

One of the things that came out of the work package meeting on 22-10-2024 is that API access to data stored on the object storage is not explicitly included in the workflow diagram.

From my perspective, this arises because there is some confusion over what 'API access' means in this context. The plan is that the data on the object store will be publicly available from anywhere (firewalls local to the user notwithstanding), and that code will be provided on the data catalogue page, and/or via a link to an analysis platform hosting such code, that allows the user to essentially treat the entire dataset as if it were on their own local filesystem. Is this an API? If not, is something extra needed to make it one? And do we actually need that something extra? These are the questions that need clarifying, in my view.
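To make "treat the dataset as if it was on the local filesystem" a bit more concrete, here is a rough sketch (not the project's actual code) of the underlying idea: a file-like object that fetches only the bytes it is asked for, using HTTP Range requests, which S3-compatible stores support. The names `RemoteFile` and `http_range_fetcher` are made up for illustration, and any URL passed in would be a placeholder. In practice this role would more likely be filled by an existing library such as fsspec or s3fs rather than hand-written code.

```python
import urllib.request
from typing import Callable

# A fetcher takes a half-open byte range [start, end) and returns those bytes.
RangeFetcher = Callable[[int, int], bytes]


def http_range_fetcher(url: str) -> RangeFetcher:
    """Build a fetcher that reads byte ranges of `url` over HTTP.

    Assumes the server honours Range requests (S3-compatible stores do).
    """
    def fetch(start: int, end: int) -> bytes:
        # HTTP byte ranges are inclusive at both ends, hence `end - 1`.
        headers = {"Range": f"bytes={start}-{end - 1}"}
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request) as response:
            return response.read()
    return fetch


class RemoteFile:
    """Read-only, seekable, file-like view over a remote object (hypothetical helper)."""

    def __init__(self, fetch: RangeFetcher, size: int):
        self._fetch = fetch
        self._size = size  # total object size in bytes, assumed known up front
        self._pos = 0

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = 0) -> int:
        # whence follows the usual convention: 0 = start, 1 = current, 2 = end.
        if whence == 0:
            new_pos = offset
        elif whence == 1:
            new_pos = self._pos + offset
        elif whence == 2:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, n: int = -1) -> bytes:
        # Read up to n bytes from the current position; all remaining bytes if n < 0.
        if n is None or n < 0:
            n = max(self._size - self._pos, 0)
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data


if __name__ == "__main__":
    # Offline demonstration: an in-memory blob stands in for the object store.
    blob = b"0123456789abcdefghij"
    remote = RemoteFile(lambda start, end: blob[start:end], len(blob))
    remote.seek(10)
    print(remote.read(5))  # only bytes 10..14 are "transferred"
```

Against a real object, `RemoteFile(http_range_fetcher(url), size)` would behave the same way, with only the requested ranges crossing the network. Whether a thin client-side layer like this counts as "an API" is exactly the question raised above.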

mattjbr123 commented 9 hours ago

Some helpful thoughts from @fsamreen:

A few thoughts …

Option 1 - provide users with direct URLs to download/interact with the data stored in the S3 bucket. Pros - easier to set up, with faster access and potentially quicker downloads. Cons - direct access can expose sensitive data or bucket structures, and it is difficult to enforce controlled access, which might be needed for various reasons (security, data-movement cost, etc.).

Option 2 - build a REST API that acts as an intermediary: users interact with the API, which handles data requests, processes them, and retrieves the data from S3. Pros - more control over who can access what and how (we could even give controlled access to some datasets through authentication methods), and integration with other services would be secure. Cons - additional development work to implement the APIs, and we would have to run an API server (management overhead).
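The difference between the two options can be sketched in a few lines. This is only an illustration under assumed names: the bucket URL, dataset catalogue, and token table below are all hypothetical, and a real Option 2 service would sit behind a web framework and a proper authentication system rather than in-memory dictionaries.

```python
from typing import Optional, Tuple

# Hypothetical values, for illustration only.
BUCKET_BASE_URL = "https://object-store.example.org/my-bucket"

# Hypothetical catalogue: dataset name -> object key and visibility.
DATASETS = {
    "rainfall": {"key": "rainfall.zarr", "public": True},
    "restricted-obs": {"key": "restricted-obs.zarr", "public": False},
}

# Hypothetical token table: token -> datasets that token may read.
TOKENS = {
    "secret-token-123": {"restricted-obs"},
}


def direct_url(dataset: str) -> str:
    """Option 1: hand the user a URL straight into the bucket.

    Nothing here can check who is asking, and the URL reveals the bucket layout.
    """
    return f"{BUCKET_BASE_URL}/{DATASETS[dataset]['key']}"


def handle_request(dataset: str, token: Optional[str] = None) -> Tuple[int, str]:
    """Option 2: an intermediary decides whether the request may proceed.

    Returns an HTTP-style status code plus either an error message or the
    object key that the server itself would then read from the bucket.
    """
    entry = DATASETS.get(dataset)
    if entry is None:
        return 404, "unknown dataset"
    if not entry["public"]:
        if token is None:
            return 401, "authentication required"
        if dataset not in TOKENS.get(token, set()):
            return 403, "access denied"
    return 200, entry["key"]


if __name__ == "__main__":
    print(direct_url("rainfall"))
    print(handle_request("rainfall"))
    print(handle_request("restricted-obs"))
    print(handle_request("restricted-obs", token="secret-token-123"))
```

The extra control in Option 2 lives entirely in `handle_request`, which is also where the development and hosting overhead comes from: that function has to run on a server someone maintains.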

An important question here is 'Who are the users and how would they like to interact with the data?', followed by 'What is a sustainable, value-added solution without excessive cost, maintenance and unnecessary implementation overheads?' We might end up offering access through various methods, including direct access to S3 as well as APIs, or even through DataLabs.