NASA-IMPACT / veda-data


Convert CO2 Pilot Budget countries nc to COG. #19

Closed: slesaad closed this issue 9 months ago

slesaad commented 10 months ago

PI Objective

Description

JPL has requested that we add country-wise budget data to the CO2 budgets dataset.

I think it would be interesting to display the country-level data in addition to the gridded dataset. These data come in a separate netCDF (or CSV) file.

The file is located in the following location; we need to convert it to a COG so we can publish it to the catalog and then to the dashboard.

Here's the script used to convert the gridded budget datasets: https://github.com/US-GHG-Center/ghgc-docs/blob/main/cog_transformation_scripts/nasa_ceos_co2_flux_data_transformation.py. We could probably reuse it with some changes to get the transformation done.
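For reference, that kind of conversion boils down to something like the following minimal sketch (assuming the netCDF is gridded with lat/lon dimensions; file, variable, and dimension names are placeholders, not the actual dataset):

```python
# Minimal sketch of a gridded netCDF -> COG conversion (not the actual
# ghgc-docs script). File, variable, and dimension names are placeholders
# and would need to match the real dataset.
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)

ds = xr.open_dataset("co2_budget_gridded.nc")  # hypothetical filename

for var in ds.data_vars:
    da = ds[var]
    # rioxarray needs to know the spatial dims and CRS before it can write a raster
    da = da.rio.set_spatial_dims(x_dim="lon", y_dim="lat")
    da = da.rio.write_crs("EPSG:4326")
    # One COG per variable; a time dimension would come out as bands
    da.rio.to_raster(f"{var}.tif", driver="COG")
```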

Acceptance Criteria

jpolchlo commented 10 months ago

If I'm reading the data correctly, this is not a gridded file. How do we want to go about rasterizing the content of this file? Am I missing something?

Edit: I'll provide a little more context. These are tabular data indexed by year (in this case, 2015–2020) and by country (201 are included in this file). There are a variety of variables, some of which vary over year and country, and some over country alone. In order to create a COG from this file, we could rasterize onto a specific grid if we have good geometry for each country in the dataset. On the other hand, these data are ~430 KB in size, and the rasterized version will obviously be much larger.
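Roughly, the rasterization could look like the sketch below, assuming we can get country polygons (e.g. from Natural Earth) and join them to the table on a country code; the file names, column names, and join key here are all illustrative:

```python
# Rough sketch of rasterizing per-country values onto a regular grid,
# assuming we have country polygons to join against (e.g. Natural Earth).
# File names, column names, and the join key are all illustrative.
import geopandas as gpd
import numpy as np
import pandas as pd
import rasterio
from rasterio import features
from rasterio.transform import from_origin

countries = gpd.read_file("natural_earth_countries.shp")  # polygons with an ISO code
budgets = pd.read_csv("co2_budget_countries.csv")         # the tabular budget data
merged = countries.merge(budgets[budgets.year == 2020], on="iso_a3")

# 0.1 degree global grid
res = 0.1
transform = from_origin(-180, 90, res, res)
shape = (int(180 / res), int(360 / res))

raster = features.rasterize(
    ((geom, value) for geom, value in zip(merged.geometry, merged.budget)),
    out_shape=shape,
    transform=transform,
    fill=np.nan,
    dtype="float64",
)

with rasterio.open(
    "co2_budget_countries_2020.tif", "w", driver="GTiff",
    height=shape[0], width=shape[1], count=1,
    dtype="float64", crs="EPSG:4326", transform=transform, nodata=np.nan,
) as dst:
    dst.write(raster, 1)
# ...then convert the GeoTIFF to a COG, e.g. with rio-cogeo.
```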

If it still makes sense, I'll continue down the rasterization road, but we might want to think about whether we can simply work with these data as they are.

slesaad commented 10 months ago

Sorry, I didn't realize that, my bad. Maybe this can be served as vector data via https://github.com/NASA-IMPACT/veda-features-api.

@smohiudd @ranchodeluxe how hard is that?

ranchodeluxe commented 10 months ago

Unfortunately the Fire Team thinks it would be strange to have other collections in their instance, since they've been telling folks that all the data in that API is related to Fire Things ™️. But I agree with them.

IMHO the next best option here is to generalize the IaC of veda-features-api and the vector ingest in veda-data-airflow to handle multiple clients. More details below.

Generalize veda-features-api IaC (LOE: a few days to a week if unbothered)

Some folks might want to create another repo for this request. But that's not a sustainable pattern. I'd rather do this correctly in one repo. So the goal is to refactor the TF into reusable local modules so that we can apply it for different clients: think something along the lines of terraform apply -var="client=eis-fire||veda-general". We'd also refactor the CI/CD to handle deploys for different clients.

  1. Crosswalk what Saadiq did for the GHGC features API generalization and take the good parts.

  2. Break shared things in TF into local modules

  3. I wouldn't worry about including any observability and ADOT stuff for now (so toggle this to false)

  4. Change CI/CD to deploy for multiple clients

  5. All the changes above have to happen without breaking the existing site 😉

Generalize veda-data-airflow (LOE: a day)

  1. Basically we'd sniff the s3 URL and branch for clients based on where the data is coming from in this section (rough sketch after this list).

  2. Handle table prefixes for load_to_featuresdb function per client

  3. Handle table prefixes for alter_indices function per client
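As a very rough illustration of the branching described above (the bucket-to-client mapping and the function signatures here are invented; the real load_to_featuresdb and alter_indices live in veda-data-airflow and may look different):

```python
# Hypothetical sketch of the branching described above. The real
# load_to_featuresdb / alter_indices signatures in veda-data-airflow may
# differ, and the bucket-to-client mapping is invented for illustration.

CLIENT_BY_BUCKET = {
    "eis-fire-data": "eis_fire",
    "veda-data-store": "veda_general",
}

def sniff_client(s3_url: str) -> str:
    """Derive the client from the S3 URL the data is coming from."""
    bucket = s3_url.replace("s3://", "").split("/", 1)[0]
    return CLIENT_BY_BUCKET.get(bucket, "veda_general")

def table_name(base_table: str, client: str) -> str:
    """Prefix tables per client, e.g. eis_fire_<table> vs veda_general_<table>."""
    return f"{client}_{base_table}"

# In the DAG, something along these lines:
# client = sniff_client(s3_url)
# load_to_featuresdb(local_file, table_name(collection, client))
# alter_indices(table_name(collection, client))
```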

j08lue commented 10 months ago

What is the "client" here - where we would branch out into veda-general and eis-fire?

I think in general we should not host several instances for different groups, if we can avoid it. Perhaps tagging collections or including the originating programme in the metadata could fulfil the need of a given team to show only "their" datasets?

ranchodeluxe commented 10 months ago

What is the "client" here - where we would branch out into veda-general and eis-fire?

Yes, right now it would be something like veda-general and eis-fire. But I was thinking about GHGC too, which is currently a fork and doesn't really need to be if we implement what was talked about above. In summary, we're thinking about two patterns here: one is the "is an instance of VEDA" pattern, where reuse happens at the IaC level; the second is reuse of the underlying infrastructure behind the scenes, so that clients feel cozy about having their own "data" and "API" (regardless of how it's actually structured on the backend).

I think in general we should not host several instances for different groups, if we can avoid it. Perhaps tagging collections or including the originating programme in the metadata could fulfil the need of a given team to show only "their" datasets?

I like how you're thinking @j08lue. It sounds like you're talking about multi-tenancy if I squint 😉. I'm up for looking more into this and talking about how it would work. Currently tipg doesn't handle any of these concerns, so the problem to solve here is having multiple domains "reuse" the API but talk to different catalogs. And I do think we should investigate this. At some level there has to be a mapping from client (think: domain) to tag/filter. It might even be done at the schema level.

ranchodeluxe commented 10 months ago

Currently grokking the OGC spec for anything that deals with tenancy, but it seems like slim pickings over there. @jsignell mentioned looking at "nested collections" as a way to do this type of filtering. Seems like there might be something there, but nested collections don't respond well to LIMIT query params: https://docs.ogc.org/is/17-069r3/17-069r3.html (of course the OGC docs don't use anchors 🙄 b/c that would be too convenient).

ranchodeluxe commented 10 months ago

Here's an example of the first segmentation pattern -- reuse of the VPC, ALB, and DB via separate schemas using TIPG_DB_* OS environment variables. This wasn't hard:

Architecture:

(screenshot: architecture diagram)

Changeset:

Particular attention to: https://github.com/developmentseed/eoapi-k8s/pull/28/files#diff-adf02de401c721c0b928ecdbfb8ee018b21988bb64ed9cd0b42bfa3353b86838R24-R50

Examples:

http://vector11.wfs3labs.com/collections http://vector12.wfs3labs.com/collections

But this isn't quite good enough 😉. We can save more $$$ and get more reuse at the API service/pod level too if we do dynamic secrets/config lookups in the request/response lifecycle based on domain header information 👨‍🍳
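To make that concrete, here's a rough sketch of what a domain-based lookup could look like in a Starlette/FastAPI app like tipg; the host-to-schema mapping and how the schema gets consumed downstream are assumptions, not the actual implementation:

```python
# Rough sketch of domain-based lookup in the request/response lifecycle for a
# Starlette/FastAPI app such as tipg. The host-to-schema mapping and the way
# the schema is consumed downstream are assumptions for illustration only.
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

# Illustrative mapping of incoming domain -> postgres schema (one "catalog" per client)
SCHEMA_BY_HOST = {
    "vector1.wfs3labs.com": "client_one",
    "vector2.wfs3labs.com": "client_two",
}

class SchemaByDomainMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        host = request.headers.get("host", "").split(":")[0]
        # Stash the schema on request.state so downstream query code can use it
        request.state.db_schema = SCHEMA_BY_HOST.get(host, "public")
        return await call_next(request)

# app.add_middleware(SchemaByDomainMiddleware)
```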

ranchodeluxe commented 10 months ago

Second pass of reuse here is done ✨. Now multiple domains reuse the API service layer, with dynamic catalog lookups done agnostically per request/response lifecycle.

Architecture

(screenshot: architecture diagram)

Deployment Patterns for Upgrades and Scary Changes (use both architectures):

(screenshot: deployment patterns diagram)

Changelog:

Particular attention paid to this: https://github.com/developmentseed/eoapi-k8s/pull/28/files#diff-ae7380e05b96e0d9ce5e869f858c8e0d2fb96a6d2fac5eadec9693e2a6390c94R24-R25

https://github.com/developmentseed/tipg/compare/main...ranchodeluxe:tipg:feature/schema_middlewares?expand=1

Examples:

http://vector1.wfs3labs.com/collections http://vector2.wfs3labs.com/collections

Let me finish this to a degree and then put up some architecture diagrams.

j08lue commented 9 months ago

Let us please move the latest discussion out to a dedicated ticket, to give it better visibility / tracking, @ranchodeluxe