Closed: slesaad closed this issue 9 months ago.
If I'm reading the data correctly, this is not a gridded file. How do we want to go about rasterizing the content of this file? Am I missing something?
Edit: I'll provide a little more context. These are tabular data indexed by year (in this case, 2015–2020) and by country (201 are included in this file). There are a variety of variables, some of which vary by year and country, and some by country alone. In order to create a COG of this file, we can rasterize over a specific grid if we have good geometry for each country in the dataset. On the other hand, these data are ~430 kB in size, and the rasterized version will obviously be much larger.
If it still makes sense, I'll continue down the rasterization road, but we might want to think about how to simply work with these data as they are?
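For illustration, rasterizing country-indexed tabular data boils down to painting per-country values onto a grid of country identifiers (the identifier grid itself would come from rasterizing the country geometries). A minimal sketch in plain Python, with all names and values invented for the example:

```python
# Hedged sketch: given a 2-D grid of country identifiers and the tabular
# per-country values for one year/variable, "paint" each cell with its
# country's value. Everything here is illustrative, not the real dataset.

NODATA = -9999.0

def paint_grid(country_grid, values_by_country, nodata=NODATA):
    """Replace each country code in the 2-D grid with that country's value."""
    return [
        [values_by_country.get(code, nodata) for code in row]
        for row in country_grid
    ]

# Tiny worked example: 0 = ocean/no country, 1 and 2 are country codes.
grid = [
    [0, 1, 1],
    [2, 2, 1],
]
budgets_2020 = {1: 3.5, 2: 1.2}  # hypothetical per-country budgets for 2020
print(paint_grid(grid, budgets_2020))
# [[-9999.0, 3.5, 3.5], [1.2, 1.2, 3.5]]
```

Note that every (year, variable) pair would produce one band or file this way, which is where the size blow-up relative to the ~430 kB table comes from.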
Sorry, I didn't realize that, my bad. Maybe this can be vector data served via https://github.com/NASA-IMPACT/veda-features-api.
@smohiudd @ranchodeluxe how hard is that?
Unfortunately the Fire Team thinks it would be strange to have other collections in their instance since they've been telling folks all the data in that API is related to Fire Things ™️. But I agree with them
IMHO the next best option here is to generalize the IaC of veda-features-api and the vector ingest in veda-data-airflow to handle multiple clients. More details below
Some folks might want to create another repo for this request. But that's not a sustainable pattern. I'd rather do this correctly in one repo. So the goal is to refactor the TF into reusable local modules so that we can apply it for different clients: think something along the lines of `terraform apply -var="client=eis-fire||veda-general"`. We'd also refactor the CI/CD to handle deploys for different clients.
- Crosswalk what Saadiq did for the GHGC features API generalization and take the good parts
- Break shared things in TF into local modules
- I wouldn't worry about including any observability and ADOT stuff for now (so toggle this to `false`)
- Change CI/CD to deploy for multiple clients
- All the changes above have to happen without breaking the existing site 😉
Basically we'd sniff the S3 URL and branch for clients based on where the data is coming from in this section.

- Handle table prefixes for the `load_to_featuresdb` function per client
- Handle table prefixes for the `alter_indices` function per client
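The per-client table prefixing could look something like the following sketch; the prefix map and naming scheme are assumptions for illustration, not the actual veda-data-airflow code:

```python
# Hedged sketch of per-client table prefixing for the ingest functions
# mentioned above (load_to_featuresdb / alter_indices). The client names
# and prefixes are illustrative assumptions.

CLIENT_PREFIXES = {
    "veda-general": "veda",
    "eis-fire": "eis_fire",
}

def prefixed_table(client: str, table: str) -> str:
    """Return the client-scoped table name a loader would write to."""
    try:
        prefix = CLIENT_PREFIXES[client]
    except KeyError:
        raise ValueError(f"unknown client: {client!r}")
    return f"{prefix}_{table}"

print(prefixed_table("eis-fire", "perimeters"))    # eis_fire_perimeters
print(prefixed_table("veda-general", "features"))  # veda_features
```

Both ingest functions would call the same helper, so adding a client is a one-line change to the map rather than a new repo.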
What is the "client" here - where we would branch out into `veda-general` and `eis-fire`?
I think in general we should not host several instances for different groups, if we can avoid it. Perhaps tagging collections or including the originating programme in the metadata could fulfil the need of a given team to show only "their" datasets?
> What is the "client" here - where we would branch out into `veda-general` and `eis-fire`?
Yes, right now it would be something like `veda-general` and `eis-fire`. But I was thinking about GHGC too, which is currently a fork and doesn't really need to be if we implement what was talked about above. In summary, we're thinking about two patterns here. One is about the "is instance-of" VEDA, where reuse is at the IaC level, and the second is about reuse behind the scenes of various infrastructure so that clients feel cozy about having their own "data" and "API" (regardless of how it's actually structured on the backend).
> I think in general we should not host several instances for different groups, if we can avoid it. Perhaps tagging collections or including the originating programme in the metadata could fulfil the need of a given team to show only "their" datasets?
I like how you're thinking @j08lue. It sounds like you're talking about multi-tenancy if I squint 😉. I'm up for looking more into this and talking about how this would work. Currently `tipg` doesn't handle any of these concerns. So the problem to solve here is about having multiple domains "reuse" the API but talk to different catalogs. And I do think we should investigate this. At some level there has to be a client (think domain) to tag/filter mapping. It might even be done at the schema level.
Currently grokking the OGC spec for anything that deals with tenancy, but it seems slim pickings over there. @jsignell mentioned looking at "nested collections" as a way to do this type of filtering. Seems like there might be something there, but nested collections don't respond well to `LIMIT` query params: https://docs.ogc.org/is/17-069r3/17-069r3.html (of course OGC docs don't use anchors 🙄 b/c that would be too convenient)
Here's an example of the first segmentation pattern: reuse of the VPC, ALB, and DB via separate schemas using `TIPG_DB_*` OS environment variables. This wasn't hard:
Particular attention to: https://github.com/developmentseed/eoapi-k8s/pull/28/files#diff-adf02de401c721c0b928ecdbfb8ee018b21988bb64ed9cd0b42bfa3353b86838R24-R50
http://vector11.wfs3labs.com/collections http://vector12.wfs3labs.com/collections
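As a rough sketch of this first pattern, each deployment's environment points the API at a different schema; the client names and schema values below are illustrative, with only the `TIPG_DB_SCHEMAS` variable name taken from the `TIPG_DB_*` settings mentioned above:

```python
# Hedged sketch: per-client deployments sharing one database, separated
# by schema via environment configuration. Clients/schemas are invented.

deployments = {
    "eis-fire":     {"TIPG_DB_SCHEMAS": "eis_fire"},
    "veda-general": {"TIPG_DB_SCHEMAS": "veda_general"},
}

def schemas(env: dict) -> list:
    """Parse the comma-separated schema list, defaulting to 'public'."""
    raw = env.get("TIPG_DB_SCHEMAS", "public")
    return [s.strip() for s in raw.split(",") if s.strip()]

for client, env in deployments.items():
    print(client, "->", schemas(env))
# eis-fire -> ['eis_fire']
# veda-general -> ['veda_general']
```

Each deployment is a separate service instance, so the DB is shared but the API pods are not — which is what the next comment improves on.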
But this isn't quite good enough 😉 We can save more $$$ and get more reuse at the API service/pod level too if we do dynamic secrets/config lookups in the request/response lifecycle based on domain header information 👨‍🍳
Second pass of reuse here is done ✨ Now multiple domains will reuse the API service layer and do dynamic catalog lookups per request/response lifecycle, agnostic of the client.
Particular attention paid to this: https://github.com/developmentseed/eoapi-k8s/pull/28/files#diff-ae7380e05b96e0d9ce5e869f858c8e0d2fb96a6d2fac5eadec9693e2a6390c94R24-R25
http://vector1.wfs3labs.com/collections http://vector2.wfs3labs.com/collections
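The dynamic per-request lookup described above could be sketched roughly like this; the domain-to-schema mapping and function shape are illustrative assumptions, not the eoapi-k8s implementation:

```python
# Hedged sketch of the second pattern: a single API service resolves the
# target catalog/schema per request from the Host header. The mapping
# below is invented for illustration.

DOMAIN_TO_SCHEMA = {
    "vector1.wfs3labs.com": "client_one",
    "vector2.wfs3labs.com": "client_two",
}

def resolve_schema(headers: dict, default: str = "public") -> str:
    """Pick the catalog/schema for this request from the Host header."""
    host = headers.get("host", "").split(":")[0]  # drop any port suffix
    return DOMAIN_TO_SCHEMA.get(host, default)

print(resolve_schema({"host": "vector1.wfs3labs.com"}))       # client_one
print(resolve_schema({"host": "vector2.wfs3labs.com:8080"}))  # client_two
print(resolve_schema({"host": "unknown.example.com"}))        # public
```

A real middleware would do this lookup once per request and hand the schema (or a per-client connection/secret) to the downstream handlers.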
Let me finish this to a degree and then put up some architecture diagrams
Let us please move the latest discussion out to a dedicated ticket, to give it better visibility / tracking, @ranchodeluxe
PI Objective
Description
JPL has requested that we add country-wise budget data to the CO2 budgets dataset.
The file is located in the following location; we need to convert it to a COG to be able to publish it to the catalog and then to the dashboard.
Here's the script used to convert gridded budget datasets: https://github.com/US-GHG-Center/ghgc-docs/blob/main/cog_transformation_scripts/nasa_ceos_co2_flux_data_transformation.py; we could probably reuse it, with some changes, to get the transformation done.
Acceptance Criteria