To do: Datamodel extension wasn't working properly from the very start. We still have to get it to work.
@dchandan
Not sure if you were already aware of this, but there is a STAC Extension for CF attributes: https://github.com/stac-extensions/cf. I believe this is the notation that `CFJsonItem` should use; it is not specific to the CMIP6 or Datacube extensions, but applies to any CF attribute used in NetCDF.
All code in `implementations` should be moved under `STACpopulator` so that it can be imported externally. Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb). For this notebook, I am particularly interested in avoiding duplicating the logic in `CMIP6populator.create_stac_item`.
@dchandan I realize based on your comments that I may not have communicated properly that the CMIP6 extension was meant to replace the Datamodel logic, not duplicate it. I just had not yet removed the logic in `add_CMIP6.py` while I was still experimenting with extensions.
So extensions might be a bad idea, but not because they replicate existing code (points 2.1 and 2.4).
Also, I did not have to define a schema to get it to work, so I am not sure this is an additional complication. If needed, pydantic can spit out the schema for us.
Basically, embedding the CMIP properties in an extension is just a way to formalize how we're adding CMIP properties into the STAC item, so that we're using the same mechanism to "extend" STAC items (datacube, cmip6, and eventually maybe other extensions).
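For instance, a minimal sketch of generating that schema with a hypothetical model (not the actual CMIP6 property definitions):

```python
import json
from pydantic import BaseModel, Field

# Hypothetical stand-in for the real CMIP6 properties model.
class CMIP6Props(BaseModel):
    activity_id: str = Field(serialization_alias="cmip6:activity_id")
    experiment_id: str = Field(serialization_alias="cmip6:experiment_id")

# JSON schema that could be published and referenced by the extension's SCHEMA_URI.
print(json.dumps(CMIP6Props.model_json_schema(), indent=2))
```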
@huard Thanks for the clarification!
I think that the schema URI is required. In your code it was specified in `SCHEMA_URI` and used later by the extension via the class method `get_schema_uri`. Since that specific URI does not work (I get a 404), I am inclined to think that no validation is performed using the schema, or even a check that the schema exists, so in that sense a working schema is perhaps not required. But I think that if one is using extensions, then one should have a properly constructed schema.
You're right that Pydantic's ability to spit out a schema comes in handy for this task, thereby reducing the barrier. But if one has to create a pydantic model anyway to generate a schema, then I don't see why not just use that model directly, rather than create another extension and then call on the pydantic model to generate the schema for use in the extension. At least in this particular case I don't see the point, since the extension would not be doing anything more logically sophisticated than what a pydantic model does. Maybe in some other case, for a different type of data, one might want an extension to incorporate additional logic. But that decision can, and in my opinion should, be made case by case.
(As a side note, it is my opinion that the way PySTAC has implemented STAC extensions is very non-pythonic which makes working with extensions a very unsatisfactory experience.)
I'm getting failures regarding the datacube extension.
Yeah, that extension still has to be debugged. It was not working for me from the start, and it is likely that a couple of things broke during the rearrangement.
I tend to agree with your assessment, but being unfamiliar with STAC, I thought it was more prudent to architect stuff using existing pathways. In practice, maybe it doesn't change anything, I'm just not confident I know the answer to this question.
In any case, there are a couple of points that could be streamlined in a future PR.
- Setting the `serialization_alias` for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item. A sketch of that idea follows this list.
- Serializing numpy types. I suspect this could probably be done somewhere in xncml.
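For illustration, a minimal sketch of that key-prefixing approach (pydantic v2 assumed; the model and helper names are hypothetical):

```python
from typing import Any
from pydantic import BaseModel

class CMIP6Props(BaseModel):
    # No per-field serialization_alias needed.
    activity_id: str
    experiment_id: str

def with_prefix(props: BaseModel, prefix: str = "cmip6:") -> dict[str, Any]:
    """Export the model, then prefix every key before ingestion into the STAC item."""
    return {f"{prefix}{key}": value for key, value in props.model_dump().items()}

props = CMIP6Props(activity_id="ScenarioMIP", experiment_id="ssp245")
print(with_prefix(props))
# {'cmip6:activity_id': 'ScenarioMIP', 'cmip6:experiment_id': 'ssp245'}
```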
I agree.
> Setting the `serialization_alias` for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item.
I have pushed a change with a new way in which the prefix is applied. Check it out and let me know your thoughts.
Looks good!
All `dict` or `list` typings should use more specific types, with `dict[<key>, <items>]` and `list[<item>]` notation.
All changes done, except in the `implementations/CMIP6-UofT/extensions.py` file, which can be handled in another PR related to making the Datacube extension work.
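For illustration, the kind of annotation change requested (hypothetical function, not from the codebase):

```python
from typing import Any

# Before: bare containers say nothing about their contents.
def collect_attrs_untyped(names: list) -> dict:
    return {name: None for name in names}

# After: parameterized containers document keys and items.
def collect_attrs(names: list[str]) -> dict[str, Any]:
    return {name: None for name in names}
```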
> All code in `implementations` should be moved under `STACpopulator` so that it can be imported externally. Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb). For this notebook, I am particularly interested in avoiding duplicating the logic in `CMIP6populator.create_stac_item`.
Done
Please take a look at https://github.com/crim-ca/stac-populator/pull/23/files
Discussions, back and forth interactions and tests with Nazim led to the creation of two links, one for collections and one for items.
> Please take a look at https://github.com/crim-ca/stac-populator/pull/23/files
> Discussions, back and forth interactions and tests with Nazim led to the creation of two links, one for collections and one for items.
Thank you! I had forgotten about this other PR. I'll incorporate changes from there into here.
I agree with Nazim, but I suggest deployment and auth be split into different PRs.
I agree, it was meant as a next steps comment following the merge of this PR.
I agree that discussion on authentication should be taken up in a separate PR. But Nazim raises a timely issue that we'll have to discuss together very soon. One approach that we at Toronto have discussed is to block all requests at the `/stac` endpoint except for `GET` at the proxy level, to prevent anyone unauthorized from editing the catalog, and to direct `stac-populator` to access the STAC App on an internal hostname/port that is not accessible from the outside network. This approach would remove the need to tie `stac-populator` to `magpie` and would not clutter the code with authentication-related logic. But we haven't thought it through to see if this is the best way to proceed, so it's just a prelim proposal. I can open an issue later today where we can discuss authentication.
> to access the STAC App on an internal hostname/port that is not accessible from the outside network.
For this PR https://github.com/bird-house/birdhouse-deploy/pull/386 to work, the request needs to pass through the proxy; without that, the hook won't work. But yes, we can discuss it later.
I think it would be best if we don't need to pass `$(IMP_DIR)/CMIP6-UofT/CMIP6.yml` in the python call. Since `add_CMIP6.py` is already in the same directory, it should just use the `.yml` in its directory.
Done :)
If no one has an objection, I will merge the PR.
> If no one has an objection, I will merge the PR.
Give me a few minutes. I'm testing the new classes in my notebook.
@dchandan I think the `implementations` subdirs are missing `__init__.py` files. They are not included in the site-packages dirs after installing it from this branch.
> @dchandan I think the `implementations` subdirs are missing `__init__.py` files. They are not included in the site-packages dirs after installing it from this branch.
Right.... Done.
Needs one in `implementations` itself as well.
Should also exclude `tests` from the install, either by explicitly listing the exclusion in pyproject or by moving it up outside `STACPopulator`. A sketch of the first option is below.
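For illustration, a minimal sketch of the explicit exclusion, assuming a setuptools-backed build (with pyproject.toml-only metadata, the same exclude list goes under setuptools' package-discovery table):

```python
# Hypothetical setup.py equivalent; keeps test packages out of site-packages.
from setuptools import find_packages, setup

setup(
    name="STACpopulator",
    packages=find_packages(exclude=["tests", "tests.*"]),
)
```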
I don't think Python package / setuptools resolution likes the names with hyphens and mixed lower/upper characters under `implementations`.
This line: https://github.com/crim-ca/stac-populator/blob/c62fb801439b2d4900550e157df431984122654e/STACpopulator/implementations/CMIP6-UofT/add_CMIP6.py#L9 causes:
```
Cell In[11], line 13
     10 import xncml
     11 from pydantic.networks import Url
---> 13 from STACpopulator.implementations.CMIP6_UofT.add_CMIP6 import CMIP6populator

File ~/dev/daccs/ncml2stac/src/stacpopulator/STACpopulator/implementations/CMIP6_UofT/add_CMIP6.py:9
      7 import pyessv
      8 from colorlog import ColoredFormatter
----> 9 from extensions import DataCubeHelper
     10 from pydantic import AnyHttpUrl, ConfigDict, Field, FieldValidationInfo, field_validator
     12 from STACpopulator import STACpopulatorBase

ModuleNotFoundError: No module named 'extensions'
```
When installed, the extension is looked up in the root `STACpopulator` instead of nested under the specific implementation. It must be replaced by `from STACpopulator.implementations.CMIP6_UofT.extensions import DataCubeHelper`.
I renamed `CMIP6-UofT` -> `CMIP6_UofT` to make other imports work.
Thanks for catching these @fmigneault! Hopefully all those issues are fixed now. Please confirm.
I've been trying to make the notebook work with the latest changes. So far, no luck.
I basically need to hack my way around the `CMIP6populator`, `THREDDSLoader` and the underlying `TDSCatalog` implementations, since I'm trying to only convert a single NCML, not crawl the entire catalog. All of those automatically perform a lot of unnecessary `__iter__` calls right from the moment `__init__` is done. The way that NCML data is extracted from the XML response, converted to a `siphon.catalog.Dataset`, and passed down from `THREDDSLoader.extract_metadata` to `CMIP6populator.create_stac_item` is too tightly coupled. Although they are distinct classes, there is no easy way to swap in loader implementations that attempt anything different.
The original implementation from the previous branch, although it replicated a few functions to shuffle some metadata, was much less convoluted.
The `STACpopulatorBase` class also does some checks of YAML configs that are also irrelevant for my use cases. I have to mock files just to avoid `os.path.isfile` checks, or rewrite `CMIP6populator.__init__` entirely.
Still not working attempt: https://github.com/crim-ca/ncml2stac/commit/9da080bbe912dc54654e5bab9d2b29880f8d027b
> I'm trying to only convert a single NCML, not crawl the entire catalog.
I wonder if you are trying to do something that this piece of code was never meant to do.... I am usually not in favour of tightly coupled apps myself; I prefer to create frameworks that allow one to write solutions for different use cases. I think this does that for the purpose of creating STAC catalogs as it facilitates the main aspects of that workflow: (i) iterating over some input, (ii) producing an appropriate representation of STAC, and (iii) posting that representation to a server. Processing a single file was never the purpose, though I am absolutely not opposed to exploring if this additional use case can be added easily without much reworking (something I do not want to do at this point).
I am curious which branch you found more helpful in this regard. I don't think this structure has changed in a meaningful way through the development.
> The `STACpopulatorBase` class also does some checks of YAML configs that are also irrelevant for my use cases. I have to mock files just to avoid `os.path.isfile` checks, or rewrite `CMIP6populator.__init__` entirely.
Right, but again, this is there to serve a need directly associated with the purpose of this code. There needed to be a reasonable way to pass the various pieces of information about a new collection/catalog to the code. The hardcoded config file helps alleviate the requirement to pass that information from the command line, and was suggested by Nazim above. And I think it helps.
Using branch https://github.com/Ouranosinc/stac-populator/tree/collection_link, the steps were very simple. I left the relevant code for that branch commented in the notebook for comparison.
The steps used to generate the STAC Item were simply:
```python
ds = xncml.Dataset(input_ncml)
attrs = ds.to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
attrs["id"] = stac_item_id
stac_item = CFJsonItemNetCDF(stac_item_id, attrs, cmip6.Properties)
DatacubeExt(stac_item)
c_item_data = stac_item.item.to_dict()
```
`CFJsonItemNetCDF` essentially only added the `rel: source` link on top of what `CFJsonItem` already did.

The only way this could have been improved even more was if `CMIP6populator.create_stac_item` (or a similar method) did all these steps directly instead of starting with `input_ncml_data` as argument. Basically, I would like to have `xncml.Dataset.to_cf_dict()` abstracted away from the notebook as well, but the worst case now would be to do:
```python
ds = xncml.Dataset(input_ncml)
input_ncml_data = ds.to_cf_dict()
CMIP6populator.create_stac_item("name?", input_ncml_data)
```
Basically, what would be needed is to expose methods that allow calling the following directly:

- `THREDDSLoader.__iter__`
- `THREDDSLoader.extract_metadata`
- `CMIP6populator.ingest`
- `CMIP6populator.create_stac_item`

But right now, the intermediate objects needed for each step are obtained from the `__iter__` yields of the underlying catalog parsers.
To patch the objects and skip the `__iter__` calls that assign each object's attributes, one basically has to reimplement the whole processing chain.
@dchandan I now have a somewhat working version: https://github.com/crim-ca/ncml2stac/compare/main...5b3a3baabb41901b47b53d3e80e30479816ecb9c I've added comments that hopefully illustrate the kind of overrides I am referring to.
I see that `stac_extensions` is still empty. Also, `DatacubeExt` is not applied anymore (no `cube:variables` and `cube:dimensions` properties). See the last cell output diff.
Exception raised by datacube:
```
WARNING: [STACpopulator.implementations.CMIP6_UofT.add_CMIP6] Failed to add Datacube extension to item sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc
Traceback (most recent call last):
  File "/home/francis/dev/daccs/ncml2stac/src/stacpopulator/STACpopulator/implementations/CMIP6_UofT/add_CMIP6.py", line 157, in create_stac_item
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
NameError: name 'DatacubeExtension' is not defined
```
Following are some changes that would greatly simplify the definition of a custom populator without hackish overrides:
1. Allow derived populators to pass down arguments to their `data_loader`. For example, `depth` would be useful here. It could simply be a `**data_loader_kwargs` to let implementations evolve easily with more parameters. (A sketch covering these suggestions follows the list.)
https://github.com/crim-ca/stac-populator/blob/10db1281b46298828727293159e7a5a52e71cb89/STACpopulator/implementations/CMIP6_UofT/add_CMIP6.py#L125-L135
2. Move those lines into a separate method, something like `get_collection_metadata`. This way, an implementation that wishes to return the `title`, `description`, etc. without a file could do so easily by overriding it.
https://github.com/crim-ca/stac-populator/blob/10db1281b46298828727293159e7a5a52e71cb89/STACpopulator/populator_base.py#L47-L64
3. Do not perform these steps in `__init__`; do them at the start of `ingest`. This way, a populator implementation that does not need these definitions (or wants alternative ones) can override the relevant operations and have them called only when needed, not during object instantiation.
https://github.com/crim-ca/stac-populator/blob/10db1281b46298828727293159e7a5a52e71cb89/STACpopulator/populator_base.py#L66-L69
4. Require that `GenericLoader` and its derived classes implement `__getitem__`, not only `__iter__`, to obtain a specific item by `item_name` directly. In `THREDDSLoader`, that would be equivalent to doing `self.catalog_head.datasets.get()`. A fancier implementation could also accept a list of `item_name` values as argument, which would call `self.catalog_head.datasets.get()` iteratively, going deeper using:
https://github.com/crim-ca/stac-populator/blob/10db1281b46298828727293159e7a5a52e71cb89/STACpopulator/input.py#L84-L88
... but instead of iterating on yielded items, it would access them directly. That would make a huge difference if there are indeed thousands of NetCDF files, as you mentioned in the STAC hook PR.
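A minimal sketch of what suggestions 1, 3, and 4 might look like, assuming nothing about the current internals (all method bodies and the `loader_cls` wiring are hypothetical):

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator

class GenericLoader(ABC):
    """Loaders would expose random access, not only iteration."""

    @abstractmethod
    def __iter__(self) -> Iterator[tuple[str, dict[str, Any]]]:
        """Crawl the source, yielding (item_name, metadata) pairs."""

    @abstractmethod
    def __getitem__(self, item_name: str) -> dict[str, Any]:
        """Fetch a single item's metadata directly, without a full crawl."""

class PopulatorBase(ABC):
    def __init__(self, stac_host: str, loader_cls: type[GenericLoader], **data_loader_kwargs: Any) -> None:
        # Suggestion 1: extra kwargs (e.g. depth=1) flow straight to the loader.
        self.stac_host = stac_host
        self._loader = loader_cls(**data_loader_kwargs)
        # Suggestion 3: no YAML/config checks here.

    @abstractmethod
    def load_config(self) -> None:
        """Validate collection metadata (overridable, so file-less implementations are possible)."""

    @abstractmethod
    def create_stac_item(self, item_name: str, metadata: dict[str, Any]) -> dict[str, Any]:
        ...

    def ingest(self) -> None:
        # Suggestion 3: config checks deferred to ingestion time.
        self.load_config()
        for item_name, metadata in self._loader:
            self.create_stac_item(item_name, metadata)
```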
Assuming those are implemented, the sample notebook could simply do the following:
```python
ncml_url = "https://svc.com/thredds/ncml/some/nested/netcdf.nc"
ncml_name = os.path.split(ncml_url)[-1]
catalog_url = "https://svc.com/thredds/catalog/some/nested/catalog.xml"
stac_host = "https://svc.com/stac"

cmip6_pop = CMIP6Populator(stac_host, catalog_url, depth=1)
ncml_data = cmip6_pop.get(ncml_name)
stac_item = cmip6_pop.create_stac_item(ncml_name, ncml_data)
```
That would make a very clean solution, and would make the populator flexible for collection vs single-item handling. Note that single-item handling is necessary, for example, to update only specific items that we know have problematic metadata, without having to regenerate the whole catalog.
> The steps used to generate the STAC Item were simply:
>
> ```python
> ds = xncml.Dataset(input_ncml)
> attrs = ds.to_cf_dict()
> stac_item_id = make_cmip6_item_id(attrs["attributes"])
> attrs["id"] = stac_item_id
> stac_item = CFJsonItemNetCDF(stac_item_id, attrs, cmip6.Properties)
> DatacubeExt(stac_item)
> c_item_data = stac_item.item.to_dict()
> ```
Well, you can still do something nearly identical:

```python
ds = xncml.Dataset(input_ncml)
attrs = ds.to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
attrs["id"] = stac_item_id
item = STAC_item_from_metadata(stac_item_id, attrs, CMIP6ItemProperties, GeoJSONPolygon)
# some appropriate Datacube stuff
```
> @dchandan I now have a somewhat working version: crim-ca/ncml2stac@main...5b3a3ba I've added comments that hopefully illustrate the kind of overrides I am referring to.
>
> I see that `stac_extensions` is still empty. Also, `DatacubeExt` is not applied anymore (no `cube:variables` and `cube:dimensions` properties). See the last cell output diff. Exception raised by datacube:
>
> ```
> WARNING: [STACpopulator.implementations.CMIP6_UofT.add_CMIP6] Failed to add Datacube extension to item sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc
> Traceback (most recent call last):
>   File "/home/francis/dev/daccs/ncml2stac/src/stacpopulator/STACpopulator/implementations/CMIP6_UofT/add_CMIP6.py", line 157, in create_stac_item
>     dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
> NameError: name 'DatacubeExtension' is not defined
> ```
Yes, this is a work in progress. It was mentioned earlier in the PR.
> Well, you can still do something nearly identical:
Not exactly. The links are not generated the same way. This is why I was waiting on this implementation. See how the service names are defined in the diff of notebook output.
Also, I would rather have all the parsing logic in the `stac-populator`; this way we can be sure the result is identical on both ends. It is preferable to have one reference implementation for converting NCML to STAC. The notebook is only there to demonstrate how to make a deployable CWL for Weaver directly from a notebook on GitHub. It would preferably not have any custom logic related to the NCML-to-STAC conversion; this is just the use case I happened to pick to demonstrate the feature.
> Also, I would rather have all the parsing logic in the `stac-populator`; this way we can be sure the result is identical on both ends.
I don't know what "both ends" means. I don't follow your use case. The NCML-to-STAC conversion is a workflow that happens on the server side, initiated by the admin to populate the STAC database. The code here does that job properly (or are you saying that there is a deficiency in the STAC representation it creates? That would be useful to know).
We can chat about what exactly you are trying to do tomorrow over a zoom call.
I think I've figured out what's going on. The issue is with the idiosyncrasies of how the THREDDS server returns metadata based on how the metadata is queried. We can discuss this on the call today.
@fmigneault I've made changes to allow the kinds of workflow we discussed last Friday. Please check it out. The creation of a single STAC item can be tested using `tests/test_standalone_stac_item.py`.
You had commented earlier that:
> The steps used to generate the STAC Item were simply:
>
> ```python
> ds = xncml.Dataset(input_ncml)
> attrs = ds.to_cf_dict()
> stac_item_id = make_cmip6_item_id(attrs["attributes"])
> attrs["id"] = stac_item_id
> stac_item = CFJsonItemNetCDF(stac_item_id, attrs, cmip6.Properties)
> ```
This does exactly that. I am not following the objection now....
> This does exactly that. I am not following the objection now....
Indeed. But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and the STAC Item. What the notebook did before was not "better", just easier with fewer overrides, but it was still a temporary implementation until the logic is entirely self-contained by stac-populator.
> > This does exactly that. I am not following the objection now....
>
> Indeed. But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and the STAC Item. What the notebook did before was not "better", just easier with fewer overrides, but it was still a temporary implementation until the logic is entirely self-contained by stac-populator.
Okay. I will leave it for a later refactoring of code. Right now, I need to get this thing working and deploy it for our use.
@fmigneault Any idea why this "https://psl.noaa.gov/thredds/ncml/Datasets/NARR/Dailies/pressure/air.197901.nc?catalog=http://psl.noaa.gov/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.html&dataset=Datasets/NARR/Dailies/pressure/air.197901.nc" catalog does not contain the same information as "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"?
Both have query parameters, but they behave differently. While the Ouranos one gives service URLs, the other one does not.
Maybe it is the versions?
https://psl.noaa.gov/thredds/catalog/catalog.xml reports `version="1.2"`, while https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml reports `version="1.0.1"`.
Otherwise, maybe the contents of the NetCDF files themselves differ. I am not familiar enough with the metadata those files contain. @huard ?
Maybe it is a version thing.
That complicates things. I can't use the query-parameter-based link anymore because different servers may behave differently. I will revert to using access links by opening the catalog itself using siphon. My tests show that approach works with both of these servers.
Check if you can get the same results by overriding `siphon.catalog.TDSCatalog._process_dataset` with a no-op method, or something that limits how much crawling it does on `__init__`. If the THREDDS server has a lot of datasets, it starts parsing everything right away, which is quickly inefficient if the purpose is only to describe a single NCML file or a subset of a specific dataset. I tried with `siphon.catalog.TDSCatalog("psl.noaa.gov/thredds/catalog/catalog.xml")` and it takes some time to load 3759 datasets... A rough sketch of the idea follows.
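For illustration, an untested sketch of that override (it assumes siphon's `TDSCatalog` keeps the private `_process_dataset` hook, and it may break catalog features that rely on the processed datasets):

```python
from siphon.catalog import TDSCatalog

class ShallowTDSCatalog(TDSCatalog):
    """Hypothetical catalog that skips eager per-dataset processing."""

    def _process_dataset(self, element):
        # No-op: do not resolve access URLs for every dataset during __init__.
        pass

# Loads the catalog structure without paying the cost of processing
# thousands of datasets up front.
catalog = ShallowTDSCatalog("https://psl.noaa.gov/thredds/catalog/catalog.xml")
```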
Even using `siphon.catalog.TDSCatalog`, I think you will need the query parameter, because the URL of the NCML itself (pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc) does not resolve the reference of its parent dataset.
So I don't understand why there are differences with and without query parameters. I found out empirically by inspecting the URL of the NcML service exposed in the catalog web page, and have been using it since because the response contains more information.
Closed as this was getting too long. Will revisit outstanding issues in future PRs.
Here is my proposal for finalizing the architecture of STAC populator. I have chosen to branch from `arch-changes` rather than overlay my changes onto that branch, so once we finalize this PR we can merge it back into `arch-changes` and close PR #16.

There is currently only one implementation, for CMIP6 data residing on the UofT node. We expect more implementations to follow soon. Ouranos will probably want to have an implementation for CMIP6 data on the Ouranos node. I am also working on two other implementations, one for the NEX-GDDP data and one that populates our catalog with data from other catalogs (this implementation can probably be used verbatim by more than one Marble node).
Key changes:
- **Extracting pydantic models into a separate file.** I moved the Pydantic models from `stac_utils.py` to a new `models.py` file.

- **Removing cmip6 extension.** After examining the code, I came to the opinion that the cmip6 extension was not adding much value and could be removed. Here's why: in `create_stac_item`, all appropriate CMIP6 properties were already added to the `pystac.Item` instance from the Pydantic data model that describes the CMIP6 Item properties (`CMIP6ItemProperties` in `add_CMIP6.py`). Since the extension was simply adding properties to the Item that were already being added by the Pydantic model, in that form it was not adding any additional value.

- **Breaking up `CFJsonItem`.** I felt that `CFJsonItem` was doing two things that didn't quite fit together within one class. On the one hand it contained logic for the creation of a new `pystac.Item` object from CF metadata, and on the other hand there was logic specifically for use by the `pystac.extensions.datacube` extension. The former is applicable to all items that are created by crawling the THREDDS catalog, but the `datacube` extension code is not applicable to all STAC items, because the extension itself is not applicable to all types of data (or maybe one doesn't want to use the extension even for data to which it is applicable). Therefore, having the extension logic in a class that gets instantiated for every catalog item didn't make sense to me.

  So, I dissolved `CFJsonItem` and extracted its logic as follows: code that pertains to the creation of a STAC item (more specifically, a `pystac.Item` class representing a STAC item) is in a new function in `stac_utils.py` called `STAC_item_from_metadata`, while code meant to support the `datacube` extension is now in a new file, `extensions.py`, in the `CMIP6-UofT` implementations folder. I moved that code to the implementations folder rather than the core section because I am unsure how generally applicable it would be to other types of model output data, but this could be changed.