cedadev / search-futures

Future Search Architecture
BSD 2-Clause "Simplified" License

Create a wrapper module for esgf search #158

Open agstephens opened 2 years ago

agstephens commented 2 years ago

ACTIONS

from esgf_stac_client import ESGFStacClient

stac_api_url = "http://api.stac.ceda.ac.uk"

ec = ESGFStacClient(stac_api_url)

# List the projects/activities in the STAC catalog
colls = ec.get_collections()
print(colls)

# Do faceted search at dataset (items) level
result = ec.search(
    doctype="dataset", # "dataset" is our alias for STAC "item"
    source_id=["model1", "model2", "model3"], #  matches any
    experiment="historical",
    variable=["tasmax", "tasmin"],
    datetime="2022-01-01/..",
    bbox=...
)

print(result.matched()) # number of hits
# 18999

# result is a generator
dset = next(result)
print(dset)

for record in result: print(record)
...

# For a given Item (Dataset), you can get the assets (files)
print(dset.get_assets())

# Get a list of assets from a list of items
print([asset for item in items for asset in item.get_assets()])

# Get the files (assets) associated with the dataset
dset_id = dset.id
result = ec.search(
    doctype="file", # "file" is our alias for STAC "asset"
    item=[dset_id]  # can provide items or item IDs here
)

print(result.matched()) # number of hits
# 22

# result is a generator
file = next(result)
print(file)
print(file.url) # Get URL to file

# Search the files in a dataset by datetime
result = ec.search(
    doctype="file", # "file" is our alias for STAC "asset"
    item=[dset],  # can provide items or item IDs here
    datetime="2300-01-01T00:00:00Z/2800-12-01T00:00:00.000Z"
)

# Search files (assets) across everything in CMIP6 and CMIP5
result = ec.search(
    doctype="file", # "file" is our alias for STAC "asset"
    collection=["CMIP6", "CMIP5"] # Collection is the top-level name for project/activity
)
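
A minimal sketch of how the facet kwargs above might be turned into a single CQL2-text filter inside the wrapper. The helper name `facets_to_cql2_text` is hypothetical, and the mapping is an assumption: a list value means "matches any" (as in the `source_id` comment above) and separate facets are ANDed together. The property names the server actually expects may differ.

```python
def facets_to_cql2_text(**facets):
    """Build a CQL2-text filter from facet kwargs (hypothetical helper).

    Assumption: a list value means "matches any" within that facet,
    and separate facets are combined with AND.
    """
    def quote(value):
        # CQL2-text string literals use single quotes; double any embedded quote
        return "'" + str(value).replace("'", "''") + "'"

    clauses = []
    for name, value in facets.items():
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        if not values:
            continue  # an empty list places no constraint on this facet
        if len(values) == 1:
            clauses.append(f"{name} = {quote(values[0])}")
        else:
            joined = ", ".join(quote(v) for v in values)
            clauses.append(f"{name} IN ({joined})")
    return " AND ".join(clauses)


cql = facets_to_cql2_text(
    source_id=["model1", "model2", "model3"],
    experiment="historical",
    variable=["tasmax", "tasmin"],
)
print(cql)
```

Keeping the translation in one place would mean the facet kwargs and the "filter" kwarg end up in the same query language before the request is sent.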

More complex queries are possible

Use the "filter" kwarg for these.

NOTE: we will not expose the dictionary-type "filter" examples. We will only demonstrate the "cql2-text" approach.

NOTE: "filter" dictionaries are provided so that "POST" queries will work (with JSON payload).

Users might want to build more complex queries, combining:

AND
OR
NOT
WITHIN
>
<

Example:

result = ec.search(
    doctype="dataset", # "dataset" is our alias for STAC "item"
    q="precip*",  # will match any facet value matching that free text
    filter="(source_id='model1' OR source_id='model2') AND experiment='historical'", # Introduce CQL query language (AND binds tighter than OR, so group the OR explicitly)
    datetime="2022-01-01/.."
)
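
For reference, a rough sketch of what a "cql2-text" query of this kind could look like as a plain GET request to the STAC `/search` endpoint. It assumes the server implements the STAC API Filter extension (`filter` and `filter-lang` parameters) and a free-text `q` parameter. The filter string, with the OR grouped explicitly, is illustrative, and the URL is the one used earlier in this issue.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

stac_api_url = "http://api.stac.ceda.ac.uk"

params = {
    "q": "precip*",  # free text; assumes the server supports a "q" parameter
    "filter-lang": "cql2-text",
    "filter": "(source_id='model1' OR source_id='model2') AND experiment='historical'",
    # A raw STAC request needs RFC 3339 datetimes; ".." marks an open-ended interval
    "datetime": "2022-01-01T00:00:00Z/..",
}

# urlencode percent-encodes quotes, parentheses, slashes and spaces for us
search_url = f"{stac_api_url}/search?{urlencode(params)}"
print(search_url)

# Decoding the query string again recovers the original parameters
decoded = parse_qs(urlsplit(search_url).query)
print(decoded["filter"][0])
```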

How does "filter" interact with facet kwargs?

Default behaviour should be:

Note about pagination

The pystac client abstracts away pagination by returning a generator object.
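
To illustrate the idea (this is not the pystac client's actual code), here is a small self-contained sketch of a generator that hides pagination. `fetch_page` is a hypothetical stand-in for one HTTP request, and the page tokens and record names are made up.

```python
def fetch_page(token=None):
    """Stand-in for one HTTP request to the search endpoint (hypothetical)."""
    pages = {
        None: {"items": ["dset-1", "dset-2"], "next": "page-2"},
        "page-2": {"items": ["dset-3", "dset-4"], "next": "page-3"},
        "page-3": {"items": ["dset-5"], "next": None},
    }
    return pages[token]


def iter_results(fetch=fetch_page):
    """Yield records one at a time, requesting the next page only when needed."""
    token = None
    while True:
        page = fetch(token)
        yield from page["items"]
        token = page["next"]
        if token is None:
            break


result = iter_results()
first = next(result)      # triggers the first page request only
remaining = list(result)  # walks through the remaining pages
print(first, remaining)
```

The caller only ever sees a stream of records, so the wrapper's `search()` can return such a generator directly.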

ESGF Search API

Current docs: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html

The following keywords are currently used by the system - see later for usage examples:

Mahir-Sparkess commented 2 years ago

To select a subset of fields, you can apply source filtering to the queryset in stac-fastapi-elasticsearch, using the .source method from elasticsearch_dsl.

However, this raises a STAC error at the API level. To get around this, the source must include the datetime properties. The end result gives you what you want, plus datetimes and relational links.
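
A rough sketch of the workaround, written against the raw Elasticsearch request body rather than elasticsearch_dsl so that it runs without extra packages. The `_source` key is Elasticsearch's standard source-filtering option. The field names, the `build_source_fields` helper and the example query are assumptions for illustration and are not taken from stac-fastapi-elasticsearch.

```python
# Datetime fields assumed to be needed for a valid STAC response (hypothetical names)
REQUIRED_FIELDS = [
    "properties.datetime",
    "properties.start_datetime",
    "properties.end_datetime",
]


def build_source_fields(requested):
    """Return the requested fields plus the required datetime fields, without duplicates."""
    # dict.fromkeys keeps the first occurrence of each field and preserves order
    return list(dict.fromkeys(list(requested) + REQUIRED_FIELDS))


body = {
    "_source": build_source_fields(["id", "properties.source_id"]),
    "query": {"term": {"properties.experiment": "historical"}},
}
print(body["_source"])
```

With elasticsearch_dsl, the same field list would presumably be passed to `.source(...)` on the search object.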

Mahir-Sparkess commented 2 years ago

https://github.com/cedadev/esgf-stac-client