clowder-framework / clowder2

Clowder v2 (in development)
Apache License 2.0

Register automatic execution per dataset #749

Open lmarini opened 1 year ago

lmarini commented 1 year ago

We currently register extractors to run on datasets and files at the dataset level. It would be useful to be able to register an extractor to run automatically for a specific dataset or file. This is similar to what we do in v1 with spaces, but at the dataset level.

Some extractors have very specific uses and only work with very specific datasets. For example, when setting up an automatic ingestion pipeline from sensors in the field, we might want to concatenate files or clean them up in a very specific way the moment they hit Clowder. Per-dataset registration would be one way to do that.

max-zilla commented 1 year ago

When the time comes to implement this: the Elasticsearch documents are organized by ID, and each file document stores its dataset_id.

Thus, one should currently be able to POST something like:

feed_example = {
    "name": "My Dataset Extractor",
    "search": {
        "index_name": "clowder",
        "criteria": [{"field": "_id", "operator": "==", "value": "DATASET_ID"}],
    },
}

We might need to expand on this a bit, but the basic functionality to associate an extractor with an individual dataset, or even a specific file when it is updated, is already possible.
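
As a rough sketch of how the two cases could be expressed, the snippet below builds one feed that matches a dataset by its own ID and one that matches files through their dataset_id field. The helper name build_feed and the second feed's name are hypothetical, and using "dataset_id" as a criteria field is an assumption based on the note above that files store it in their document. The snippet only builds and serializes the payloads; the feed endpoint is not named in this thread, so the POST itself is left out.

```python
import json


def build_feed(name, field, value, index_name="clowder"):
    """Build a feed payload with a single equality criterion.

    Hypothetical helper: it mirrors the shape of feed_example above.
    """
    return {
        "name": name,
        "search": {
            "index_name": index_name,
            "criteria": [{"field": field, "operator": "==", "value": value}],
        },
    }


# Match the dataset document itself by its ID (same as feed_example above).
dataset_feed = build_feed("My Dataset Extractor", "_id", "DATASET_ID")

# Assumption: match files in that dataset through the dataset_id they store.
file_feed = build_feed("My Dataset File Extractor", "dataset_id", "DATASET_ID")

# JSON body that would be sent in the POST request.
payload = json.dumps(dataset_feed)
```

Keeping the criterion field as a parameter means the same payload shape covers a dataset, the files in a dataset, or a single file matched by its own "_id".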