Open lmarini opened 1 year ago
When the time comes, the Elasticsearch documents are organized by ID, and files store the dataset_id in their document.
Thus, one should currently be able to POST something like:
feed_example = {
    "name": "My Dataset Extractor",
    "search": {
        "index_name": "clowder",
        "criteria": [{"field": "_id", "operator": "==", "value": "DATASET_ID"}],
    },
}
We might need to expand on this a bit, but the basic functionality to associate a feed with an individual dataset, or even a specific file when it is updated, is possible.
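To make the idea above a bit more concrete, here is a rough sketch of how such a feed payload could be built and wrapped in a POST request. The base URL, the `/api/v2/feeds` route, the bearer-token header, and the helper names `build_feed` and `build_request` are all assumptions for illustration and are not confirmed anywhere in this thread. The second feed matches on `dataset_id`, following the note that files store the dataset_id in their document.

```python
import json
import urllib.request

# Assumptions: the base URL and the feeds route are placeholders,
# not confirmed by this thread. Adjust to the actual deployment.
CLOWDER_URL = "http://localhost:8000"
FEEDS_PATH = "/api/v2/feeds"


def build_feed(name, field, value, index_name="clowder"):
    """Return a feed payload with a single equality criterion (hypothetical helper)."""
    return {
        "name": name,
        "search": {
            "index_name": index_name,
            "criteria": [{"field": field, "operator": "==", "value": value}],
        },
    }


def build_request(feed, token):
    """Prepare, but do not send, the POST request that would register the feed."""
    return urllib.request.Request(
        url=CLOWDER_URL + FEEDS_PATH,
        data=json.dumps(feed).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Feed matching one dataset by its document ID, as in the example above.
dataset_feed = build_feed("My Dataset Extractor", "_id", "DATASET_ID")

# Feed matching files that belong to that dataset, via their stored dataset_id.
file_feed = build_feed("Files In My Dataset", "dataset_id", "DATASET_ID")

request = build_request(dataset_feed, token="YOUR_TOKEN")
# Sending would be: urllib.request.urlopen(request)
```

Nothing is sent until `urlopen` is called, so the payload can be inspected first.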
We currently register extractors to run on datasets and files at the dataset level. It would be useful to be able to register an extractor to run automatically for a specific dataset or file, similar to what we do in v1 with spaces, but at the dataset level.
Some extractors have very specific uses and only work with very specific datasets. For example, when setting up an automatic ingestion pipeline from sensors in the field, we might want to concatenate files or clean them up in a very specific way the moment they hit Clowder. This would be one way to do that.