datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

update documentation regarding metadata extraction #244

Closed: Remi-Gau closed this issue 3 months ago

Remi-Gau commented 1 year ago

I was trying to follow this example:

http://docs.datalad.org/projects/catalog/en/latest/pipeline_description.html

Running this

DATASET_PATH="path/to/mydataset"
PIPELINE_PATH="path/to/extract_dataset_pipeline.json"
datalad meta-conduct "$PIPELINE_PATH" \
    traverser:"$DATASET_PATH" \
    traverser:dataset \
    traverser:True \
    extractor1:Dataset \
    extractor1:metalad_core \
    extractor2:Dataset \
    extractor2:metalad_studyminimeta \
    adder:True

with the suggested pipeline

{
  "provider": {
    "module": "datalad_metalad.provider.datasettraverse",
    "class": "DatasetTraverser",
    "name": "traverser",
    "arguments": [],
    "keyword_arguments": {}
  },
  "processors": [
    {
      "module": "datalad_metalad.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor1",
      "arguments": [],
      "keyword_arguments": {}
    },
    {
      "module": "datalad_metalad.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor2",
      "arguments": [],
      "keyword_arguments": {}
    },
    {
      "name": "adder",
      "module": "datalad_metalad.processor.add",
      "class": "MetadataAdder",
      "arguments": [],
      "keyword_arguments": {}
    }
  ]
}

gives this error

[ERROR  ] No module named 'datalad_metalad.provider'

Checking on the structure of the package:

https://github.com/datalad/datalad-metalad/tree/master/datalad_metalad/pipeline/provider

it seems that the module paths in the JSON should include the pipeline subpackage, for example:

datalad_metalad.pipeline.processor.extract
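
For anyone hitting the same "No module named" error: below is a minimal sketch, not part of metalad, that checks whether every module path named in a pipeline definition can be located in the current environment. The helper names module_exists and missing_modules are hypothetical; the only assumption about the pipeline file is the layout shown above (a "provider" object and a "processors" list, each with a "module" key).

```python
import importlib.util


def module_exists(dotted_name: str) -> bool:
    """Return True if the module can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path is missing.
        return False


def missing_modules(pipeline: dict) -> list:
    """List the 'module' values of a pipeline definition that cannot be found."""
    elements = [pipeline["provider"], *pipeline["processors"]]
    return [e["module"] for e in elements if not module_exists(e["module"])]


if __name__ == "__main__":
    # Stand-in pipeline using standard-library modules so the sketch runs anywhere.
    demo = {
        "provider": {"module": "json.decoder"},
        "processors": [{"module": "json.no_such_submodule"}],
    }
    print(missing_modules(demo))
```

Loading the real pipeline file with json.load and passing the result to missing_modules would report which paths need updating before running meta-conduct.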

Remi-Gau commented 1 year ago

The command should probably also be updated to something like this:

DATASET_PATH="path/to/mydataset"
PIPELINE_PATH="path/to/extract_dataset_pipeline.json"
datalad meta-conduct "$PIPELINE_PATH" \
    traverser.top_level_dir="$DATASET_PATH" \
    traverser.item_type=dataset \
    traverser.traverse_sub_datasets=True \
    extractor1.extractor_type=dataset \
    extractor1.extractor_name=metalad_core \
    extractor2.extractor_type=dataset \
    extractor2.extractor_name=metalad_studyminimeta \
    adder.aggregate=True

Remi-Gau commented 1 year ago

Though when running it with this I get:

[ERROR  ] 'list' object is not a mapping

Remi-Gau commented 1 year ago

And when relying on one of the examples from metalad, I get this:

datalad meta-conduct \
  extract_metadata \
  traverser.top_level_dir="$DATASET_PATH" \
  traverser.item_type=file \
  traverser.traverse_sub_datasets=True \
  extractor.extractor_type=file \
  extractor.extractor_name=metalad_example_file \
  adder.aggregate=True

[ERROR  ] A child process terminated abruptly, the process pool is not usable anymore

jsheunis commented 1 year ago

@Remi-Gau So sorry that I missed this! Not sure how that happened...

I will try to reproduce this and update the docs to match the current functionality.

jsheunis commented 1 year ago

@Remi-Gau I think the error you got:

[ERROR ] 'list' object is not a mapping

occurred because the value of the "arguments" key must be a dictionary rather than a list (the separate "keyword_arguments" key is no longer used). The pipeline object should instead be:

{
  "provider": {
    "module": "datalad_metalad.pipeline.provider.datasettraverse",
    "class": "DatasetTraverser",
    "name": "traverser",
    "arguments": {}
  },
  "processors": [
    {
      "module": "datalad_metalad.pipeline.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor1",
      "arguments": {}
    },
    {
      "module": "datalad_metalad.pipeline.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor2",
      "arguments": {}
    },
    {
      "name": "adder",
      "module": "datalad_metalad.pipeline.processor.add",
      "class": "MetadataAdder",
      "arguments": {}
    }
  ]
}

This worked for me.
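
For context, the wording of that error matches what Python raises when a list is unpacked with ** where a mapping is expected. The sketch below reproduces this in isolation and adds a hypothetical helper, normalize_pipeline, that rewrites an old-style definition so that "arguments" is a dictionary. How metalad consumes the value internally is an assumption here, and so is the choice to merge any "keyword_arguments" entries into "arguments"; neither is taken from the metalad source.

```python
def unpack_error_message(value):
    """Return the TypeError text produced by {**value}, or None if it works."""
    try:
        {**value}
    except TypeError as exc:
        return str(exc)
    return None


def normalize_element(element: dict) -> dict:
    """Copy a pipeline element so that 'arguments' is a dictionary."""
    fixed = {k: v for k, v in element.items() if k != "keyword_arguments"}
    arguments = element.get("arguments") or {}
    if isinstance(arguments, list):
        # Only an empty list can be converted safely; positional values
        # have no parameter names to map onto.
        raise ValueError("cannot convert non-empty positional arguments")
    # Assumption: former keyword arguments belong in the same dictionary.
    fixed["arguments"] = {**arguments, **(element.get("keyword_arguments") or {})}
    return fixed


def normalize_pipeline(pipeline: dict) -> dict:
    """Apply normalize_element to the provider and to every processor."""
    return {
        "provider": normalize_element(pipeline["provider"]),
        "processors": [normalize_element(p) for p in pipeline["processors"]],
    }


if __name__ == "__main__":
    print(unpack_error_message([]))  # a list is rejected
    print(unpack_error_message({}))  # a dictionary is accepted
    old = {
        "provider": {"name": "traverser", "arguments": [], "keyword_arguments": {}},
        "processors": [{"name": "adder", "arguments": [], "keyword_arguments": {}}],
    }
    print(normalize_pipeline(old))
```

The helper only touches the argument keys; module paths and class names are passed through unchanged.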

I updated the docs accordingly in this PR: https://github.com/datalad/datalad-catalog/pull/258

Let me know if this solves your issue.

jsheunis commented 3 months ago

Closed by https://github.com/datalad/datalad-catalog/pull/258