abcd-j / data-catalog

https://data.abcd-j.de

DOI lookup is not yet possible in catalog-entry-generation code #13

Open tmheunis opened 6 months ago

tmheunis commented 6 months ago

I'm processing a new entry to the catalog and getting an error:

> python3 code/process_subdirectory.py data UKD/ocr-PIRA-cohort --dataset-type other --add-to-catalog

/Users/theunis/virtualenvs/abcdj-catalog/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Traceback (most recent call last):
  File "/Users/theunis/Documents/psyinf/abcdj/data-catalog/code/process_subdirectory.py", line 39, in <module>
    subds_tabby_records = get_tabby_metadata(
  File "/Users/theunis/Documents/psyinf/abcdj/data-catalog/code/get_tabby_metadata.py", line 159, in get_tabby_metadata
    cat_file = file_required_meta | process_file(file_info)
  File "/Users/theunis/Documents/psyinf/abcdj/data-catalog/code/utils.py", line 206, in process_file
    "path": f.get("path", {}).get("@value"),
AttributeError: 'str' object has no attribute 'get'
(abcdj-catalog) ➜  data-catalog git:(main) ✗ python3 code/process_subdirectory.py data UKD/ocr-PIRA-cohort --dataset-type other --add-to-catalog
/Users/theunis/virtualenvs/abcdj-catalog/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
add(ok): .datalad/tabby/self/subdatasets@tby-abcdjv0.tsv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
catalog_add(ok): /Users/theunis/Documents/psyinf/abcdj/data-catalog/catalog [Metadata record successfully added to catalog (dataset: dataset_id=1015ed7c-0a3d-4dfc-9c4f-11fe71673a41, dataset_version=0fbad673d51cc7d421b53d054d90e0184fd07530)]
catalog_add(ok): /Users/theunis/Documents/psyinf/abcdj/data-catalog/catalog [Metadata record successfully updated in catalog (dataset: dataset_id=1015ed7c-0a3d-4dfc-9c4f-11fe71673a41, dataset_version=0fbad673d51cc7d421b53d054d90e0184fd07530)]
catalog_add(error): /Users/theunis/Documents/psyinf/abcdj/data-catalog/catalog [Schema validation failed in LINE 1:

'title' is a required property

Failed validating 'required' in schema['allOf'][0]['then']['properties']['publications']['items']:
    {'properties': {'authors': {'$ref': 'https://datalad.org/catalog.authors.schema.json'},
                    'datePublished': {'description': 'The publication date '
                                                     'year',
                                      'title': 'Date published',
                                      'type': ['number', 'string']},
                    'doi': {'description': "The publication's digital "
                                           'object identifier',
                            'title': 'DOI',
                            'type': 'string'},
                    'publicationOutlet': {'description': 'The publication '
                                                         'outlet / venue, '
                                                         'such as the '
                                                         'journal, '
                                                         'publisher name, '
                                                         'or news outlet',
                                          'title': 'Publication outlet',
                                          'type': 'string'},
                    'title': {'description': 'Title of the publication',
                              'title': 'Title',
                              'type': 'string'},
                    'type': {'description': 'Type of publication, such as '
                                            'a scholarly article, book, '
                                            'blog post',
                             'title': 'Type',
                             'type': 'string'}},
     'required': ['title', 'doi', 'authors'],
     'type': 'object'}

On instance['publications'][0]:
    {'@type': 'schema:CreativeWork',
     'datePublished': '2023-09-11',
     'doi': 'https://doi.org/10.1038/s41598-023-40940-w',
     'schema:url': 'https://www.nature.com/articles/s41598-023-40940-w'}]
Traceback (most recent call last):
  File "/Users/theunis/Documents/psyinf/abcdj/data-catalog/code/process_subdirectory.py", line 153, in <module>
    catalog_add(
  File "/Users/theunis/virtualenvs/abcdj-catalog/lib/python3.9/site-packages/datalad/interface/base.py", line 773, in eval_func
    return return_func(*args, **kwargs)
  File "/Users/theunis/virtualenvs/abcdj-catalog/lib/python3.9/site-packages/datalad/interface/base.py", line 763, in return_func
    results = list(results)
  File "/Users/theunis/virtualenvs/abcdj-catalog/lib/python3.9/site-packages/datalad_next/patches/interface_utils.py", line 287, in _execute_command_
    raise IncompleteResultsError(
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully. 1 failed:
[{'action': 'catalog_add',
  'exception': <ValidationError: "'title' is a required property">,
  'exception_traceback': '[add.py:__call__:199,validators.py:validate:438]',
  'message': 'Schema validation failed in LINE 1: \n'
             '\n'
             "'title' is a required property\n"
             '\n'
             "Failed validating 'required' in "
             "schema['allOf'][0]['then']['properties']['publications']['items']:\n"
             "    {'properties': {'authors': {'$ref': "
             "'https://datalad.org/catalog.authors.schema.json'},\n"
             "                    'datePublished': {'description': 'The "
             "publication date '\n"
             "                                                     'year',\n"
             "                                      'title': 'Date "
             "published',\n"
             "                                      'type': ['number', "
             "'string']},\n"
             '                    \'doi\': {\'description\': "The '
             'publication\'s digital "\n'
             "                                           'object identifier',\n"
             "                            'title': 'DOI',\n"
             "                            'type': 'string'},\n"
             "                    'publicationOutlet': {'description': 'The "
             "publication '\n"
             "                                                         'outlet "
             "/ venue, '\n"
             "                                                         'such "
             "as the '\n"
             '                                                         '
             "'journal, '\n"
             '                                                         '
             "'publisher name, '\n"
             "                                                         'or "
             "news outlet',\n"
             "                                          'title': 'Publication "
             "outlet',\n"
             "                                          'type': 'string'},\n"
             "                    'title': {'description': 'Title of the "
             "publication',\n"
             "                              'title': 'Title',\n"
             "                              'type': 'string'},\n"
             "                    'type': {'description': 'Type of "
             "publication, such as '\n"
             "                                            'a scholarly "
             "article, book, '\n"
             "                                            'blog post',\n"
             "                             'title': 'Type',\n"
             "                             'type': 'string'}},\n"
             "     'required': ['title', 'doi', 'authors'],\n"
             "     'type': 'object'}\n"
             '\n'
             "On instance['publications'][0]:\n"
             "    {'@type': 'schema:CreativeWork',\n"
             "     'datePublished': '2023-09-11',\n"
             "     'doi': 'https://doi.org/10.1038/s41598-023-40940-w',\n"
             "     'schema:url': "
             "'https://www.nature.com/articles/s41598-023-40940-w'}",
  'path': PosixPath('/Users/theunis/Documents/psyinf/abcdj/data-catalog/catalog'),
  'status': 'error'}]

The immediate error is 'title' is a required property, which is raised by the catalog schema validation. The root cause, however, is that the tabby record for the publication contains only a DOI and no citation, so no title or authors are available when the catalog entry is generated. The user docs state that a citation is optional if a DOI is supplied (which makes sense from a user perspective), but the code does not yet handle this scenario; see https://github.com/abcd-j/data-catalog/blob/main/code/utils.py#L86-L118.
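One possible direction, as a rough sketch only and not the current code in utils.py: before the record is passed to catalog_add, fill in the missing required fields by resolving the DOI. The function names complete_publication and fetch_csl_json are hypothetical, the lookup relies on DOI content negotiation for CSL JSON at doi.org, and the author keys givenName/familyName are an assumption about what the catalog authors schema expects.

```python
import json
import urllib.request

# Taken from the 'required' list in the schema shown in the error above.
REQUIRED_PUBLICATION_FIELDS = ("title", "doi", "authors")


def fetch_csl_json(doi_url):
    """Hypothetical helper: fetch CSL JSON for a DOI URL (needs network access)."""
    request = urllib.request.Request(
        doi_url,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def complete_publication(publication, lookup=fetch_csl_json):
    """Return a copy of `publication` with a missing title/authors filled via its DOI.

    `lookup` is injectable so the function can be tested without network access.
    """
    completed = dict(publication)
    missing = [k for k in REQUIRED_PUBLICATION_FIELDS if not completed.get(k)]
    # Nothing to do, or no DOI to resolve: leave the record unchanged.
    if not missing or "doi" in missing:
        return completed

    doi = completed["doi"]
    doi_url = doi if doi.startswith("http") else f"https://doi.org/{doi}"
    csl = lookup(doi_url)

    if "title" in missing and csl.get("title"):
        completed["title"] = csl["title"]
    if "authors" in missing and csl.get("author"):
        # Assumed key names; adjust to whatever the authors schema requires.
        completed["authors"] = [
            {"givenName": a.get("given", ""), "familyName": a.get("family", "")}
            for a in csl["author"]
        ]
    return completed
```

If the lookup returns nothing usable, the record is left incomplete and schema validation would still fail as it does now, which seems preferable to silently inventing a title.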

jsheunis commented 6 months ago

Thanks for picking this up. I agree, this should be added to the catalog-generation code.