Public-nEUro / DataCatalogue

lists datasets available in the PublicnEUro brain imaging repository
https://publicneuro-catalogue.netlify.app/
Creative Commons Zero v1.0 Universal

datalad catalog-validate #1

Closed CPernet closed 2 months ago

CPernet commented 2 months ago

Errors are returned as soon as it reads the metadata. @jsheunis help please -- the JSON looks fine to my eyes but the validator complains :-(

datalad catalog-validate --metadata OpenNeuroPET_phantoms.json
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Expecting property name enclosed in double quotes: line 1 column 20 (char 19)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 10 (char 9)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 16 (char 15)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 15 (char 14)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 20 (char 19)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 8 (char 7)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 8 (char 7)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 13 (char 12)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 12 (char 11)]
catalog_validate(error): /indirect/openneuropet/DataCatalogue/PublicnEUro/metadata/PET-Phantoms [Extra data: line 1 column 11 (char 10)]
[42 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  catalog_validate (error: 52)

jsheunis commented 2 months ago

The first error looks like it gives an understandable description: Expecting property name enclosed in double quotes. But I haven't seen the rest before, looks like the metadata might contain extra fields that the validator sees as an error.

Is it possible for you to post/share the metadata that you are trying to validate? I.e. the content of the OpenNeuroPET_phantoms.json file?

CPernet commented 2 months ago

oh boy I forgot to link to the json .. here https://github.com/Public-nEUro/DataCatalogue/blob/master/metadata/PET-Phantoms/OpenNeuroPET_phantoms.json (euh yeah double quote, thx pal)

jsheunis commented 2 months ago

The problem is the format in which the metadata comes in. Your example has a JSON object spread over multiple lines of the file, and the catalog-validate command can take the following formats:

- a path to a file containing JSON lines
- JSON lines from STDIN
- a JSON serialized string

You can either change the file to a single line, or read it in on the command line and pass the line to catalog-validate. This should also work:

datalad catalog-validate -c . -m '{ "type": "dataset", "title": "OpenNeuroPET Phantoms", "description": "The PET Brain phantoms dataset is curated by OpenNeuroPET. This repository contains source data from PET scanners that need to be converted to nifti format, preferably following BIDS. An issue with ecat and DICOM data is that many tags are not mandatory and many values are not standardized making it difficult to harmonize the outputs of conversion, in particular side car json files. Here we collected many phantoms from different sites and scanners allowing to check the different header tags and values. This also allows validated conversion tools. Such test is performed via the code folder where one use the PET2BIDS library to do the conversion (which depends on dcm2niix for DICOM). Feel free to add to this with your own tool and submit the request to update the dataset.", "dataset_id": "datalad_hash_to_use", "dataset_version": "V1", "doi": "xxxx", "url": "xxxx", "keywords": [ "Positron Emission Tomography", "PET", "Brain", "Source files", "ecat7", "DICOM", "Phantom", "Conversion", "PET2BIDS", "dcm2niix" ], "license": { "name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/" }, "authors": [ { "givenName": "Cyril", "familyName": "Pernet" }, { "givenName": "Sune", "familyName": "Høgild Kelle" }, { "givenName": "Gabriel", "familyName": "Gonzalez-Escamilla" }, { "givenName": "Søren", "familyName": "Baarsgaard Hansen" }, { "givenName": "Maqsood", "familyName": "Yaqub" }, { "givenName": "Murat ", "familyName": "Bilgel" } ], "metadata_sources": { "key_source_map": {}, "sources": [ { "source_name": "OpenNeuroPET", "source_version": "1", "source_parameter": {}, "agent_name": "Cyril Pernet", "agent_email": "" } ] } }'

You will see an error when you run that, but that is because of a different validation issue (I received this error when running your code locally):

[Schema validation failed in LINE 1:

{} is not of type 'number'

Failed validating 'type' in schema['allOf'][0]['then']['properties']['metadata_sources']['properties']['sources']['items']['properties']['source_time']:
    {'description': 'The time (since epoch) when this source was used to '
                    'provide the applicable metadata',
     'title': 'Source time',
     'type': 'number'}

On instance['metadata_sources']['sources'][0]['source_time']:
    {}]

The issue here is that source_time should be a number and not {}. You can solve it in this case by removing the field from its containing object, since it is not a required field.
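
A minimal sketch of that fix in Python, assuming the metadata record has already been loaded as a dictionary. The helper name `drop_invalid_source_time` and the sample record are hypothetical and not part of datalad-catalog; the sketch only removes `source_time` from entries in `metadata_sources.sources` when its value is not a number.

```python
import json
from numbers import Number


def drop_invalid_source_time(record: dict) -> dict:
    """Return a copy of record without non-numeric source_time values."""
    cleaned = json.loads(json.dumps(record))  # deep copy via a JSON round trip
    sources = cleaned.get("metadata_sources", {}).get("sources", [])
    for source in sources:
        value = source.get("source_time")
        # bool is a Number subclass in Python, so exclude it explicitly
        is_number = isinstance(value, Number) and not isinstance(value, bool)
        if "source_time" in source and not is_number:
            del source["source_time"]
    return cleaned


# Hypothetical sample mirroring the failing instance from the error message
record = {
    "type": "dataset",
    "metadata_sources": {
        "key_source_map": {},
        "sources": [
            {"source_name": "OpenNeuroPET", "source_version": "1", "source_time": {}}
        ],
    },
}

fixed = drop_invalid_source_time(record)
print(json.dumps(fixed))
```

The original dictionary is left unchanged, so the cleaned copy can be compared against it before the file is rewritten.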

CPernet commented 2 months ago

Hi Stephan,

Thank you so much for helping. I'll get back to it tomorrow (fully booked today 😔). Just some quick feedback on the documentation: I did read 'a path to a file containing JSON lines', but I understood this as simply giving the path to a file. Does it mean a serialised JSON that is not indented? I think you should state that so it is explicit. Where it gets confusing is that the example is indented (like my file), so I expected it would work (fair enough, I did not try your example...). Maybe an example of each option would make it easier. (Happy to PR those if you want, since I have to try them and learn anyway.)

Again, thx man - it's gonna be great to have an EU repo with datalad and datacat!

jsheunis commented 2 months ago

Thanks for the feedback. JSON Lines is a known format where a single line is a single json object with no indentation. Perhaps it's a good idea to add a link in the doc to the json lines format. And I agree, if something is confusing in the docs, that's a reason to improve it. You are very welcome to make a PR, it would be much appreciated :)
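
To illustrate the difference, here is a small Python sketch using only the standard `json` module. It assumes a made-up, shortened record (not the real PET-Phantoms file) and a hypothetical helper `to_json_line`. It first parses the indented text one line at a time, the way a JSON Lines reader would, which fails on every line, and then re-serializes the whole object onto a single line.

```python
import json


def to_json_line(indented_text: str) -> str:
    """Serialize one JSON object as a single line (one JSON Lines record)."""
    obj = json.loads(indented_text)
    # ensure_ascii=False keeps non-ASCII names such as "Høgild Kelle" readable
    return json.dumps(obj, ensure_ascii=False)


# Hypothetical, shortened record written with indentation
indented = """{
  "type": "dataset",
  "name": "OpenNeuroPET Phantoms",
  "dataset_version": "V1"
}"""

# Treating each physical line as its own JSON document fails on every line
errors = []
for line in indented.splitlines():
    try:
        json.loads(line)
    except json.JSONDecodeError as exc:
        errors.append(exc.msg)

single_line = to_json_line(indented)
print(errors)
print(single_line)
```

Writing `single_line` plus a trailing newline to a file gives one record in JSON Lines form; a file with several records would contain one such line per object.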

Another note, I forgot to mention that the catalog_serve command, and any other commands taking metadata as an argument, can also accept Python dictionaries when working via the Python api.

CPernet commented 2 months ago

success :-) -- just the 'title' does not render, "title": "OpenNeuroPET Phantoms" yet it is there in https://github.com/Public-nEUro/DataCatalogue/blob/master/metadata/datalad_hash_to_use/V1/049/a1078084a97532a48a316e5bb30cd.json


jsheunis commented 2 months ago

There isn't actually a title property specified in the dataset schema. I suspect it doesn't give you a warning or error about this since extra properties aren't explicitly prohibited. But if you populate the name field instead, I think it should display correctly.
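
As a rough sketch of that change in Python, assuming the record is held as a dictionary: the helper `title_to_name` below is hypothetical (not a datalad-catalog function) and simply moves the value of `title` to `name` when no `name` is present.

```python
def title_to_name(record: dict) -> dict:
    """Return a copy of record with 'title' moved to 'name' if 'name' is absent."""
    updated = dict(record)  # shallow copy; the input is not modified
    if "title" in updated and "name" not in updated:
        updated["name"] = updated.pop("title")
    return updated


# Hypothetical, shortened record using the unsupported 'title' key
record = {"type": "dataset", "title": "OpenNeuroPET Phantoms", "dataset_version": "V1"}
renamed = title_to_name(record)
print(renamed)
```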

CPernet commented 2 months ago

indeed, solved! thx