gdcc / easyDataverse

🪐 - Lightweight Dataverse interface in Python to upload, download and update datasets found in Dataverse installations.
MIT License

error when loading dataset with date format "yyyy" #32

Closed kbrueckmann closed 1 month ago

kbrueckmann commented 1 month ago

I'm updating files in a dataset without touching anything else. The dataset has a set "time period" in its metadata with these values:

Start Date: 1594 End Date: 1636

When loading the dataset they apparently lead to a ValidationError (I assume because only a year is given):

      File "venv/lib/python3.12/site-packages/easyDataverse/dataverse.py", line 315, in load_dataset
        self._construct_block_classes(blocks, dataset)
      File "venv/lib/python3.12/site-packages/easyDataverse/dataverse.py", line 416, in _construct_block_classes
        dataset.metadatablocks[name] = metadatablock.class.model_validate(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "venv/lib/python3.12/site-packages/pydantic/main.py", line 596, in model_validate
        return cls.__pydantic_validator__.validate_python(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    pydantic_core._pydantic_core.ValidationError: 2 validation errors for Citation
    time_period_covered.0.start
      Datetimes provided to dates should have zero time - e.g. be exact dates [type=date_from_datetime_inexact, input_value='1594', input_type=str]
        For further information visit https://errors.pydantic.dev/2.9/v/date_from_datetime_inexact
    time_period_covered.0.end
      Datetimes provided to dates should have zero time - e.g. be exact dates [type=date_from_datetime_inexact, input_value='1636', input_type=str]
        For further information visit https://errors.pydantic.dev/2.9/v/date_from_datetime_inexact

Is there any way to change that behavior?

kbrueckmann commented 1 month ago

Forgot to mention: I cannot change the date to something like "01.01.1594" because Dataverse won't accept that in that field. Otherwise I get this error message: Time Period Start Date is not a valid date. "yyyy" is a supported format.

pdurbin commented 1 month ago

Interesting. Indeed I can enter these values fine at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/RN55IT

Screenshot 2024-10-15 at 4 25 53 PM

kbrueckmann commented 1 month ago

Yes, entering them is no problem. What happens if you now try to load this dataset via the load_dataset() function?

JR-1991 commented 1 month ago

@kbrueckmann, thank you for bringing up this issue! It’s a known limitation of Python’s datetime.date type when used with pydantic: it requires a full date (year, month and day) and doesn’t support year-only entries.
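The limitation can be reproduced with the standard library alone. The sketch below uses only datetime; the helper name parse_full_date is made up for this illustration and is not part of easyDataverse or pydantic. The pydantic error type above (date_from_datetime_inexact) suggests the string was first read as a datetime, presumably as a Unix timestamp, but that is an assumption and not confirmed in this thread.

```python
from datetime import date
from typing import Optional


def parse_full_date(value: str) -> Optional[date]:
    """Return a date for a complete ISO string, or None if it cannot be parsed.

    Hypothetical helper for illustration only.
    """
    try:
        return date.fromisoformat(value)
    except ValueError:
        # A date object always needs year, month and day,
        # so a bare year such as "1594" ends up here.
        return None


print(parse_full_date("1594-01-01"))  # a complete date parses
print(parse_full_date("1594"))        # a bare year does not
```
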

There’s an open PR (#27) that resolves this by reverting to a str input. Due to the variety of date formats in Dataverse, using the date type has become impractical. I’ll be reviewing and merging the open PRs over the next two weeks for the upcoming release, which will include all the new features as well as the fixes.
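A string field can still be checked for its shape before it is sent to Dataverse. The following is a minimal sketch and not the code from PR #27. It assumes that only the formats yyyy, yyyy-MM and yyyy-MM-dd need to be accepted (this thread only confirms yyyy), and the name looks_like_dataverse_date is hypothetical.

```python
import re

# Assumed formats: "yyyy", "yyyy-MM" and "yyyy-MM-dd".
# Only "yyyy" is confirmed by the error message quoted in this thread.
_DATE_PATTERN = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def looks_like_dataverse_date(value: str) -> bool:
    """Check the shape of a date string; does not check calendar validity."""
    return _DATE_PATTERN.fullmatch(value) is not None


for candidate in ("1594", "1636-05", "1636-05-17", "01.01.1594"):
    print(candidate, looks_like_dataverse_date(candidate))
```

The check is purely syntactic, so a value like 1636-02-31 would still pass; the server remains the authority on what it accepts.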

JR-1991 commented 1 month ago

I have merged the PR and the fix is now available on the main branch. You can install the updated version with the following command:

    pip install git+https://github.com/gdcc/easyDataverse.git

Here is a colab notebook that uses the current version and assigns the time period via strings. Loading the dataset now also works:

image

kbrueckmann commented 1 month ago

Thanks for your quick replies and help, @pdurbin and @JR-1991! I just tested the fix (after the pip install, of course), but I'm still having difficulties. The rich.print() call in my code below is never reached, because loading the dataset fails with the same ValidationError as before. I think the difference from the shared colab might be that I'm not setting the time period values but only loading a dataset in which they were previously entered via the GUI (or the update to the fix somehow didn't work, but I got no error messages indicating that).

Here is what I'm doing:

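    import rich
    from easyDataverse import Dataverse

    # api_token and pid are defined elsewhere (see below)
    dataverse = Dataverse(
        server_url="https://heidata.uni-heidelberg.de/",
        api_token=api_token,
    )

    # Load only the metadata; skip downloading the files
    dataset = dataverse.load_dataset(
        pid=pid,
        download_files=False,
    )

    rich.print(dataset.citation)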

The pid is the string "https://doi.org/10.11588/data/DVU14P". I can't share my API token, but at least for fetching data this one should work: 637c97c7-042e-4f00-b597-3736f07fe8a4 .

JR-1991 commented 1 month ago

@kbrueckmann thanks for sharing! I have tested your case and the issue stems from the wrong pid format. Dataverse expects the DOI in the format shown on your dataset page, not as a link. You can find it within the Citation metadata block:

image

When using doi:10.11588/data/DVU14P the code does not fail anymore and the dataset is printed as expected. I have also tested it with your API Token and it worked as well. I would suggest recreating your token to prevent any malicious use.

image
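For anyone who copies the link from the browser, the conversion can be scripted. This is a minimal sketch: normalize_pid is a hypothetical helper, not an easyDataverse function, and it only assumes that links on doi.org or dx.doi.org should become the doi: form used above.

```python
from urllib.parse import urlparse


def normalize_pid(pid: str) -> str:
    """Turn a doi.org link into the 'doi:...' form; leave other values unchanged.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(pid)
    is_link = parsed.scheme in ("http", "https")
    if is_link and parsed.netloc.lower() in ("doi.org", "dx.doi.org"):
        # The DOI itself is the URL path without the leading slash.
        return "doi:" + parsed.path.lstrip("/")
    return pid


print(normalize_pid("https://doi.org/10.11588/data/DVU14P"))  # doi:10.11588/data/DVU14P
```
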

Hope that helped. Please let me know if there are any other issues, happy to help 🙌

kbrueckmann commented 1 month ago

After changing the pid to the required format, I still had the same problem. To make sure it wasn't connected to any issues with the update, I set up a new venv and did a fresh install of the necessary packages, and now it's working perfectly. Thank you so much, @JR-1991!
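A stale environment like this can be spotted by checking which interpreter and package version are actually in use. The sketch below relies only on the standard library; installed_version is a hypothetical helper, and the version string of a git install may not differ from the last release, so treat the output as a hint rather than proof.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("interpreter:", sys.executable)
# True when running inside a virtual environment created with venv
print("in a venv:", sys.prefix != sys.base_prefix)
print("easyDataverse:", installed_version("easyDataverse"))
```
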