ImperialCollegeLondon / safedata_validator

Python tools to validate and publish datasets using the safedata metadata format.
https://safedata-validator.readthedocs.io/
MIT License

Publication process generates `UnicodeDecodeError` on Windows #180

Closed jacobcook1995 closed 2 weeks ago

jacobcook1995 commented 1 month ago

On Windows, publishing some datasets has resulted in the following error.

C:\Users\samle>safedata_zenodo publish_dataset "C:\Users\samle\Desktop\SEARPP_Datasets\GTE_Data\climate\dummy_data.xlsx" "C:\Users\samle\Desktop\SEARPP_Datasets\GTE_Data\climate\dummy_data.json" --sandbox --no-xml
- Configuring Resources
    - Configuring resources from user_path: C:\Users\samle\AppData\Local\safedata_validator\safedata_validator.cfg
    - Validating gazetteer: C:\Users\samle\Desktop\SEARPP_Datasets\SEARPP_Data_Validation\SEARPP_spatial_gazetteer.geojson
    - Validating location aliases: C:\Users\samle\Desktop\SEARPP_Datasets\SEARPP_Data_Validation\SEARPP_location_aliases.csv
    - Validating GBIF database: C:\Users\samle\Desktop\SEARPP_Datasets\SEARPP_Data_Validation\gbif_backbone_2023-08-28.sqlite
    - Validating NCBI database: C:\Users\samle\Desktop\SEARPP_Datasets\SEARPP_Data_Validation\ncbi_taxonomy_2024-01-01.sqlite
    - Configuration does not use project IDs.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\samle\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\safedata_zenodo.exe\__main__.py", line 7, in <module>
  File "C:\Users\samle\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\safedata_validator\entry_points.py", line 987, in _safedata_zenodo_cli
    dataset_json = simplejson.load(ds_json)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\samle\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\simplejson\__init__.py", line 452, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2544.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 655: character maps to <undefined>

I was not able to replicate this issue on macOS, so I believe it is a Windows-specific issue. My current guess is that it is caused by Windows and Unix systems using different default encodings: cp1252 and utf-8, respectively.
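To illustrate the encoding guess above, here is a minimal sketch of how a platform default of cp1252 can fail on a UTF-8 file, and how passing `encoding="utf-8"` explicitly avoids depending on that default. It uses the standard library `json` module rather than `simplejson` to stay self-contained, and the helper names and sample data are hypothetical, not taken from safedata_validator.

```python
import json
import tempfile
from pathlib import Path


def dump_json_utf8(data, path):
    """Write JSON with an explicit UTF-8 encoding, independent of the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False)


def load_json_utf8(path):
    """Read JSON with an explicit UTF-8 encoding, independent of the platform default."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_json_cp1252(path):
    """Read the file as a cp1252 default would; included for comparison only."""
    with open(path, encoding="cp1252") as handle:
        return json.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.json"
        # "č" (U+010D) is stored as the bytes 0xC4 0x8D in UTF-8, and
        # byte 0x8D has no mapping in Python's cp1252 codec.
        dump_json_utf8({"author": "Kovač"}, path)
        print(load_json_utf8(path))
        try:
            load_json_cp1252(path)
        except UnicodeDecodeError as err:
            print(f"cp1252 read failed: {err}")
```

Any file containing a byte that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90 or 0x9D) would trigger the same kind of failure, whether or not the file is valid JSON.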

jacobcook1995 commented 3 weeks ago

This error doesn't actually seem to be Windows-specific. Instead, it seems to arise when the arguments to `safedata_zenodo publish_dataset` are provided in the wrong order, i.e. Example.xlsx first rather than Example.json, so that the Excel file is passed to the JSON loader.

I guess we want to add an informative error in this case, and to make it clearer in the documentation that this order is required?
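One way such an informative error could look is sketched below. The function name `check_publish_arguments` and its parameter names are hypothetical, and the check only inspects file extensions, which is an assumption about what would be good enough here rather than how safedata_validator does or should validate its inputs.

```python
from pathlib import Path


def check_publish_arguments(dataset_json, dataset_file):
    """Return both paths, or raise a readable error if the first is not a JSON file."""
    json_path = Path(dataset_json)
    data_path = Path(dataset_file)

    if json_path.suffix.lower() != ".json":
        message = (
            f"Expected a .json metadata file as the first argument, "
            f"got '{json_path.name}'."
        )
        # If the second argument is the JSON file, the order was probably swapped.
        if data_path.suffix.lower() == ".json":
            message += " The two arguments look swapped."
        raise ValueError(message)

    return json_path, data_path
```

Called as `check_publish_arguments("Example.xlsx", "Example.json")`, this raises a `ValueError` that names the problem directly, instead of letting the failure surface later as a decoding error.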

I've already made some changes to the release branch to enforce utf-8 encoding. I'm not sure whether it makes sense to revert these or not?

davidorme commented 3 weeks ago

I'd leave the utf8 commits, I think. Just check that it is also enforced when writing? I should have spotted this order error. We could make those arguments named rather than positional, or we could add warnings for the order. I think I'd lean towards explicit flags, as it is too easy to make a mistake, but then I'm not sure whether that makes them optional or not.
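On the question of whether explicit flags have to be optional: assuming the CLI is built on `argparse`, a flagged argument can be marked `required=True`, so it is named but still mandatory. The sketch below uses hypothetical flag names (`--dataset-json`, `--dataset-file`) and a hypothetical program name; it is not the existing `safedata_zenodo` interface.

```python
import argparse


def build_parser():
    """Build a parser whose two inputs are named flags that must both be given."""
    parser = argparse.ArgumentParser(prog="publish_dataset_sketch")
    parser.add_argument(
        "--dataset-json",
        required=True,
        help="Path to the JSON metadata file (hypothetical flag name).",
    )
    parser.add_argument(
        "--dataset-file",
        required=True,
        help="Path to the dataset file, e.g. an .xlsx file (hypothetical flag name).",
    )
    return parser


if __name__ == "__main__":
    # The order of the flags on the command line no longer matters.
    args = build_parser().parse_args(
        ["--dataset-file", "Example.xlsx", "--dataset-json", "Example.json"]
    )
    print(args.dataset_json, args.dataset_file)
```

With this approach, omitting either flag makes `argparse` exit with a usage error, and the order in which the two paths are supplied stops being a source of mistakes.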