gdcc / pyDataverse

Python module for Dataverse Software (dataverse.org).
http://pydataverse.readthedocs.io/
MIT License

Encoding issue using create_dataset() #55

Open sindribaldur opened 3 years ago

sindribaldur commented 3 years ago

Using a Python script with create_dataset(), I created a new dataset on demo.dataverse.org (and on one more Dataverse server).

api = Api(base_url = dvserver, api_token = dvtoken)
api.create_dataset("1", dsmd)

Here dsmd is the content of dataset-finch1.json, which is linked in the documentation, as a string (for my last test I used a slightly modified version of it).

dsmd = """{
  "datasetVersion": {
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": "Dörwin's Fænches",
     .
     .
     .
"""

Everything seems to work fine, but non-ASCII characters are not displayed (they are replaced with a placeholder character), neither when I open the dataverse in the browser nor when I download it back with get_dataset().

I'm on Windows 10 with Python 3.6.4 and pyDataverse 0.2.1. I tried to run it as a script from the command line and in Spyder with the same result.

skasberger commented 3 years ago

Thanks for your issue!

Can you please try this again with the latest commit from the develop branch? (see here).

Here are the develop branch and the docs for it. The next release, which contains a lot of changes, will be released by the end of the year, and hopefully those changes solve the problem. If not, I will try to reproduce and fix it.

Also, please check at which point the encoding problems start to appear: is the imported JSON string already wrong, or does it happen after the create_dataset() call?
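A minimal sketch of such a check, using only the Python standard library. The helper name check_metadata_string and the shortened sample string are hypothetical and only illustrate the idea: if the string parses as JSON, contains no U+FFFD replacement character, and survives a UTF-8 round trip, the damage most likely happens later.

```python
import json


def check_metadata_string(raw: str) -> dict:
    """Report whether a JSON metadata string looks intact before it is uploaded.

    Hypothetical helper for illustration; it is not part of pyDataverse.
    """
    parsed = json.loads(raw)  # raises ValueError if the JSON is malformed
    utf8_bytes = raw.encode("utf-8")
    return {
        # U+FFFD would mean the text was already damaged when it was read in.
        "has_replacement_char": "\ufffd" in raw,
        # The non-ASCII characters that must survive the upload.
        "non_ascii_chars": sorted({ch for ch in raw if ord(ch) > 127}),
        # Encoding to UTF-8 and decoding again must give back the same text.
        "utf8_round_trip_ok": utf8_bytes.decode("utf-8") == raw,
        "parsed": parsed,
    }


# Shortened stand-in for the metadata string from the report above.
sample = '{"value": "Dörwin\'s Fænches"}'
report = check_metadata_string(sample)
print(report["has_replacement_char"], report["non_ascii_chars"])
```

When reading the JSON from a file on Windows, passing encoding="utf-8" to open() avoids depending on the platform default encoding.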

poikilotherm commented 3 years ago

Beware this might be related to IQSS/dataverse#6675 and some other issues (older & newer) I linked to from there.

I dug a bit and the plot thickens around encoding issues in Jersey. Yet it would be good to have verification that it's not the lib that's causing problems.

sindribaldur commented 3 years ago

I cloned this repository and reran the script, importing src/pyDataverse/api.py, but it didn't fix the issue. The JSON string also seems to be OK within Python.

The high number of related open issues over at IQSS/dataverse suggests that it has to do with some aspect of Dataverse itself. However, when I create a dataset through the user interface, I can enter the characters that get lost via the API upload.

sindribaldur commented 3 years ago

I'm now creating datasets directly with curl and the instructions from here and have not come across any encoding issues.
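For comparison, a Python sketch of a request that sends the metadata as explicitly UTF-8 encoded bytes. It is built with the standard library only and is not sent here. The endpoint path and the X-Dataverse-key header follow the Dataverse native API guide as I understand it; the function name, the token value, and the shortened metadata are placeholders, and this is not how pyDataverse builds its request.

```python
import urllib.request


def build_create_dataset_request(base_url, api_token, dataverse_alias, metadata_json):
    """Build (but do not send) a POST request with an explicitly UTF-8 encoded body.

    Hypothetical helper for illustration; the URL layout is an assumption
    based on the Dataverse native API guide.
    """
    url = f"{base_url.rstrip('/')}/api/dataverses/{dataverse_alias}/datasets"
    body = metadata_json.encode("utf-8")  # bytes, so no implicit codec is applied later
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json; charset=utf-8",
        },
    )


req = build_create_dataset_request(
    "https://demo.dataverse.org",
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder token
    "1",
    '{"value": "Dörwin\'s Fænches"}',  # shortened stand-in for the real metadata
)
print(req.full_url)
print(req.data)
```

Passing the request to urllib.request.urlopen() would send it. Comparing the resulting dataset with the one created via curl would show whether the byte encoding of the request body makes a difference.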

skasberger commented 3 years ago

Please send me the dataset json file (contact information here: stefankasberger.at).

skasberger commented 3 years ago

@sindribaldur I have tested your script with the latest develop version and a local Dataverse Docker instance (4.18.1), and got the same problem as you. The dataset is created, but with many special characters missing (as you can see from the screenshot). The uploaded string is formatted the same as the string inside your script. When I then request the dataset with pyDataverse (get_dataset()), the returned metadata contains the same placeholder characters as seen on the frontend. So the problem seems to occur after the create_dataset() call.

@pdurbin Do you know of this problem? It seems to be on the Dataverse side.

And @sindribaldur: I found one error in the JSON string in the testy.py file you sent me. At "Þvæla Tilraun`, the apostrophe at the end is missing. Just in case. :)

Screenshot_2021-01-12 �v�la Tilraun
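One possible explanation for this pattern, which is not confirmed anywhere in this thread: the text is encoded as Latin-1 at some point and later decoded as UTF-8. The sketch below only demonstrates that mechanism with the standard library. It does not show where, or whether, this happens in pyDataverse or Dataverse.

```python
title = "Þvæla Tilraun"

# The same text produces different bytes under the two encodings.
latin1_bytes = title.encode("latin-1")
utf8_bytes = title.encode("utf-8")

# Decoding Latin-1 bytes as UTF-8 fails for every non-ASCII character;
# errors="replace" substitutes U+FFFD for each invalid byte.
misread = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)
print(utf8_bytes)
print(misread)  # each non-ASCII character becomes U+FFFD
```

The misread result has the same shape as the title in the screenshot: every non-ASCII character is replaced and the ASCII characters are untouched.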

pdurbin commented 3 years ago

@skasberger hi, plenty of encoding problems have surfaced over the years but I'm not aware of this being a problem on the Dataverse side. It sounds like @sindribaldur got it working with curl. @sindribaldur would you be interested in trying it from https://github.com/IQSS/dataverse-client-javascript or https://github.com/IQSS/dataverse-client-r ? Those are the other two libraries that are quite active.

sindribaldur commented 3 years ago

@skasberger Thanks for getting back - I hope the example helps. @pdurbin Thank you, I got everything that I needed working directly with curl and guess I will continue using it like that if needed in the future. I had tried one of these packages before and also hit a wall.

pdurbin commented 7 months ago

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python