Roche / pyreadstat

Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.

When using write_xport, the Korean language is broken #217

Closed HongjigLee closed 1 year ago

HongjigLee commented 1 year ago

I'm using this program, and thank you for making it. I have a problem, though: when using write_xport, if the pandas data frame contains Korean, the Korean text comes out broken. How do I solve this problem? T.T

pyreadstat.write_xport(domainDf, 'a.xpt', table_name=domain, file_label=studyOid, column_labels=headerLabels, file_format_version=5)

ofajardo commented 1 year ago

pyreadstat supports writing only in utf-8. Please transform all your strings in your dataframe to utf-8 before writing.

HongjigLee commented 1 year ago

Thank you for your reply. However, referring to what you said, I tried again after processing as below, but the Korean language still looks broken. Is there another reason? It's not easy because it's been a while since I used Python and Pandas ^^;;;;

domainDf["TESTCD"] = domainDf["TESTCD"].str.encode("utf-8")
domainDf["TESTCD"] = domainDf["TESTCD"].str.decode("utf-8", errors='strict')
pyreadstat.write_xport(domainDf, 'a.xpt', table_name=domain, file_label=studyOid, column_labels=headerLabels, file_format_version=5)

ofajardo commented 1 year ago

In that case, please file a complete error report: write down the exact error, fill in the technical details asked for in the template and, most importantly, provide a file and code to reproduce the error. If there is no way to reproduce it, I can't look at it.

HongjigLee commented 1 year ago

Thank you for your quick reply. I don't know how to report the sample code, but I think you can quickly know it with the source containing simple Korean characters below. You can check that the generated "korean.xpt" is broken differently from the Korean language shown in the source. ^^

import sys
import pandas as pd
import pyreadstat

""" Start """
col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print(df)

""" Check """
df["Nature"] = df["Nature"].str.encode("utf-8")
print(df)
df["Nature"] = df["Nature"].str.decode("utf-8", errors='strict')
print(df)

pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)

sys.exit("=== End ===")

[Screenshot: korean]

ofajardo commented 1 year ago

Please specify in what operating system you are working.

Please explain which program you are using to check the file and are showing in the screenshot. Have you checked what encoding is used there? It has to be utf-8.

Your code doesn't seem right because you do both the encoding and the decoding in utf-8. You should convert from the encoding you are using for Korean to utf-8, and leave it there.

Check this for example: https://stackoverflow.com/questions/6539881/python-converting-from-iso-8859-1-latin1-to-utf-8
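To illustrate the point about encoding and decoding both in utf-8, here is a minimal sketch using only the Python standard library. It assumes cp949 as the legacy Korean encoding (the thread names it only as a possibility) and simulates legacy bytes by encoding the text itself, since the thread contains no real legacy data.

```python
# Python 3 str values are Unicode text, not bytes in a particular encoding.
text = "대한민국"

# Encoding to UTF-8 and decoding again with UTF-8 returns the same string,
# so this pair of steps changes nothing.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)

# Transcoding only applies to raw bytes. cp949 is an assumed source encoding,
# simulated here by encoding the text ourselves.
legacy_bytes = text.encode("cp949")
decoded = legacy_bytes.decode("cp949")  # bytes -> str
utf8_bytes = decoded.encode("utf-8")    # str -> UTF-8 bytes

# The same four characters occupy a different number of bytes per encoding.
print(len(legacy_bytes), len(utf8_bytes))
```

If the values in a data frame are already Python str objects, as in the example above, there is nothing left to transcode on the Python side.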

ofajardo commented 1 year ago

Another one: write the xpt. Then read it again using Pyreadstat. What do you see? If it is correct, then it is not a Pyreadstat issue, but something in your downstream program.

HongjigLee commented 1 year ago

The OS in use is windows 10, and the program that showed korean.xpt is sasviewer.

import sys
import pandas as pd
import pyreadstat

""" Start """
col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print(df)

""" Check """
df["Nature"] = df["Nature"].str.encode("utf-8")
print(df)

pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)

sys.exit("=== End ===")

I tried omitting the decode part, but the result is still the same. The Nature part is shown as it is encoded (utf-8).

[Screenshot: korean]
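The observation that the column is "shown as it is encoded" is consistent with how Python handles the result of an encode step: the values stop being text and become bytes objects. The sketch below uses a plain list as a stand-in for the data frame column, and the helper `non_str_values` is a hypothetical name introduced for illustration. How write_xport itself treats bytes values is not stated in the thread.

```python
values = ["대한민국", "미국", "프랑스"]

# After encoding, each element is a bytes object rather than a str.
encoded = [v.encode("utf-8") for v in values]
print(type(encoded[0]))

# Printing a bytes object shows escaped byte values, not Hangul.
print(encoded[1])


def non_str_values(column):
    """Return the elements of a column that are not Python str objects."""
    return [v for v in column if not isinstance(v, str)]


# An empty result means every value is still text.
print(non_str_values(values))
print(len(non_str_values(encoded)))
```

A check like this before writing can confirm that a column still holds text rather than leftover bytes.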

ofajardo commented 1 year ago

I can't reproduce your issue with this code:

import pandas as pd
import pyreadstat

col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print("original")
print(df)

pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)

df, meta = pyreadstat.read_xport("korean.xpt")
print("saved")
print(df)

It looks good in Python:

[Screenshot]

And it looks good in SAS:

[Screenshot]

That means the problem is on your side and not in pyreadstat. Check two things: first, that you have set the appropriate encoding in that sasviewer program. Second, check what encoding you are using in your Python. Assuming it is using a Korean encoding, your code is still not correct; it should be something like this:

df["Nature"] = df["Nature"].str.decode("cp949").str.encode("utf-8")

Change cp949 if this is not your encoding.

HongjigLee commented 1 year ago

Thank you very much for your support. I will go through the information you gave me, check it item by item and test it. Thank you again. :-)

HongjigLee commented 1 year ago

I took a hint from what you said yesterday and checked the dataset (xpt) containing Hangul in SAS, and I can see the Hangul there. The program I had been using to check the xpt data is SAS Universal Viewer version 1.4.2.1420, and it does not display Korean. The pyreadstat you created worked just fine. Thank you again for your response. ^^
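A correct file can still look broken when a viewer reads its bytes with a different encoding than the one used to write them. The sketch below is only an illustration of that effect: the thread does not say which encoding SAS Universal Viewer assumes, so latin-1 is used here purely as an assumed example of a mismatched single-byte encoding.

```python
text = "서울"
utf8_bytes = text.encode("utf-8")

# A reader that assumes a single-byte encoding turns every byte into its own
# character, so the two Hangul syllables appear as six unrelated symbols.
as_latin1 = utf8_bytes.decode("latin-1")
print(len(text), len(as_latin1))

# The bytes themselves are intact: reading them as UTF-8 restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored == text)
```

This matches the outcome in the thread: the same file displays correctly in a reader that interprets it as UTF-8 and incorrectly in one that does not.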

ofajardo commented 1 year ago

Glad to see that your problem is solved! Just to check that I understand: do you mean that simply saving your dataframe to xpt with pyreadstat out of the box displayed correctly in SAS but not in sasviewer? Or did you have to apply any transformation to your dataframe before saving it to xpt? If you did, can you please share what worked? I guess it will be useful for others in the future.

HongjigLee commented 1 year ago

While converting data from a DB to create a dataset for SDTM submission, the xpt was created at the final stage, but I had no way to check it, so the problem came up while checking it with sasviewer. Pyreadstat helped a lot, and I was able to see the limitations of the xpt format itself. I know that CDISC is working on supporting Dataset-JSON, and I think that is an attempt to address the limitations of xpt. ^.^
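The thread does not say which limitations of xpt are meant. One that commonly matters for Korean text is that field lengths are counted in bytes, and each Hangul syllable takes three bytes in UTF-8. The sketch below measures UTF-8 byte lengths against a limit; the helper names are hypothetical, and `ASSUMED_MAX_BYTES = 200` is an assumption about version 5 character variables that should be checked against the format specification.

```python
def utf8_length(value: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def values_over_limit(values, max_bytes):
    """Return the strings whose UTF-8 encoding is longer than max_bytes."""
    return [v for v in values if utf8_length(v) > max_bytes]


# Assumed limit for illustration only; verify it for the target format version.
ASSUMED_MAX_BYTES = 200

capitals = ["서울", "워싱턴DC", "파리"]

# Byte lengths are larger than character counts for Hangul text.
print([(len(v), utf8_length(v)) for v in capitals])
print(values_over_limit(capitals, ASSUMED_MAX_BYTES))
```

Running a check like this before export shows whether any value would exceed a byte-based limit even when its character count looks small.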

ofajardo commented 1 year ago

OK, thanks for the feedback!