Closed HongjigLee closed 1 year ago
pyreadstat supports writing only in utf-8. Please transform all your strings in your dataframe to utf-8 before writing.
Thank you for your reply. However, referring to what you said, I tried again after processing as below, but the Korean language still looks broken. Is there another reason? It's not easy because it's been a while since I used Python and Pandas ^^;;;;
domainDf["TESTCD"] = domainDf["TESTCD"].str.encode("utf-8")
domainDf["TESTCD"] = domainDf["TESTCD"].str.decode("utf-8", errors='strict')
pyreadstat.write_xport(domainDf, 'a.xpt', table_name=domain, file_label=studyOid, column_labels=headerLabels, file_format_version=5)
In that case, please file a complete error report: write down the exact error, fill in the technical details asked for in the template and, most importantly, provide a file and code to reproduce the error. If there is no way to reproduce it, I can't look at it.
import sys
import pandas as pd
import pyreadstat

# Start
col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print(df)

# Check
df["Nature"] = df["Nature"].str.encode("utf-8")
print(df)
df["Nature"] = df["Nature"].str.decode("utf-8", errors='strict')
print(df)

pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)

sys.exit("=== End ===")
Please specify in what operating system you are working.
Please explain which program you are using to check the file and showing in the screenshot. Have you checked what encoding is used there? It has to be utf-8.
Your code doesn't seem right because you do both the encoding and the decoding in utf-8. You should translate from the Korean encoding you are using to utf-8, and leave it there.
Check this for example: https://stackoverflow.com/questions/6539881/python-converting-from-iso-8859-1-latin1-to-utf-8
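To illustrate the point about encoding and decoding with the same codec, here is a minimal sketch in plain Python (no pyreadstat involved). It assumes cp949 as the legacy Korean encoding, which is only an assumption for illustration; the variable names are hypothetical.

```python
# Illustration only: plain Python strings and bytes.
text = "대한민국"

# Encoding and then decoding with the same codec gives back the original
# string, so the pair of calls changes nothing.
round_trip = text.encode("utf-8").decode("utf-8")
same = round_trip == text

# Transcoding only matters when the starting point is bytes in another
# encoding. cp949 is assumed here as the legacy Korean encoding.
legacy_bytes = text.encode("cp949")       # what a cp949 source would hold
recovered = legacy_bytes.decode("cp949")  # bytes -> Python str (Unicode)
utf8_bytes = recovered.encode("utf-8")    # str -> utf-8 bytes

print(same, recovered == text)
print(len(legacy_bytes), len(utf8_bytes))  # 8 12
```

The byte lengths differ because cp949 stores each of these Hangul syllables in two bytes while utf-8 uses three.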
Another check: write the xpt, then read it back using pyreadstat. What do you see? If it is correct, then it is not a pyreadstat issue, but something in your downstream program.
The OS in use is Windows 10, and the program that displayed korean.xpt is sasviewer.
import sys
import pandas as pd
import pyreadstat

# Start
col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print(df)

# Check
df["Nature"] = df["Nature"].str.encode("utf-8")
print(df)

pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)

sys.exit("=== End ===")

I tried omitting the decode part, but the result is still the same. The Nature column is displayed in its encoded (utf-8) form.
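A likely reason the column appears "as encoded" is that encoding a string produces a bytes object rather than text, so the column no longer holds strings at all. The sketch below shows this with plain Python values; how any particular writer or viewer handles a bytes column is not covered here.

```python
# Illustration only: str versus bytes in Python 3.
value = "서울"
encoded = value.encode("utf-8")

print(type(value).__name__)    # str
print(type(encoded).__name__)  # bytes

# The bytes object prints as escaped byte values, not as Hangul.
print(encoded)

# Decoding restores the original text.
restored = encoded.decode("utf-8")
print(restored == value)
```

Since Python 3 strings are already Unicode, a column of ordinary str values does not need this step before writing.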
I can't reproduce your issue with this code
import pandas as pd
import pyreadstat
col = ['Nature', 'capital']
row = ['row1', 'row2', 'row3']
data = [['대한민국', '서울'], ['미국', '워싱턴DC'], ['프랑스', '파리']]
df = pd.DataFrame(data, row, col)
print("original")
print(df)
pyreadstat.write_xport(df, 'korean.xpt', table_name="world", file_label="Capital", file_format_version=5)
df, meta = pyreadstat.read_xport("korean.xpt")
print("saved")
print(df)
it looks good in python
and it looks good in SAS
That means the problem is on your side and not in pyreadstat. Check two things: first, that you have set the appropriate encoding in that sasviewer program. Second, check what encoding you are using in your Python. Assuming it is using a Korean encoding, your code is still not correct; it should be something like this:
df["Nature"] = df["Nature"].str.decode("cp949")  # only if the column holds cp949-encoded bytes; the resulting str is written as utf-8
change cp949 in case this is not your encoding.
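As a rough way to reason about both checks, the sketch below tests whether some bytes decode under a given encoding and simulates a reader that assumes the wrong one. The helper name `decodes_cleanly` is hypothetical, and cp949 is again only an assumed example of a Korean encoding.

```python
def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if raw can be decoded with encoding without errors."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


text = "대한민국"
utf8_bytes = text.encode("utf-8")
cp949_bytes = text.encode("cp949")

# utf-8 bytes decode as utf-8; the cp949 bytes for this text do not.
print(decodes_cleanly(utf8_bytes, "utf-8"))   # True
print(decodes_cleanly(cp949_bytes, "utf-8"))  # False

# A reader that assumes the wrong encoding shows garbled text even though
# the stored bytes are correct.
misread = utf8_bytes.decode("cp949", errors="replace")
print(misread == text)  # False
```

A clean decode is only a hint, not proof, since some byte sequences are valid in more than one encoding.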
Thank you very much for your support. I will compile the information you gave me and check it one by one and test it. thank you again. :-)
I took a hint from what you said yesterday and checked the dataset (xpt) containing Hangul in SAS, and there I can see the Hangul. The program I had used to check the xpt data is SAS Universal Viewer version 1.4.2.1420, and it does not show Korean. The pyreadstat you created worked just fine. Thank you again for your response. ^^
Glad to see that your problem is solved! Just to see if I understand: what you mean is that just saving your dataframe to xpt with pyreadstat out of the box was correct when visualizing it in SAS but not in sasviewer? Or did you have to do any transformation to your dataframe before saving it to xpt? If you did any transformation can you please share what did work? I guess it will be useful for others in the future.
I was converting data from a DB to create a dataset for SDTM submission. The xpt was created at the final stage, but I had no way to check it, so the problem came up while I was checking it with sasviewer. Pyreadstat helped a lot, and I was able to see the limitations of the xpt format itself. I know that CDISC is working on supporting Dataset-JSON, and I think that is meant to address the limitations of xpt. ^.^
OK, thanks for the feedback!
I'm making good use of the program you made, thank you. However, I have a problem. When using write_xport, if the pandas DataFrame contains Korean, the Korean text comes out broken. How do I solve this problem? T.T
pyreadstat.write_xport(domainDf, 'a.xpt', table_name=domain, file_label=studyOid, column_labels=headerLabels, file_format_version=5)