write_dta:A provided string value was longer than the available storage size of the specified column

Roche / pyreadstat

Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.

write_dta:A provided string value was longer than the available storage size of the specified column #268

Open shezhou opened 3 months ago

shezhou commented 3 months ago

arg1 = 'E:\test\single\dta\PRI_Basic.json'
arg3 = 'E:\test\single\dta\PRI_Basic.dta'
df = pd.read_json(arg1, dtype=dtype_dict, lines=True)
pyreadstat.write_dta(df, arg3)

The following error occurred:

pyreadstat._readstat_parser.ReadstatError: A provided string value was longer than the available storage size of the specified column

I looked through the issue history, and it seems this was only solved for the SAV format.

ofajardo commented 3 months ago

hi, thanks for the report. Please provide the data to reproduce the problem as indicated in the template.

shezhou commented 3 months ago

Thank you for your reply. Here is the data: PRI_Basic.json. You can ignore dtype=dtype_dict in df = pd.read_json(arg1, dtype=dtype_dict, lines=True).

ofajardo commented 2 months ago

In order to make the issue reproducible, please provide dtype_dict (and any other information necessary to reproduce it).

ofajardo commented 2 months ago

OK, it seems the issue is that one specific row contains a very long string (of length 1988). Right now pyreadstat writes it as the dta type str, whose maximum length is 2045 bytes (that means ~1020 Python characters). It seems there is a way to write the newer strL type, which can hold much longer strings (see here); I can see whether I can implement that in the future. For now, the workaround is to avoid writing such long strings; you could, for example, split them across multiple columns.
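The splitting workaround could be sketched roughly as follows. This is only an illustration, not part of pyreadstat: the helper names `split_utf8` and `split_column` are made up here, the 2045-byte budget is taken from the comment above, and measuring it in UTF-8 bytes is an assumption. The helpers expect plain Python strings, so missing values would need to be filled in first.

```python
# Assumption: 2045 bytes is the per-value limit quoted in the comment above,
# and the limit is counted in UTF-8 encoded bytes.
MAX_STR_BYTES = 2045


def split_utf8(text, max_bytes=MAX_STR_BYTES):
    """Split text into pieces whose UTF-8 encoding fits in max_bytes each.

    Pieces are cut on character boundaries, so a multi-byte character is
    never split in half.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (longest UTF-8 character)")
    pieces, current, used = [], [], 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > max_bytes:
            # The current piece is full: close it and start a new one.
            pieces.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += size
    if current or not pieces:
        # Keep the trailing piece; an empty input yields one empty piece.
        pieces.append("".join(current))
    return pieces


def split_column(values, name, max_bytes=MAX_STR_BYTES):
    """Turn one list of strings into several equally long lists.

    Returns a dict such as {"name_1": [...], "name_2": [...]}, padded with
    empty strings so that every list has one entry per input value.
    """
    chunks = [split_utf8(v, max_bytes) for v in values]
    width = max((len(c) for c in chunks), default=1)
    return {
        f"{name}_{i + 1}": [c[i] if i < len(c) else "" for c in chunks]
        for i in range(width)
    }
```

With a pandas DataFrame, the result could be assigned back column by column, for example `for new_name, col in split_column(df["note"].tolist(), "note").items(): df[new_name] = col`, after which the original column is dropped before calling `pyreadstat.write_dta`. The column name `note` is hypothetical.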