JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Feather.write mangles strings when they contain certain non-ASCII characters #91

Closed: onetonfoot closed this issue 6 years ago

onetonfoot commented 6 years ago

I've got a data frame with some strings that contain Chinese characters and possibly other non-ASCII characters. When I write it to Feather and then read it back, the strings come back mangled. Here is an example of four strings as written:

 "C++ Algo Strategy Developer"                                                                        
 "Reporting Lead, Retail Operating Solutions, Asia Pacific (1-year contract)"                         
 "QA Senior Officer or Officer"                                                                       
 "Software Specialist"    

After reading back:

"er / IT Trainee (Fresh welc"                                                                       
 "ome)Quality Assurance - Team Lead (APAC)BA – Business Process Re-enginee"                          
 "ring (Ref: BPR/AU)System Ana"                                                                      
 "lystDeveloper - Jav"    

The problem seems to go away if I filter the text down to ASCII characters only.
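The report does not show how the filtering was done. As a rough illustration of the idea only, here is a minimal sketch in Python (not Julia, and not taken from the thread); the helper name `keep_ascii` is hypothetical. Note that this workaround is lossy: every non-ASCII character is silently dropped.

```python
def keep_ascii(text):
    """Return `text` with every character outside 7-bit ASCII removed."""
    return "".join(ch for ch in text if ord(ch) < 128)


# The en dash (U+2013) is removed, leaving two adjacent spaces.
cleaned = keep_ascii("BA – Business Process")
print(repr(cleaned))  # 'BA  Business Process'
```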

ExpandingMan commented 6 years ago

Hm, well, this is embarrassing. It looks like we are indeed mishandling UTF-8 characters that go beyond the first 8 bits. This will require a fix to Arrow.jl; I'll try to do it today.
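The thread does not spell out the exact bug. One plausible mechanism that reproduces the symptom in the report (strings read back as shifted slices of their neighbours) is measuring string lengths in characters where the storage format needs bytes. The sketch below is an assumption for illustration, written in Python rather than Julia, and is not Arrow.jl's actual code; `pack` and `unpack` are hypothetical helpers that mimic a flat UTF-8 buffer plus an offsets list.

```python
# Illustrative only: assumes a "character count used as byte count" bug.

def pack(strings, length_of):
    """Concatenate strings into one UTF-8 buffer and build an offsets list.

    `length_of` decides how each string's length is measured.
    """
    data = b"".join(s.encode("utf-8") for s in strings)
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + length_of(s))
    return data, offsets


def unpack(data, offsets):
    """Slice the buffer back into strings using consecutive offsets."""
    return [
        data[start:stop].decode("utf-8", errors="replace")
        for start, stop in zip(offsets, offsets[1:])
    ]


def byte_length(s):
    return len(s.encode("utf-8"))


# The en dash takes 3 bytes in UTF-8 but counts as 1 character.
strings = ["BA – Business", "System Analyst", "Developer"]

good = unpack(*pack(strings, byte_length))  # offsets measured in bytes
bad = unpack(*pack(strings, len))           # offsets measured in characters

print(good)  # ['BA – Business', 'System Analyst', 'Developer']
print(bad)   # ['BA – Busine', 'ssSystem Analy', 'stDevelop']
```

With character-based offsets, every string after the first multi-byte character starts too early and ends too early, which is why pure-ASCII data round-trips cleanly while a single non-ASCII character shifts everything that follows.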

ExpandingMan commented 6 years ago

This should be fixed by Arrow v0.2.2, so please try again once the Arrow tag has merged to METADATA.

Sadly, the Feather files you wrote before the Arrow v0.2.2 patch are corrupted (as I'm sure you already knew). Sorry about that.

onetonfoot commented 6 years ago

No worries, thanks for the quick fix.