SebKrantz / collapse

Advanced and Fast Data Transformation in R
https://sebkrantz.github.io/collapse/

Add Chinese support to the collapse package #579

Open anticmason opened 1 month ago

anticmason commented 1 month ago

Hi, I'm a user of your collapse package from China. Recently, while using it to improve my work efficiency, I found that it doesn't support Chinese very well, especially when the file being processed has Chinese headers or fields. This affects some very frequently used functions, such as funique, fsubset, collag, roworder(v), fgroup_by, join, and pivot. I suspect more functions like those listed above will also throw an error or return no result. Since I'm a heavy user of this package, would it be possible to fix this bug?

Moreover, could you please write a function to read and write xlsx/csv files with an encoding parameter that can be set to 'utf-8', 'gbk', etc., like pandas's read_csv and read_excel? (The data.table package doesn't support 'gbk' as a value for the encoding parameter when reading or writing.) Thanks a lot! Looking forward to your reply.

SebKrantz commented 1 month ago

Hi, so in general, this package is UTF-8 only. I think supporting other character encodings would require checking the encoding of every string (since character vectors can be heterogeneous), which would really slow things down. I'm also really not sure where to start here and would possibly need help from people who understand more about Chinese and string encoding in C.

Regarding Excel, at the moment I don't plan to create file readers/writers. The package is already quite large.
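To make the encoding problem concrete, here is a minimal sketch in Python (not R, and not a description of how collapse works internally). It only shows that the same Chinese text is stored as different bytes under UTF-8 and GBK, so code that assumes UTF-8 cannot treat GBK bytes as equivalent. The helper name `is_valid_utf8` is an illustrative assumption.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "中文"
utf8_bytes = text.encode("utf-8")  # three bytes per character here
gbk_bytes = text.encode("gbk")     # two bytes per character here

# Same text, different byte sequences of different lengths.
print(len(utf8_bytes), len(gbk_bytes))
print(utf8_bytes == gbk_bytes)

# The GBK bytes are not valid UTF-8, so a UTF-8-only reader cannot use them.
print(is_valid_utf8(gbk_bytes))

# Decoding with the correct encoding recovers the original text.
print(gbk_bytes.decode("gbk") == text)
```

If the strings in one vector could each use a different encoding, a check like this would be needed per string, which is the cost the comment above is concerned about.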

anticmason commented 1 month ago

Hi,

Regarding support for other character encodings such as those used for Chinese, you may be able to get help from the dplyr package, which deals with it well.


SebKrantz commented 1 month ago

Thanks, dplyr is written in R so won't be of much help. I will need to look at C-based packages such as data.table. How does it handle this?

In general, I'm thinking it may very well be possible to go beyond UTF-8 in a performance-friendly way by assuming that string vectors are homogeneous.

Could you perhaps provide a set of reproducible examples (using reprex::reprex()) of the different ways collapse currently fails? That would greatly help me test any internal improvements towards that end.

anticmason commented 1 month ago

Hi,

In my work I always use pandas and numpy in Python, and dplyr, data.table, and collapse in R.

pandas deals with it well (it is written in Python and C?). Its collection of read functions always has an encoding argument that supports 'gbk' and 'utf-8'.

data.table currently has a problem with reading, since its encoding option does not support 'gbk'. I've already written some emails to the author and am waiting for a reply. But if I handle the 'gbk' encoding beforehand using Notepad++ or pandas, it works well.

When I encounter a file encoded in 'gbk', I usually process it with Notepad++ or pandas first, then replace the Chinese field names with English names, and after that the collapse package handles it well.

Convenience and performance are both important.
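One possible reading of the pre-conversion workaround described above is to re-encode the GBK file as UTF-8 before loading it into R. A minimal sketch using only the Python standard library follows. The function name `convert_to_utf8`, the `demo` helper, and the file names are hypothetical, and the sketch assumes the source file really is GBK-encoded.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file from src_encoding to UTF-8 (hypothetical helper)."""
    # newline="" leaves line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def demo():
    """Write a small GBK-encoded CSV, convert it, and return the UTF-8 text."""
    content = "姓名,城市\n张三,北京\n"
    with tempfile.TemporaryDirectory() as tmp:
        gbk_path = os.path.join(tmp, "input_gbk.csv")
        utf8_path = os.path.join(tmp, "output_utf8.csv")
        # Create the mock input as raw GBK bytes.
        with open(gbk_path, "wb") as f:
            f.write(content.encode("gbk"))
        convert_to_utf8(gbk_path, utf8_path)
        # Read the result back as raw bytes and decode as UTF-8 to verify.
        with open(utf8_path, "rb") as f:
            return f.read().decode("utf-8")


print(demo() == "姓名,城市\n张三,北京\n")
```

The converted file can then be read by tools that expect UTF-8. Whether collapse then handles the Chinese column names without renaming them is exactly what the thread leaves open.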


SebKrantz commented 1 month ago

Thanks. Python packages are not that useful since R has its own C API. Going forward, it would be helpful if you could indeed provide some reprexes using simply your hand-typed Chinese characters (mock data frames) and demonstrating how collapse currently fails. Then I'll see what can be done.