Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Add support for Chinese encodings (e.g. GBK) when reading or writing with fread or fwrite #6148

Open anticmason opened 3 months ago

anticmason commented 3 months ago

Hi, I'm a heavy user of this package from China. Would it be possible to add an option to the encoding argument of fread and fwrite for Chinese encodings? At the moment they can't handle files in a Chinese encoding correctly, unlike pandas's read_csv or read_excel. Thanks a lot! Looking forward to your reply~

tdhock commented 3 months ago

it may be possible, but can you please provide a minimal reproducible example? ideally something like

fread(text="some chinese characters in gbk encoding")

or upload a small file (1 or 2 rows) which represents your issue. Also can you please explain what behavior you get, and what behavior you expected? you wrote "fread or fwrite function,which couldn't deal with Chinese encoding file correctly" -- what does it mean to deal with it correctly? or not?

anticmason commented 3 months ago

Hi, I've attached the file. When I use fread to read the data, the text comes out garbled: the file isn't decoded correctly, and the encoding argument has no option to deal with it. However, when I use pandas's read_csv, like pd.read_csv('file path', encoding='gbk'), it works well. Thanks~ APP原生分区域点击日报表_20240523_1716426197475.csv

tdhock commented 3 months ago

I don't get any error from fread, what do you get? what did you expect to get?

> data.table::fread("~/Downloads/APP._20240523_1716426197475.csv")
   ͳ\xbc\xc6\xc8\xd5\xc6\xda \xb5\xe3\xbb\xf7ʡ\xb7ݱ\xe0\xc2\xeb
                       <int>                              <int>
1:                  20240519                                210
2:                  20240519                                210
   \xb5\xe3\xbb\xf7ʡ\xb7\xdd\xc3\xfb\xb3\xc6                   Ƶ\xb5\xc0
                                      <char>                      <char>
1:                                 \xc9Ϻ\xa3 \xd3\xe0\xc1\xbf\xb2\xe9ѯH5
2:                                 \xc9Ϻ\xa3 \xd3\xe0\xc1\xbf\xb2\xe9ѯH5
                                                                     \xc7\xf8\xd3\xf2
                                                                               <char>
1:                                         \xbfͷ\xfe\xd2\xfd\xb5\xbc\xb8\xa1\xcc\xf52
2: \xc1\xf7\xc1\xbf\xc3\xf7ϸ\xc7\xf8-\xb9\xfa\xc4\xda\xc6\xe4\xcb\xfb\xc1\xf7\xc1\xbf
             \xb9\xe3\xb8\xe6λ
                        <char>
1:               \xb9رհ\xb4ť-1
2: \xc1˽\xe2\xb8\xfc\xb6\xe0-1
                                                                                                                   \xb9\xe3\xb8\xe6λȫ\xb3\xc6
                                                                                                                                       <char>
1:                                                       \xd3\xe0\xc1\xbf\xb2\xe9ѯH5_\xbfͷ\xfe\xd2\xfd\xb5\xbc\xb8\xa1\xcc\xf52_\xb9رհ\xb4ť-1
2: \xd3\xe0\xc1\xbf\xb2\xe9ѯH5_\xc1\xf7\xc1\xbf\xc3\xf7ϸ\xc7\xf8-\xb9\xfa\xc4\xda\xc6\xe4\xcb\xfb\xc1\xf7\xc1\xbf_\xc1˽\xe2\xb8\xfc\xb6\xe0-1
   \xb9\xe3\xb8\xe6λ\xc4\xda\xc8\xdd \xb4\xa5\xb5\xe3\xb1\xe0\xc2\xeb
                              <char>                           <char>
1:         ȫ\xb2\xbf\xc4\xda\xc8\xdd        ȫ\xb2\xbf\xc4\xda\xc8\xdd
2:         ȫ\xb2\xbf\xc4\xda\xc8\xdd        ȫ\xb2\xbf\xc4\xda\xc8\xdd
   ҵ\xce\xf1\xb7\xd6\xc0\xe0          Ӫ\xcf\xfa\xb7\xbdʽ
                      <lgcl>                      <char>
1:                        NA ȫ\xb2\xbfӪ\xcf\xfa\xb7\xbdʽ
2:                        NA ȫ\xb2\xbfӪ\xcf\xfa\xb7\xbdʽ
   \xc9ϼ\xdc\xc8\xd5\xc6\xda \xc9ϼ\xdcʱ\xbc\xe4 \xcf¼\xdcʱ\xbc\xe4
                      <char>             <char>             <char>
1:        ȫ\xb2\xbfʱ\xbc\xe4 ȫ\xb2\xbfʱ\xbc\xe4 ȫ\xb2\xbfʱ\xbc\xe4
2:        ȫ\xb2\xbfʱ\xbc\xe4 ȫ\xb2\xbfʱ\xbc\xe4 ȫ\xb2\xbfʱ\xbc\xe4
          \xb9\xa4\xb5\xa5ID \xb7\xa2\xb2\xbcʡ\xc3\xfb\xb3\xc6    PV    UV
                      <char>                            <char> <int> <int>
1: ȫ\xb2\xbf\xb9\xa4\xb5\xa5                ȫ\xb2\xbfʡ\xb7\xdd   789   689
2: ȫ\xb2\xbf\xb9\xa4\xb5\xa5                ȫ\xb2\xbfʡ\xb7\xdd  5580  5219
   \xd3û\xa7\xca\xfd \xc0\xb8Ŀ\xc8յ\xe3\xbb\xf7\xc2\xca
               <int>                             <char>
1:               692                                 --
2:              4466                                 --
   \xc0\xb8Ŀ\xc8\xd5\xc9\xf8\u0378\xc2\xca
                                    <char>
1:                                      --
2:                                      --
> 
anticmason commented 3 months ago

Hi,

This output is obviously garbled, since the encoding argument of fread doesn't support GBK. The correct result is attached; please check~

tdhock commented 3 months ago

hi, sorry, but I do not see your attachment. can you please try posting on github instead of email? also it is more useful to see code as text instead of screenshot/image, if that is possible.

anticmason commented 3 months ago

Hi, here is the attachment with the correct result from pandas. (attached image: correct result)

ben-schwen commented 3 months ago

FWIW you can read the file via

f = file("~/Downloads/APP._20240523_1716426197475.csv", encoding="gbk")
readLines(f)

So if we ever support reading via a connection (#561) this would be for free.
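As a minimal sketch of how the lines read this way could be turned into a data.table today, the character vector can be passed to fread(text=). The demo file and its contents are made up for illustration, and the sketch assumes the R session's native encoding is UTF-8; it also holds the whole file in memory as strings, so it may not suit very large files.

```r
library(data.table)

# Demo input: a tiny GBK-encoded CSV (contents are made up for illustration).
path <- tempfile(fileext = ".csv")
csv_utf8 <- "统计日期,频道\n20240519,上海\n"
writeBin(iconv(csv_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]], path)

# Let the connection re-encode GBK to the session's native encoding
# (assumed here to be UTF-8), then hand the lines to fread().
con <- file(path, encoding = "GBK")
lines <- readLines(con)
close(con)

dt <- fread(text = lines, encoding = "UTF-8")
print(dt)
```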

I also just wrote a small POC where we arrive at

fread("~/Downloads/APP._20240523_1716426197475.csv", header=FALSE)
         V1           V2           V3         V4            V5         V6
     <char>       <char>       <char>     <char>        <char>     <char>
1: 统计日期 点击省份编码 点击省份名称       频道          区域     广告位
2: 20240519          210         上海 余量查询H5 客服引导浮条2 关闭按钮-1
                                    V7         V8       V9      V10
                                <char>     <char>   <char>   <char>
1:                          广告位全称 广告位内容 触点编码 业务分类
2: 余量查询H5_客服引导浮条2_关闭按钮-1   全部内容 全部内容         
            V11      V12      V13      V14      V15        V16    V17    V18
         <char>   <char>   <char>   <char>   <char>     <char> <char> <char>
1:     营销方式 上架日期 上架时间 下架时间   工单ID 发布省名称     PV     UV
2: 全部营销方式 全部时间 全部时间 全部时间 全部工单   全部省份    789    689
      V19          V20          V21
   <char>       <char>       <char>
1: 用户数 栏目日点击率 栏目日渗透率
2:    692           --           --
Warning message:
In fread("~/Downloads/APP._20240523_1716426197475.csv",  : Discarded single-line footer: <<20240519,210,上海,余量查询H5,流量明细区-国内其他�>>

I just converted GBK to UTF-8 using iconv, but this would also work for other encodings. The main design question is how we would control this: an additional argument to fread?
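The POC code itself is not shown in this thread. As a rough user-level stand-in for the same idea, the sketch below re-encodes the file with base R's iconv() before parsing. fread_encoded and its from parameter are hypothetical names, not part of data.table, and the demo file contents are made up.

```r
library(data.table)

# Hypothetical helper (not part of data.table): re-encode a file to UTF-8
# in memory with base R's iconv(), then parse the result with fread().
fread_encoded <- function(path, from, ...) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  utf8_text <- iconv(list(raw_bytes), from = from, to = "UTF-8")
  if (is.na(utf8_text)) stop("input could not be converted from ", from)
  fread(text = utf8_text, encoding = "UTF-8", ...)
}

# Demo input: a tiny GBK-encoded CSV (contents are made up for illustration).
gbk_path <- tempfile(fileext = ".csv")
csv_utf8 <- "统计日期,频道\n20240519,上海\n20240520,北京\n"
writeBin(iconv(csv_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]], gbk_path)

dt <- fread_encoded(gbk_path, from = "GBK")
print(dt)
```

Because the source encoding is just a string handed to iconv(), the same wrapper would cover other encodings that the platform's iconv supports. It reads the whole file into memory first, so it is only an illustration of the interface, not a replacement for native support.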

anticmason commented 3 months ago

Hi,

readLines(f) just returns a character vector, which can't be converted to a data.frame or data.table directly. Moreover, the data I deal with is often very large (from several million to ten million rows). In R, only fread can handle that amount of data (arrow::open_dataset can too, but it also has problems reading and writing GBK).

In addition, did you notice the warning message?

Warning message: In fread("/mnt/c/Users/BNS/Downloads/APP._20240523_1716426197475.csv", : Discarded single-line footer: <<20240519,210,上海,余量查询H5,流量明细区-国内其他�>>

The data has two rows in total, but only one was returned; the other was discarded by fread. I remember that when I merge several datasets into one with purrr::map_dfr(list.files('file-path'), fread), I get either a warning that several lines of data were discarded or a message telling me to set fill=TRUE (but when I did, an error was thrown saying the data types were inconsistent with each other). With purrr::map_dfr(list.files('file-path'), read.csv) it works fine, although it is quite slow. This could be another problem to fix.

In summary:

  1. Support GBK encoding in fread and fwrite;

  2. Fix the bug where merging several datasets with fread discards lines or raises errors.

Thanks a lot~
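For the second point, here is a minimal sketch of combining several CSV files with data.table alone. It assumes the type error arises when the per-file tables are row-bound rather than inside fread itself, which the thread does not confirm. The demo directory and file contents are made up. Reading every column as character is a conservative choice that sidesteps type conflicts; columns can be converted afterwards.

```r
library(data.table)

# Demo input: two CSVs whose columns and column types differ (made-up data).
csv_dir <- tempfile("csvs")
dir.create(csv_dir)
fwrite(data.table(id = 1:2, rate = c("--", "--")), file.path(csv_dir, "a.csv"))
fwrite(data.table(id = 3:4, rate = c(0.5, 0.7), extra = c("x", "y")),
       file.path(csv_dir, "b.csv"))

# full.names = TRUE returns usable paths rather than bare file names.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file with all columns as character, then bind by column name,
# filling columns that are missing in some files with NA.
tables <- lapply(files, fread, colClasses = "character")
merged <- rbindlist(tables, use.names = TRUE, fill = TRUE)
print(merged)
```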
