apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

A CSV parser improvement idea #32192

Open asfimport opened 2 years ago

asfimport commented 2 years ago

While running a CSV reading test (reading from a big file with more than 200 columns while needing only four of them), I found that the CSV parser accounts for most of the execution time.

[Attachment: 20220621-174727.png — profiling screenshot]

I went through the ParseLine function and found that Arrow parses all columns of a row even when only 4 columns are wanted. I think it would be a great improvement if Arrow added an including_column option to parser_option.

I want to ask whether this idea would work, or whether it was left out for some reason. Thanks in advance.

Environment:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
Stepping: 7
CPU MHz: 1000.000
CPU max MHz: 2301.0000
CPU min MHz: 1000.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d arch_capabilities

Reporter: youngfn


Note: This issue was originally created as ARROW-16867. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Can you explain how you can parse some CSV columns without parsing all of them?

asfimport commented 2 years ago

youngfn: @pitrou My naive change now looks like this:

1. MakeConversionSchema(): save the including_column indices and remap the index values from absolute positions to 0, 1, 2, 3, ... [Attachment: 20220622-11065.png]

2. Then pass the column_indexes (I use a set here) to BlockParserImpl through its constructor, so that in the ParseLine function I can skip the columns I don't need by looking them up in column_indexes.

The result looks good in my test: it cut almost 1 s off the run time, from 3.8 s down to 2.8 s, but I'm not sure whether this change will hurt other features.

[Attachment: 20220622-111516.png]

asfimport commented 2 years ago

Weston Pace / @westonpace:

"then pass the column_indexes (I use a set here) to BlockParserImpl through its constructor, so that in the ParseLine function I can skip the columns I don't need by looking them up in column_indexes"

Can you expand a bit more on this part (or provide a PR)? The parser's job is to figure out where all the delimiters are. It doesn't know ahead of time how many characters are in each field. So, for example, if we are only reading the field at index 3 then we don't know how many characters are used by indices 0-2 and there is no way to simply skip that part.

asfimport commented 2 years ago

Yibo Cai / @cyb70289: Agree with Weston that a PR or demo code would be helpful. Maybe we can skip copying unwanted fields to the output buffer, but we must still scan and parse the whole CSV data buffer field by field.

asfimport commented 2 years ago

youngfn: Sorry for my unclear explanation. Yes, both of you (@westonpace @cyb70289) are right: we can't get rid of scanning. Instead, I just don't push every character of the whole CSV to the output writer, exactly as Yibo Cai said.

I would love to provide a PR, but my implementation is still very rough and I want to do more testing first. So let me show you some demo code of my change. The key part of my implementation in ParseLine (csv/parser.cc) looks like this:

This function scans each row; once it reaches a field I want, it pushes that field to the writer, and once it has collected all the wanted fields it returns early. So the worst case still scans the whole table, but most cases benefit from the early return.

[Attachment: 20220628-200449.png — demo code screenshot]