The text profiles produced by ddprofiler contain duplicate column profiles, so dindex_builder takes extra time and disk space to build the full-text search index.
For example, in 0.csv, the profile for the month_name column of x2vd-qke7.csv appears twice:
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
Since dindex_builder reads the text profiles to build the full-text search index, each duplicate row costs extra indexing time and index space.
Reproduce this issue: check the text profiles in output_profiles_json/text.
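One way to confirm the duplication is to count identical rows in a text profile file. The following is a minimal sketch, not part of ddprofiler or dindex_builder: it assumes each profile is a header-less CSV row with six quoted fields, as in the sample above, and `find_duplicate_rows` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(lines):
    """Return {row: count} for every CSV row that occurs more than once.

    `lines` is any iterable of CSV text lines, such as an open file.
    Rows are compared as whole tuples, so two rows count as duplicates
    only when every field matches.
    """
    counts = Counter(tuple(row) for row in csv.reader(lines) if row)
    return {row: n for row, n in counts.items() if n > 1}


# The duplicated row reported above, copied verbatim from 0.csv.
SAMPLE_ROW = (
    '"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/",'
    '"x2vd-qke7.csv","month_name",'
    '"JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"\n'
)

# Two copies of the row stand in for the contents of 0.csv.
duplicates = find_duplicate_rows(io.StringIO(SAMPLE_ROW * 2))
for row, n in duplicates.items():
    # row[3] is the file name and row[4] the column name in the sample layout.
    print(f"{row[3]} / {row[4]}: {n} copies")
```

To check a real profile, pass an open file instead of the in-memory sample, for example `open("output_profiles_json/text/0.csv", newline="")`.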