TheDataStation / ver

Data Discovery Tools and Systems
MIT License

Duplicate text profiles from ddprofiler #76

Open snowgy opened 4 months ago

snowgy commented 4 months ago

The text profiles produced by ddprofiler contain duplicate column profiles, so dindex_builder spends extra time and disk space building the full-text search index.

To reproduce this issue:

  1. Download the Chicago open data: https://uchicago.box.com/s/ecmb69h874qwedj19ebncvu0qvd4n97h
  2. Follow the quick start guide to index the data
  3. Check the files in output_profiles_json/text

For example, in 0.csv, the month_name column of x2vd-qke7.csv is profiled twice:

"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
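To check how widespread the duplication is, something like the following sketch could count repeated rows in a text profile file. It assumes the files are headerless CSV with the six quoted fields shown above (id, database name, path, source file, column name, text). The function name `find_duplicate_profiles`, the shortened `/data/` path, and the second sample row are made up for illustration and are not part of ver.

```python
import csv
import io
from collections import Counter

def find_duplicate_profiles(csv_text):
    """Return {row_tuple: count} for every row that occurs more than once.

    Assumes headerless CSV rows shaped like the example above.
    """
    reader = csv.reader(io.StringIO(csv_text))
    counts = Counter(tuple(row) for row in reader if row)
    return {row: n for row, n in counts.items() if n > 1}

# Illustrative sample: the first two rows mimic the duplicate from 0.csv
# (with a hypothetical shortened path); the third row is invented.
sample = (
    '"1507119095","demo","/data/","x2vd-qke7.csv","month_name","JUNE MAY"\n'
    '"1507119095","demo","/data/","x2vd-qke7.csv","month_name","JUNE MAY"\n'
    '"42","demo","/data/","other.csv","city","CHICAGO"\n'
)

duplicates = find_duplicate_profiles(sample)
for row, n in duplicates.items():
    # row[3] is the source file, row[4] the column name
    print(f"{row[3]}:{row[4]} appears {n} times")
```

To run it against real output, read each file under output_profiles_json/text and pass its contents to the function.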

Since dindex_builder reads the text profile to build the full-text-search index, duplicates here will lead to extra indexing time and space.
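As a stopgap until the root cause in ddprofiler is found, the text profiles could be deduplicated before dindex_builder reads them. The sketch below is a possible workaround, not a fix to ddprofiler itself; it assumes the same headerless, fully quoted CSV layout as the example above, and the helper names `dedupe_profile_rows` and `dedupe_csv_text` are hypothetical.

```python
import csv
import io

def dedupe_profile_rows(rows):
    """Yield each distinct row once, preserving first-seen order."""
    seen = set()
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            yield row

def dedupe_csv_text(csv_text):
    """Return csv_text with exact duplicate rows removed.

    Output is written with every field quoted, matching the
    style of the example row in this issue.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(dedupe_profile_rows(r for r in reader if r))
    return out.getvalue()
```

This only removes rows that are identical in every field; if ddprofiler ever emits near-duplicates (for example, the same column with a different id), those would need a different key.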