HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

unable to process large data set of Food inspections. #96

Closed nuarc closed 4 years ago

nuarc commented 4 years ago

Hi, I am facing a few issues executing HoloClean over the Chicago Food Inspections dataset described in the paper. A machine with 32 GB RAM and a 100 GB SSD is unable to process it; I am noticing memory leaks and out-of-memory query results:

  1. Loading the large query result into a pandas DataFrame fails:

```
14:12:20 - [ERROR] - generating aux_table pos_values
Traceback (most recent call last):
  File "/home/ubuntu/hc36/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
    cursor, statement, parameters, context
  File "/home/ubuntu/hc36/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)
psycopg2.DatabaseError: out of memory for query result
```

  2. Since the above data frame was not used in subsequent operations, I moved forward by simply bypassing that code block. However, after 4 hours of execution, the log just shows `Killed`:

```
21:09:36 - [DEBUG] - Time to execute query: 0.00 secs
21:09:36 - [DEBUG] - featurizing training data...
21:09:43 - [DEBUG] - Time to execute query: 4.46 secs
Killed
```

It seems `tensor.cat` is blowing up memory here. Running a sample on the hospital dataset with little data and performing memory analysis, the tensors' footprint is 228 MB, which grows to 1.4 GB when all tensors are combined. This does not scale well if we need to process a larger dataset of 2-4 GB within a day. Any suggestions? (see attached screenshot)

  3. Also, we noticed that an archived version was compatible with Spark. Is there a specific reason for moving away from Spark? We are planning to use Spark to manage huge datasets with HoloClean. Any suggestions?
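A note on the first error: `out of memory for query result` is raised by psycopg2 when the client tries to buffer the entire result set in memory at once. A common workaround (a sketch, not HoloClean's actual code; with HoloClean's Postgres backend you would pass a SQLAlchemy engine configured with `execution_options(stream_results=True)` so a server-side cursor is used) is to read the query in chunks. The example below uses an in-memory SQLite table as a stand-in so it is self-contained; the table name `pos_values` is taken from the log above:

```python
import sqlite3

import pandas as pd

# Stand-in for the Postgres backend: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pos_values (tid INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO pos_values VALUES (?, ?)",
    [(i, f"v{i}") for i in range(1000)],
)

total_rows = 0
# chunksize makes read_sql_query return an iterator of small DataFrames,
# bounding peak memory regardless of the table's size.
for chunk in pd.read_sql_query("SELECT * FROM pos_values", conn, chunksize=100):
    total_rows += len(chunk)  # placeholder for the real per-chunk work

print(total_rows)  # 1000
```

With Postgres, pairing `chunksize` with `stream_results=True` keeps the rows on the server until each chunk is fetched, which avoids the client-side allocation that triggers the error.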

nuarc commented 4 years ago

@richardwu @laferrieren @Ihabilyas @thodrek @zaqthss - can you please help with the above?

thodrek commented 4 years ago

@nuarc the current code is designed to work on a large memory machine and preferably a large memory GPU. A 32GB machine might not be adequate.

nuarc commented 4 years ago

@thodrek - thanks for responding to my query. I have actually tried executing it on a 732 GB machine:

  1. With a single thread and batch size = 1, the process halts after 4 hours at tensor generation, as shown in the screenshot above.
  2. With multiple processors and a batch size of 50, it did not complete even after 3 days.
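On the `tensor.cat` blow-up mentioned earlier: `torch.cat` allocates the combined tensor while every input chunk is still alive, so peak memory is roughly double the final size. A possible mitigation (a sketch under assumed shapes, not HoloClean's actual featurizer code) is to pre-allocate the output and copy chunks in one at a time, dropping each input reference as you go:

```python
import torch

# Hypothetical feature chunks standing in for the per-featurizer tensors;
# in a real run each of these can be hundreds of MB.
chunks = [torch.randn(1000, 16) for _ in range(8)]

# Pre-allocate the combined tensor once, then fill it slice by slice.
# Popping each chunk releases our reference so its memory can be freed,
# bounding the overhead to one chunk at a time instead of the full 2x
# that torch.cat(chunks) would need.
rows = sum(c.shape[0] for c in chunks)
out = torch.empty(rows, chunks[0].shape[1])
offset = 0
while chunks:
    c = chunks.pop(0)
    out[offset:offset + c.shape[0]] = c
    offset += c.shape[0]

print(tuple(out.shape))  # (8000, 16)
```

This trades a second pass over the chunks for a much flatter memory profile; whether it is enough for a 2-4 GB dataset on a 32 GB machine would still need to be measured.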