Blosc / bcolz

A columnar data container that can be compressed.
http://bcolz.blosc.org

ctable takes 16 hours (and still running) saving to disk - a better way?? #379

Open ghost opened 6 years ago

ghost commented 6 years ago

I am taking 71.6 GB of pickle files, converting them to dataframes, creating ctables, and appending to my on-disk (rootdir) ctable.

The process did not finish, so I killed it after 16 hours. My data has 2 columns: float32 and object, and the object column's strings are around 200 characters long.

My code:

import gc
import itertools

import bcolz
import numpy as np
import pandas as pd


def saving_bcolz():
    """Core save logic: build one on-disk ctable from many pickle files."""
    files = [... my data files ...]  # paths to the pickle files (elided here)
    cols = [np.zeros(0, dtype=dt) for dt in [np.dtype('float32'), np.dtype('object')]]
    ct = bcolz.ctable(cols, ['score', 'all_cols'], rootdir='/home/dump/using_bcolz_new/')

    # Read the pickles ten at a time, build an in-memory ctable, append it to disk.
    for chunk in group(files, 10):
        df = pd.concat([pd.read_pickle(f) for f in chunk], ignore_index=True)
        ct_import = bcolz.ctable.fromdataframe(df, expectedlen=len(df))
        del df; gc.collect()
        ct.append(ct_import)
        del ct_import; gc.collect()


def group(it, size):
    """Yield tuples of up to `size` items at a time from iterable `it`."""
    it = iter(it)
    return iter(lambda: tuple(itertools.islice(it, size)), ())

bcolz 1.2.1, pandas 0.22

Is there a better way to have bcolz store the data?

Any reason for the slowness?

alimanfoo commented 6 years ago

I'm guessing that because you initialize the ctable with zero-length columns, bcolz picks a small chunk length, which will cause slowness. I suggest providing the expectedlen=n argument when creating the ctable, where n is the total number of rows you expect in the final table. Alternatively, you could set the chunk length manually by providing chunklen=m, where m is a parameter you can tune.
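
For example, a minimal sketch of that suggestion, reusing the ctable call from the original post; n_rows_total is an assumed estimate you would compute from your own files:

import numpy as np
import bcolz

# Assumed estimate of the total row count across all pickle files.
n_rows_total = 500 * 10**6

cols = [np.zeros(0, dtype=dt) for dt in [np.dtype('float32'), np.dtype('object')]]

# expectedlen lets bcolz pick a sensibly large chunk length up front...
ct = bcolz.ctable(cols, ['score', 'all_cols'],
                  rootdir='/home/dump/using_bcolz_new/',
                  expectedlen=n_rows_total)

# ...or, alternatively, tune the chunk length (rows per chunk) directly:
# ct = bcolz.ctable(cols, ['score', 'all_cols'],
#                   rootdir='/home/dump/using_bcolz_new/',
#                   chunklen=2**20)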

CarstVaartjes commented 6 years ago

Another thing that we do: if the strings are recurring (think of things such as product codes), we create a unique integer for each code, save the mapping in a database or a pickle, and store only the integer in bcolz; this will radically improve your performance. If the strings are not recurring, then bcolz probably isn't the right fit for your use case, and you might want to look at something like Elasticsearch.
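
A rough sketch of that dictionary-encoding idea, using pandas' factorize and a pickled mapping; the dataframe, column names, and paths below are placeholders, not anyone's actual setup:

import pickle

import bcolz
import numpy as np
import pandas as pd

# Placeholder dataframe mirroring the two-column layout above.
df = pd.DataFrame({'score': np.random.rand(1000).astype('float32'),
                   'all_cols': ['code_%d' % (i % 50) for i in range(1000)]})

# Replace each distinct string with a small integer code.
codes, uniques = pd.factorize(df['all_cols'])
df['all_cols'] = codes.astype('int32')

# Persist the code -> string mapping separately (a pickle here; a database works too).
with open('/tmp/all_cols_mapping.pkl', 'wb') as fh:
    pickle.dump(list(uniques), fh)

# bcolz now only has to store fixed-width numeric columns.
ct = bcolz.ctable.fromdataframe(df, rootdir='/tmp/using_bcolz_int', mode='w')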

ghost commented 6 years ago

OK, thanks for the information; it looks like bcolz doesn't fit my use case. Dask was slow too. I'm just going to create an AWS EC2 instance with massive amounts of RAM :)

ps - eventually I will convert my object column to categorical (i.e., integer codes plus a lookup table). I will try bcolz again then.
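
For reference, that categorical conversion in pandas would look roughly like this; the toy frame is only illustrative:

import pandas as pd

# Toy frame mirroring the score / all_cols layout from the original post.
df = pd.DataFrame({'score': [0.1, 0.2, 0.1],
                   'all_cols': ['abc', 'def', 'abc']})

df['all_cols'] = df['all_cols'].astype('category')
int_codes = df['all_cols'].cat.codes        # compact integer codes, suitable for bcolz
lookup = df['all_cols'].cat.categories      # the unique strings, kept as the lookup table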