cran2367 / sgt

Sequence Graph Transform
104 stars · 21 forks

OverflowError: cannot serialize a bytes object larger than 4 GiB #10

Open daehwanahn opened 3 years ago

daehwanahn commented 3 years ago

Hi,

When I use multiprocessing, SGT raises an OverflowError. This is just a report. As workarounds, I'll consider 1) using pyspark instead of pandarallel, or 2) splitting the dataset.
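For option 2, a minimal sketch of the splitting workaround, assuming a pandas DataFrame corpus like the one in the traceback. The helper name `transform_in_chunks`, the `chunk_size` value, and the stand-in transform are illustrative, not part of the sgt API; in practice the per-chunk callable would be the fitted SGT instance's `transform`:

```python
import pandas as pd

def transform_in_chunks(df, transform, chunk_size=10_000):
    """Apply `transform` to `df` in row chunks and concatenate the results.

    Keeping each chunk small keeps every worker's pickled result well
    under the 4 GiB limit of older pickle protocols.
    """
    parts = [
        transform(df.iloc[start:start + chunk_size])
        for start in range(0, len(df), chunk_size)
    ]
    return pd.concat(parts, ignore_index=True)

# Stand-in for the real per-chunk transform (e.g. sgt.transform);
# an identity copy is used here just to show the chunking mechanics.
demo = pd.DataFrame({'id': range(7), 'sequence': ['ABAB'] * 7})
out = transform_in_chunks(demo, lambda chunk: chunk.copy(), chunk_size=3)
```

The chunks are processed independently, so each parallel map only ever has to serialize one chunk's worth of output at a time.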


```
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/dannyanexp/miniconda3/envs/tf/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/dannyanexp/miniconda3/envs/tf/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/dannyanexp/miniconda3/envs/tf/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 64, in global_worker
    return _func(x)
  File "/home/dannyanexp/miniconda3/envs/tf/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 120, in wrapper
    pickle.dump(result, file)
OverflowError: cannot serialize a bytes object larger than 4 GiB
"""
```

The above exception was the direct cause of the following exception:

```
OverflowError                             Traceback (most recent call last)
in <module>
     11 start = time.time()
     12 sgt = SGT(kappa=1, alphabets=alphabets, lengthsensitive=True, mode='multiprocessing')
---> 13 train_embedding = sgt.fit_transform(df_train)
     14 test_embedding = sgt.transform(df_test)
     15 train_embedding.to_csv('train_embed_f' + str(i+1) + '.csv')

~/miniconda3/envs/tf/lib/python3.7/site-packages/sgt/sgt.py in fit_transform(self, corpus)
    214             list(self.fit(x['sequence'])),
    215             axis=1,
--> 216             result_type='expand')
    217         sgt.columns = ['id'] + self.feature_names
    218         return sgt

~/miniconda3/envs/tf/lib/python3.7/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
    460             input_files,
    461             output_files,
--> 462             map_result,
    463         )
    464

~/miniconda3/envs/tf/lib/python3.7/site-packages/pandarallel/pandarallel.py in get_workers_result(use_memory_fs, nb_workers, show_progress_bar, nb_columns, queue, chunk_lengths, input_files, output_files, map_result)
    394         progress_bars.update(progresses)
    395
--> 396         results = map_result.get()
    397
    398         return (

~/miniconda3/envs/tf/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658
    659     def _set(self, i, obj):

OverflowError: cannot serialize a bytes object larger than 4 GiB
```
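For context on why the limit is exactly 4 GiB: pickle protocols 3 and below store a bytes object's length in a 4-byte field, so any payload over 4 GiB raises this OverflowError; protocol 4 (PEP 3154) added 8-byte length opcodes and lifted the limit. The traceback shows pandarallel calling `pickle.dump(result, file)` with no protocol argument, which on Python 3.7 means the default protocol 3. A small illustration of passing the protocol explicitly (the data here is tiny; this only demonstrates the call shape, not the large-object path):

```python
import pickle

# Serialize with protocol 4, which supports objects larger than 4 GiB.
# (Python 3.8+ already defaults to protocol 4 or higher, so upgrading
# the interpreter is another way around this error.)
payload = pickle.dumps(b"abc", protocol=4)
roundtrip = pickle.loads(payload)
```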