When using the entity service to deduplicate a file of ~4M records, I get 1.3 TiB of results spread over ~15k files in the object store.
Various tasks fail, but the run is not marked as failed.
Example 1
04:48:51 INFO Writing 15028431 intermediate results to file: chunk-res-8b18f25cad56f1df1473c3f2.csv
04:48:52 INFO Timings: Prep: 0.5757 + 0.5808, Solve: 10.6048, Total: 11.7613 Comparisons: 1000014129
04:48:57 INFO Timings: Prep: 0.6274 + 0.6117, Solve: 10.2025, Total: 11.4415 Comparisons: 1000014129
04:48:57 INFO Writing 12872267 intermediate results to file: chunk-res-e1b370c60204256510ec910f.csv
04:49:00 ERROR Task entityservice.async_worker.compute_filter_similarity[9576084c-dd4f-43ea-add0-6b453ef88ea1] raised unexpected: OSError(12, 'Out of memory')
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
return self.run(*args, **kwargs)
File "/var/www/entityservice/async_worker.py", line 500, in compute_filter_similarity
csvwriter.writerow(res_tuple)
OSError: [Errno 12] Out of memory
04:49:00 ERROR Task entityservice.async_worker.compute_filter_similarity[c25c5c58-e8c0-4d20-b063-de3c41b27b92] raised unexpected: OSError(12, 'Out of memory')
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
return self.run(*args, **kwargs)
File "/var/www/entityservice/async_worker.py", line 500, in compute_filter_similarity
csvwriter.writerow(res_tuple)
OSError: [Errno 12] Out of memory
04:49:00 INFO Received task: entityservice.async_worker.compute_filter_similarity[03b03ce5-2f8e-4f5f-ac53-f1ba06ac6e3e]
04:49:00 INFO Received task: entityservice.async_worker.compute_filter_similarity[400bd4a7-eadb-40c2-a45b-eec0346d9cdf]
[2018-06-11 04:49:00,913: ERROR/MainProcess] Process 'ForkPoolWorker-905' pid:913 exited with 'signal 9 (SIGKILL)'
04:49:00 ERROR Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
04:49:00 INFO Received task: entityservice.async_worker.compute_filter_similarity[5eda0082-ee59-46ec-9ef5-d2f8491d94b9]
04:49:02 INFO Writing 9931101 intermediate results to file: chunk-res-b61863d4d74b606d512e3dba.csv
04:49:05 INFO Timings: Prep: 0.4629 + 0.2127, Solve: 4.1513, Total: 4.8269 Comparisons: 451007226
04:49:06 INFO Writing 2866334 intermediate results to file: chunk-res-5ed52dab2aefbbff61a8c757.csv
04:49:11 INFO Timings: Prep: 1.0070 + 0.4646, Solve: 9.1148, Total: 10.5864 Comparisons: 1000014129
04:49:11 INFO Timings: Prep: 0.4796 + 0.4983, Solve: 9.6453, Total: 10.6232 Comparisons: 1000014129
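For scale, here is a very rough estimate of what the 15,028,431-row chunk above costs to hold in memory before it is written out, assuming the results are plain (row, col, score) tuples (my assumption about the representation, not a measurement):

```python
import sys

n_results = 15_028_431                      # from the "Writing 15028431 intermediate results" log line above
sample = (31_622, 17, 0.87)                 # assumed (row, col, score) result tuple
per_tuple = sys.getsizeof(sample) + sum(sys.getsizeof(x) for x in sample)
print(f"~{n_results * per_tuple / 2**30:.1f} GiB for the tuples alone")  # roughly 2 GiB on 64-bit CPython
```

With several such chunks in flight per worker container, that alone could plausibly push the container to ENOMEM.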
Example 2
05:23:46 INFO Timings: Prep: 0.4534 + 0.5002, Solve: 9.9839, Total: 10.9374 Comparisons: 1000014129
05:23:48 INFO Timings: Prep: 0.4858 + 0.4966, Solve: 8.9605, Total: 9.9429 Comparisons: 1000014129
05:23:49 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', OSError(12, 'Out of memory'))': /entityservice/chunk-res-1bed17a0ddb64ede62ad0df5.csv?partNumber=6&uploadId=e7958b1a-4df4-4bfd-8fda-27011ef7c65e
05:23:49 ERROR Task entityservice.async_worker.compute_filter_similarity[97a0f714-7d45-4dc7-adeb-bd8cfa797189] raised unexpected: UnrewindableBodyError('Unable to record file position for rewinding request body during a redirect/retry.',)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
return self.run(*args, **kwargs)
File "/var/www/entityservice/async_worker.py", line 506, in compute_filter_similarity
mc.fput_object(config.MINIO_BUCKET, result_filename, result_filename)
File "/usr/lib/python3.6/site-packages/minio/api.py", line 616, in fput_object
part_number)
File "/usr/lib/python3.6/site-packages/minio/api.py", line 1456, in _do_put_multipart_object
content_sha256=part_metadata.sha256hasher.hexdigest()
File "/usr/lib/python3.6/site-packages/minio/api.py", line 1744, in _url_open
preload_content=False)
File "/usr/lib/python3.6/site-packages/urllib3/poolmanager.py", line 321, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 668, in urlopen
**response_kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 584, in urlopen
body_pos = set_file_position(body, body_pos)
File "/usr/lib/python3.6/site-packages/urllib3/util/request.py", line 83, in set_file_position
rewind_body(body, pos)
File "/usr/lib/python3.6/site-packages/urllib3/util/request.py", line 114, in rewind_body
raise UnrewindableBodyError("Unable to record file position for rewinding "
urllib3.exceptions.UnrewindableBodyError: Unable to record file position for rewinding request body during a redirect/retry.
05:23:49 INFO Received task: entityservice.async_worker.compute_filter_similarity[25551bfa-4d12-461f-a4f5-e2dc7a7a7de7]
[2018-06-11 05:23:49,646: ERROR/MainProcess] Process 'ForkPoolWorker-1091' pid:1099 exited with 'signal 9 (SIGKILL)'
05:23:49 ERROR Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
05:23:49 INFO Received task: entityservice.async_worker.compute_filter_similarity[fe57dbcf-d644-40fc-9372-050b4a752786]
05:23:51 INFO Writing 11018405 intermediate results to file: chunk-res-42f37ffbacefd87bf7afcc73.csv
05:23:51 INFO Writing 7012431 intermediate results to file: chunk-res-9a96494e6103b1a964d4e8fc.csv
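My reading of the failing code path, pieced together from the two tracebacks above (async_worker.py lines 500 and 506). Only csvwriter.writerow and mc.fput_object are taken from the tracebacks; the surrounding structure and names are assumptions:

```python
import csv

def write_and_upload(chunk_results, result_filename, mc, bucket):
    # mc is assumed to be a minio.Minio client, bucket the configured MINIO_BUCKET.
    # Example 1: each result tuple for the chunk is written row by row; the
    # OSError(12, 'Out of memory') is raised from inside this loop (line 500).
    with open(result_filename, 'wt', newline='') as f:
        csvwriter = csv.writer(f)
        for res_tuple in chunk_results:
            csvwriter.writerow(res_tuple)
    # Example 2: the finished file is then uploaded as a multipart object (line 506).
    # When the same ENOMEM condition aborts the connection mid-part, urllib3 retries
    # the part but cannot rewind the request body, hence the UnrewindableBodyError.
    mc.fput_object(bucket, result_filename, result_filename)
```

If that is roughly right, Example 1 dies while materialising and writing the rows, and Example 2 dies a few lines later when urllib3 tries to replay a part upload whose body it cannot rewind.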
Example 3 (Chord callback)
06:07:13 ERROR Chord callback for '29ccc4a7-8aa9-42ec-b65b-c0219eb540ad' raised: ChordError("Dependency e9331a92-8992-4c37-8963-9aafd994427f raised OSError('[Errno 12] Out of memory',)",)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 289, in on_chord_part_return
callback.delay([unpack(tup, decode) for tup in resl])
File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 289, in <listcomp>
callback.delay([unpack(tup, decode) for tup in resl])
File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 242, in _unpack_chord_result
raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency e9331a92-8992-4c37-8963-9aafd994427f raised OSError('[Errno 12] Out of memory',)
06:07:14 INFO Task entityservice.async_worker.compute_filter_similarity[733388b0-d822-4224-b5c2-fa6cf5de742a] succeeded in 21.30179575100192s: (3620808, 'chunk-res-f39350bccfad0f2242225e3d.csv')
Note that at this point the run's state transitions to error.
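That matches how I understand the orchestration: the comparison is fanned out as a Celery chord, so a dependency failure only surfaces when the callback is due to run. A sketch under that assumption (compute_filter_similarity is the real task name from the logs; the app setup and callback name are hypothetical):

```python
from celery import Celery, chord

app = Celery(broker='redis://', backend='redis://')   # Redis result backend, as in the chord traceback

@app.task
def compute_filter_similarity(chunk_info):
    """Stand-in for the real comparison task from the logs."""
    ...

@app.task
def aggregate_comparisons(chunk_results):
    """Hypothetical name for the chord callback that finalises the run."""
    ...

# One header task per chunk; the callback only runs once all ~15,625 have reported.
header = [compute_filter_similarity.s(i) for i in range(15625)]
result = chord(header)(aggregate_comparisons.s())
```

When a header task dies with OSError [Errno 12], the Redis backend re-raises it as a ChordError at the point where it hands the collected results to the callback, and that is when the run state finally flips to error; task failures before that point leave the run state untouched.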
The setup logs:
02:45:59 INFO Computing similarity for 3935514 x 3935514 entities
02:45:59 INFO Computing similarity using greedy method
02:45:59 INFO Chunks will contain 31623 entities per task
02:45:59 INFO Chunking into 15625 computation tasks each with (at most) 31623 entities.
...
02:46:12 INFO Timings: Prep: 0.4839 + 0.8999, Solve: 8.3547, Total: 9.7385 Comparisons: 1000014129
02:46:12 INFO Timings: Prep: 0.4845 + 0.6814, Solve: 8.5892, Total: 9.7550 Comparisons: 1000014129
02:46:12 INFO Writing 423547 intermediate results to file: chunk-res-b3842a39b74a7d6704765995.csv
02:46:12 INFO Writing 608184 intermediate results to file: chunk-res-e1584442b00465b99440e61b.csv
...
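The chunking numbers in the setup log check out with some back-of-the-envelope arithmetic (my own, not project code):

```python
import math

n = 3_935_514                     # entities being deduplicated
chunk = 31_623                    # "entities per task" from the log
per_axis = math.ceil(n / chunk)   # 125 chunks along each axis
tasks = per_axis ** 2             # 15625 computation tasks, as logged
comparisons = chunk ** 2          # 1000014129 comparisons for a full chunk, as logged
avg_file_mib = 1.3 * 2**40 / tasks / 2**20   # ~87 MiB per result file on average
print(per_axis, tasks, comparisons, round(avg_file_mib))
```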
The first error occurred after about an hour of running; the worker was then terminated, and it was another hour until the next failure. So my guess is that we have a memory leak.
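One way to confirm that guess would be to log each worker process's peak RSS around every task and watch whether it only ever grows between restarts; a minimal diagnostic sketch (not existing entityservice code):

```python
import resource

def log_peak_rss(tag=''):
    # ru_maxrss is the process's peak resident set size, reported in KiB on Linux
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{tag} peak RSS: {peak_kib / 1024:.1f} MiB")
```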