data61 / anonlink-entity-service

Privacy Preserving Record Linkage Service
Apache License 2.0

Dealing with large results #211

hardbyte opened this issue 6 years ago (status: Open)

hardbyte commented 6 years ago

When using the entity service to deduplicate a ~4M-record file, I get 1.3 TiB of results spread over ~15k files stored in the object store.

Various tasks fail without the run being marked as failed.
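
For scale, that works out to roughly 90 MiB of similarity scores per intermediate result file (assuming one `chunk-res-*.csv` per comparison task, as the setup logs further down suggest):

```python
# Back-of-the-envelope size per intermediate result file
total_bytes = 1.3 * 2**40   # ~1.3 TiB of results
n_files = 15_625            # one chunk-res-*.csv per comparison task (see setup logs)
print(f"~{total_bytes / n_files / 2**20:.0f} MiB per file")  # ~87 MiB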

Example 1

04:48:51   INFO     Writing 15028431 intermediate results to file: chunk-res-8b18f25cad56f1df1473c3f2.csv
04:48:52   INFO     Timings: Prep: 0.5757 + 0.5808, Solve: 10.6048, Total: 11.7613  Comparisons: 1000014129
04:48:57   INFO     Timings: Prep: 0.6274 + 0.6117, Solve: 10.2025, Total: 11.4415  Comparisons: 1000014129
04:48:57   INFO     Writing 12872267 intermediate results to file: chunk-res-e1b370c60204256510ec910f.csv
04:49:00   ERROR    Task entityservice.async_worker.compute_filter_similarity[9576084c-dd4f-43ea-add0-6b453ef88ea1] raised unexpected: OSError(12, 'Out of memory')
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/var/www/entityservice/async_worker.py", line 500, in compute_filter_similarity
    csvwriter.writerow(res_tuple)
OSError: [Errno 12] Out of memory
04:49:00   ERROR    Task entityservice.async_worker.compute_filter_similarity[c25c5c58-e8c0-4d20-b063-de3c41b27b92] raised unexpected: OSError(12, 'Out of memory')
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/var/www/entityservice/async_worker.py", line 500, in compute_filter_similarity
    csvwriter.writerow(res_tuple)
OSError: [Errno 12] Out of memory
04:49:00   INFO     Received task: entityservice.async_worker.compute_filter_similarity[03b03ce5-2f8e-4f5f-ac53-f1ba06ac6e3e]  
04:49:00   INFO     Received task: entityservice.async_worker.compute_filter_similarity[400bd4a7-eadb-40c2-a45b-eec0346d9cdf]  
[2018-06-11 04:49:00,913: ERROR/MainProcess] Process 'ForkPoolWorker-905' pid:913 exited with 'signal 9 (SIGKILL)'
04:49:00   ERROR    Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
04:49:00   INFO     Received task: entityservice.async_worker.compute_filter_similarity[5eda0082-ee59-46ec-9ef5-d2f8491d94b9]  
04:49:02   INFO     Writing 9931101 intermediate results to file: chunk-res-b61863d4d74b606d512e3dba.csv
04:49:05   INFO     Timings: Prep: 0.4629 + 0.2127, Solve: 4.1513, Total: 4.8269  Comparisons: 451007226
04:49:06   INFO     Writing 2866334 intermediate results to file: chunk-res-5ed52dab2aefbbff61a8c757.csv
04:49:11   INFO     Timings: Prep: 1.0070 + 0.4646, Solve: 9.1148, Total: 10.5864  Comparisons: 1000014129
04:49:11   INFO     Timings: Prep: 0.4796 + 0.4983, Solve: 9.6453, Total: 10.6232  Comparisons: 1000014129
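
The OOM in Example 1 hits exactly where each task materialises its chunk's candidate pairs and writes them out: the traceback points at async_worker.py:500 (`csvwriter.writerow`) and, in Example 2 below, :506 (`mc.fput_object`). A hypothetical sketch of that write-and-upload shape (illustrative names, not the actual implementation) is:

```python
# Hypothetical sketch of the write-and-upload step behind async_worker.py:500/506.
# Only csv.writer and Minio.fput_object are taken from the traceback; the rest is illustrative.
import csv
from minio import Minio

def write_and_upload_results(results, result_filename, mc: Minio, bucket: str):
    """Write candidate-pair tuples to a local CSV, then upload it to the object store."""
    with open(result_filename, "wt", newline="") as f:
        writer = csv.writer(f)
        for res_tuple in results:          # ~3-15M rows per chunk in these logs
            writer.writerow(res_tuple)     # Example 1 dies here with OSError 12
    # Example 2 dies inside this multipart upload when a retry can't rewind the body
    mc.fput_object(bucket, result_filename, result_filename)
    return result_filename
```

With 10-15 million result tuples held as Python objects per chunk, plus the local CSV, each task plausibly needs several GiB, so a few concurrent tasks on a 14-25 GiB worker can exhaust memory right at these two calls.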

Example 2

05:23:46   INFO     Timings: Prep: 0.4534 + 0.5002, Solve: 9.9839, Total: 10.9374  Comparisons: 1000014129
05:23:48   INFO     Timings: Prep: 0.4858 + 0.4966, Solve: 8.9605, Total: 9.9429  Comparisons: 1000014129
05:23:49   WARNING  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', OSError(12, 'Out of memory'))': /entityservice/chunk-res-1bed17a0ddb64ede62ad0df5.csv?partNumber=6&uploadId=e7958b1a-4df4-4bfd-8fda-27011ef7c65e
05:23:49   ERROR    Task entityservice.async_worker.compute_filter_similarity[97a0f714-7d45-4dc7-adeb-bd8cfa797189] raised unexpected: UnrewindableBodyError('Unable to record file position for rewinding request body during a redirect/retry.',)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/var/www/entityservice/async_worker.py", line 506, in compute_filter_similarity
    mc.fput_object(config.MINIO_BUCKET, result_filename, result_filename)
  File "/usr/lib/python3.6/site-packages/minio/api.py", line 616, in fput_object
    part_number)
  File "/usr/lib/python3.6/site-packages/minio/api.py", line 1456, in _do_put_multipart_object
    content_sha256=part_metadata.sha256hasher.hexdigest()
  File "/usr/lib/python3.6/site-packages/minio/api.py", line 1744, in _url_open
    preload_content=False)
  File "/usr/lib/python3.6/site-packages/urllib3/poolmanager.py", line 321, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 668, in urlopen
    **response_kw)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 584, in urlopen
    body_pos = set_file_position(body, body_pos)
  File "/usr/lib/python3.6/site-packages/urllib3/util/request.py", line 83, in set_file_position
    rewind_body(body, pos)
  File "/usr/lib/python3.6/site-packages/urllib3/util/request.py", line 114, in rewind_body
    raise UnrewindableBodyError("Unable to record file position for rewinding "
urllib3.exceptions.UnrewindableBodyError: Unable to record file position for rewinding request body during a redirect/retry.
05:23:49   INFO     Received task: entityservice.async_worker.compute_filter_similarity[25551bfa-4d12-461f-a4f5-e2dc7a7a7de7]  
[2018-06-11 05:23:49,646: ERROR/MainProcess] Process 'ForkPoolWorker-1091' pid:1099 exited with 'signal 9 (SIGKILL)'
05:23:49   ERROR    Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
05:23:49   INFO     Received task: entityservice.async_worker.compute_filter_similarity[fe57dbcf-d644-40fc-9372-050b4a752786]  
05:23:51   INFO     Writing 11018405 intermediate results to file: chunk-res-42f37ffbacefd87bf7afcc73.csv
05:23:51   INFO     Writing 7012431 intermediate results to file: chunk-res-9a96494e6103b1a964d4e8fc.csv
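
The UnrewindableBodyError in Example 2 looks like secondary fallout from the same memory pressure: urllib3 only retries an upload if it managed to record the body's file position before the first attempt. A contrived illustration of that mechanism, using the same urllib3 helpers named in the traceback (the OSError is simulated):

```python
# Contrived illustration of the Example 2 mechanism. set_file_position() tries
# body.tell() before the first attempt; if that fails (here simulated ENOMEM),
# the later retry calls rewind_body() and raises UnrewindableBodyError instead
# of resending the multipart chunk.
import io
from urllib3.util.request import set_file_position, rewind_body
from urllib3.exceptions import UnrewindableBodyError

class BrokenTell(io.BytesIO):
    def tell(self):
        raise OSError(12, "Out of memory")   # mimic the ENOMEM seen in the logs

body = BrokenTell(b"intermediate results")
pos = set_file_position(body, None)          # fails to record a position
try:
    rewind_body(body, pos)                   # what urllib3 does before a retry
except UnrewindableBodyError as err:
    print(err)  # "Unable to record file position for rewinding request body ..."
```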

Example 3 (Chord callback)

 06:07:13   ERROR    Chord callback for '29ccc4a7-8aa9-42ec-b65b-c0219eb540ad' raised: ChordError("Dependency e9331a92-8992-4c37-8963-9aafd994427f raised OSError('[Errno 12] Out of memory',)",)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 289, in on_chord_part_return
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 289, in <listcomp>
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/usr/lib/python3.6/site-packages/celery/backends/redis.py", line 242, in _unpack_chord_result
    raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency e9331a92-8992-4c37-8963-9aafd994427f raised OSError('[Errno 12] Out of memory',)
06:07:14   INFO     Task entityservice.async_worker.compute_filter_similarity[733388b0-d822-4224-b5c2-fa6cf5de742a] succeeded in 21.30179575100192s: (3620808, 'chunk-res-f39350bccfad0f2242225e3d.csv') 

Note that at this point, once the chord callback itself fails with the ChordError, the run's state does transition to error.

The setup logs:

02:45:59   INFO     Computing similarity for 3935514 x 3935514 entities
02:45:59   INFO     Computing similarity using greedy method
02:45:59   INFO     Chunks will contain 31623 entities per task
02:45:59   INFO     Chunking into 15625 computation tasks each with (at most) 31623 entities.
...
02:46:12   INFO     Timings: Prep: 0.4839 + 0.8999, Solve: 8.3547, Total: 9.7385  Comparisons: 1000014129
02:46:12   INFO     Timings: Prep: 0.4845 + 0.6814, Solve: 8.5892, Total: 9.7550  Comparisons: 1000014129
02:46:12   INFO     Writing 423547 intermediate results to file: chunk-res-b3842a39b74a7d6704765995.csv
02:46:12   INFO     Writing 608184 intermediate results to file: chunk-res-e1584442b00465b99440e61b.csv
...
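
The chunking numbers in those setup lines check out; a quick sanity calculation using only values from the logs:

```python
# Sanity-check the chunking reported in the setup logs
n_entities = 3_935_514
chunk_size = 31_623

chunks_per_axis = -(-n_entities // chunk_size)   # ceil division -> 125
n_tasks = chunks_per_axis ** 2                   # 15_625 tasks, as logged
comparisons_per_task = chunk_size ** 2           # 1_000_014_129, matching the "Comparisons" lines
total_comparisons = n_entities ** 2              # ~1.55e13 comparisons overall

print(chunks_per_axis, n_tasks, comparisons_per_task, f"{total_comparisons:.2e}")
```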

The worker configuration:


workers:
  LARGE_COMPARISON_CHUNK_SIZE: "1000000000"
  LARGE_JOB_SIZE: "100000000000"
  MATCH_THRESHOLD: "0.95"
  MAX_CACHE_SIZE: "1000000"
  SMALL_COMPARISON_CHUNK_SIZE: "100000000"
  SMALL_JOB_SIZE: "100000000"
  celery:
    ACKS_LATE: "true"
    MAX_TASKS_PER_CHILD: "2"
    PREFETCH_MULTIPLIER: "1"
  debug: "true"
  extraArgs: {}
  image:
    pullPolicy: Always
    pullSecret: n1-quay-pull-secret
    repository: quay.io/n1analytics/entity-app
    tag: v1.8.1-feature-use-store-for-scores
  monitor:
    enabled: false
  name: matcher
  podAnnotations: {}
  replicaCount: 6
  autoscaler:
    enabled: false
  resources:
    requests:
      cpu: 6000m
      memory: 14Gi
    limits:
      memory: 25Gi
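
For context, the celery block presumably maps onto the standard Celery settings below (an assumption about the settings module, not a quote of the entity-service code). With MAX_TASKS_PER_CHILD set to 2, each forked process is already recycled after two chunks, which should bound any per-task leak:

```python
# Assumed mapping of the chart's workers.celery values onto standard Celery settings
from celery import Celery

app = Celery("entityservice", broker="redis://redis:6379/0")
app.conf.update(
    task_acks_late=True,            # ACKS_LATE: "true"  (re-queue chunks if a worker dies)
    worker_max_tasks_per_child=2,   # MAX_TASKS_PER_CHILD: "2"  (recycle processes to contain leaks)
    worker_prefetch_multiplier=1,   # PREFETCH_MULTIPLIER: "1"  (no extra chunks reserved per process)
)
```
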
hardbyte commented 6 years ago

The first error occurred after about an hour of running; the worker was then terminated, and it was another hour until the next failure. So my guess is that we have a memory leak. The green sidebar marks the error logs:

(screenshot: memory-issue)