DataBiosphere / getm

Concurrent reads of HTTP URL addressed data
MIT License
6 stars 1 forks source link

Failure to end process or retry upon connection reset error #135

Open abaumann opened 5 months ago

abaumann commented 5 months ago

We have a reproducible test case from a collaborator trying to use getm within Terra for TCGA data - this happens transiently, but can be reproduced when trying to download at scale from TCGA buckets. I am not sure why it happens as there are simply Google signed urls, but getm doesn't handle the error gracefully, leaving a process open and causing the WDL task to hang indefinitely (costing $ for the user).

In order to help resolve this I wanted to check if the following could be updated in getm:

Thank you!

Here is output from getm running in very verbose logging mode which shows the error:

DEBUG:getm.cli:{"progress_class": "ProgressLogger"}
DEBUG:getm.cli:{"message": "validating manifest", "manifest": [{"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200441Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4c006e811411f156a76...682b", "filepath": "/cromwell_root", "checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum-algorithm": "md5"}, {"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/c05728b2-70bd-4a9b-a6d6-abc6127f9b5a/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200443Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4eb6a3cc2bbbb14a428ff80a8c74e3f...bd8", "filepath": "/cromwell_root", "checksum": "1b0e4d4571f949df5e266218a1af2ad4", "checksum-algorithm": "md5"}], "schema": {"type": "array", "items": {"type": "object", "properties": {"url": {"type": "string"}, "filepath": {"type": "string"}, "checksum": {"type": "string"}, "checksum-algorithm": {"type": "string", "enum": ["md5", "gs_crc32c", "s3_etag", "null"]}}, "required": ["url"], "dependencies": {"checksum": ["checksum-algorithm"], "checksum-algorithm": ["checksum"]}}}}
DEBUG:getm.cli:{"multipart_buffer_size": 104857600}
INFO:getm.cli:{"message": "initiating multipart download", "url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fs...357dd225636cde32eef08510e01144c226b2ad682b", "expected_checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum_algorithm": "md5"}
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 104.4MiB/s 2.453076s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 102.0MiB/s 5.017502s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 109.7MiB/s 7.001775s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 8.891281s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 11.107452s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 110.4MiB/s 13.914369s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 106.4MiB/s 16.840150s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 103.6MiB/s 19.760566s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 101.5MiB/s 22.693003s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 100.0MiB/s 25.596672s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 98.7MiB/s 28.521763s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 97.7MiB/s 31.447519s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 96.9MiB/s 34.359779s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 95.2MiB/s 37.633454s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 90.8MiB/s 42.295202s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 88.2MiB/s 46.439717s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 89.7MiB/s 48.525774s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 91.2MiB/s 50.532638s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.5MiB/s 52.563134s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.3MiB/s 54.882982s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.1MiB/s 57.742125s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.9MiB/s 60.629910s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.6MiB/s 63.552936s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.3MiB/s 66.548761s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 92.0MiB/s 69.555787s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.7MiB/s 72.567698s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.5MiB/s 75.553901s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.3MiB/s 78.505067s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 81.459932s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 84.325711s
Process URLReaderKeepAlive-1:1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 525, in _fp_read
data = self._fp.read(chunk_amt)
File "/usr/lib/python3.9/http/client.py", line 458, in read
n = self.readinto(b)
File "/usr/lib/python3.9/http/client.py", line 502, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 190, in run
bytes_read = handle.readinto(buf[stop: stop + self.chunk_size])
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 737, in readinto
temp = self.read(len(b))
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 593, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 194, in run
buf.stop = stop
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 90, in __exit__
self.close()
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 82, in close
sm.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist
Exception ignored in: <function SharedMemory.__del__ at 0x7e5ca92ba550>
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
self.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist
aednichols commented 2 months ago

@xbrianh if you're still active around these parts, would love to get your thoughts. We're seeing this issue in production Terra fairly often. Thanks!

aednichols commented 2 months ago

cc @benedictpaten

xbrianh commented 2 months ago

Hi @aednichols It’s been a while, but I’ll try to take a look this weekend.

xbrianh commented 2 months ago

@abaumann @aednichols @benedictpaten

I've created a draft PR that addressed some of the ungraceful exit issues raised by @abaumann. This is by no means ready to merge but it should point you in the right direction, especially this snippet.

I'd be open to a brief discussion with whomever maintains this repo, just for old times sake :)