Closed dtitov closed 6 years ago
Making use of https://github.com/NBISweden/ega-data-api-dockerized/blob/bootstrap/lega_test.makefile (file that requires a bit of adjustments such as the inclusion of stable_id
for publish.py
) to replicate the issue was not fruitful (file was ingested correctly):
ingest-swe1_1 | [ crypto ][DEBUG ] (L197) Finalizing stream of data
ingest-swe1_1 | [ crypto ][DEBUG ] (L246) Calculated digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1 | [ crypto ][DEBUG ] (L247) Compared to digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1 | [ crypto ][ INFO ] (L262) File encrypted
ingest-swe1_1 | [ingestion ][DEBUG ] (L151) Re-encryption completed
ingest-swe1_1 | [ingestion ][DEBUG ] (L154) Reply message: {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': {'file_id': 1, 'user_id': 'john', 'filepath': '/ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b'}, 'status': {'state': 'PROCESSING', 'details': None}}
ingest-swe1_1 | [ amqp ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': {'file_id': 1, 'user_id': 'john', 'filepath': '/ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b'}, 'status': {'state': 'PROCESSING', 'details': None}} to exchange: lega [routing key: staged]
ingest-swe1_1 | [ amqp ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [ amqp ][DEBUG ] (L97) Consuming message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [ vault ][DEBUG ] (L36) Target: /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [ vault ][DEBUG ] (L42) Moving /ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b to /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [ db ][DEBUG ] (L196) Setting final name for file_id 1: /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [ db ][DEBUG ] (L48) 30 attempts (every 1 seconds)
ega-vault-swe1 | [ db ][ INFO ] (L41) Initializing a connection to: ega-db-swe1:5432/lega
ega-vault-swe1 | [ amqp ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': 1, 'status': {'state': 'ARCHIVED', 'message': 'File moved to the vault'}} to exchange: lega [routing key: archived]
ega-vault-swe1 | [ amqp ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [ amqp ][DEBUG ] (L97) Consuming message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [ verify ][DEBUG ] (L29) Verifying message: {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': 1, 'status': {'state': 'ARCHIVED', 'message': 'File moved to the vault'}}
ega-vault-swe1 | [ db ][DEBUG ] (L48) 30 attempts (every 1 seconds)
ega-vault-swe1 | [ db ][ INFO ] (L41) Initializing a connection to: ega-db-swe1:5432/lega
ega-vault-swe1 | [ amqp ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'status': {'state': 'COMPLETED', 'details': '1245'}} to exchange: lega [routing key: completed]
ega-vault-swe1 | [ amqp ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
Next step to replicate it from integration tests.
Seems to be confirmed when running from test suite:
[ crypto ][DEBUG ] (L246) Calculated digest: 903a763372900eabc520f7dc35eaed52
ingest-swe1_1 | [ crypto ][DEBUG ] (L247) Compared to digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1 | [ crypto ][ERROR ] (L255) Invalid md5 checksum for the original file
ingest-swe1_1 | [ crypto ][WARNING] (L258) Removing /ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b
ingest-swe1_1 | [ db ][ERROR ] (L254) Exception: <class 'lega.utils.exceptions.Checksum'> in /root/.local/lib/python3.6/site-packages/lega/ingest.py on line: 149
ingest-swe1_1 | [ db ][ERROR ] (L255) Invalid md5 checksum for the original file: /ega/inbox/test/data3266561349930665832.raw.enc
ingest-swe1_1 | [ db ][DEBUG ] (L153) Setting error for 2: Invalid md5 checksum for the original file
ingest-swe1_1 | [ db ][ INFO ] (L265) Sending user error to local broker: {'user': 'test', 'filepath': 'data3266561349930665832.raw.enc', 'stable_id': 'EGAFnull', 'encrypted_integrity': {'checksum': 'ba1fa471a7f7a0837a279103be9cc30d', 'algorithm': 'md5'}, 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'status': {'state': 'ERROR', 'message': 'Invalid md5 checksum for the original file'}}
ingest-swe1_1 | [ amqp ][DEBUG ] (L39) Setting hearbeat to 0
ingest-swe1_1 | [ amqp ][ INFO ] (L51) Getting a connection to broker
ingest-swe1_1 | [ amqp ][DEBUG ] (L52) {'host': 'ega-mq-swe1', 'port': 5672, 'virtual_host': '/', 'credentials': <pika.credentials.PlainCredentials object at 0x7fba58ddfda0>, 'connection_attempts': 2, 'heartbeat_interval': 0}
ega-mq-swe1 |
ega-mq-swe1 | =INFO REPORT==== 15-May-2018::11:37:09 ===
ega-mq-swe1 | accepting AMQP connection <0.698.0> (172.20.0.6:56618 -> 172.20.0.3:5672)
ega-mq-swe1 |
ega-mq-swe1 | =INFO REPORT==== 15-May-2018::11:37:09 ===
ega-mq-swe1 | connection <0.698.0> (172.20.0.6:56618 -> 172.20.0.3:5672): user 'guest' authenticated and granted access to vhost '/'
ingest-swe1_1 | [ amqp ][DEBUG ] (L62) Sending {'user': 'test', 'filepath': 'data3266561349930665832.raw.enc', 'stable_id': 'EGAFnull', 'encrypted_integrity': {'checksum': 'ba1fa471a7f7a0837a279103be9cc30d', 'algorithm': 'md5'}, 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'status': {'state': 'ERROR', 'message': 'Invalid md5 checksum for the original file'}} to exchange: cega [routing key: files.error]
ingest-swe1_1 | [ amqp ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: None)
After deeper investigation with @blankdots we figured out that it happens only with the newest BC or with gpg 2.2.7 (although such files can be decrypted by the same tools). Considering the fact that it's quite uncommon case (having sequence of zero-bytes in the genomic file), let's move it to the backlog for now.
Actually, the same happens if the file is filled with arbitrary bytes:
So, I'm bringing this one back from the backlog, I guess this issue should be addressed.
Empirically I detected that it fails even for random-characters file bigger than 100KB, e.g.: (https://gist.github.com/dtitov/c170ae5749697623540cd53100d0c7f3)
Files smaller than 100KBs are fine. Compression is still "Uncompressed", encryption is "AES256".
The Crypt4GH branch should solve that issue.
I'll close it, after the branch is merged into dev
Here's an example file: https://we.tl/ehhFmnDfNs
Basically, it's a 10MB file starting with "hello" (
68 65 6c 6c 6f
) and the rest is zero-bytes (00 00 ...
).This file can be encrypted and decrypted successfully by the
gpg
tool. But this file fails to be decrypted during ingestion:Invalid md5 checksum for the original file: /ega/inbox/test/data776277145814638337.raw.enc
I've tried to figure out, why does it happen and it appeared that decrypted file's length doesn't match the original file's length. Original files length: 10485760 bytes. Decrypted file length: 10486021 bytes.
I have no idea where do these extra bytes come from, but obviously, the file is not being decrypted properly.
Used encryption angorithm:
AES256
. Compression:Uncompressed
. I've tried other algorithms (likeCAST5
+ZIP
), but it also didn't work.I suspect that the problem is in zero-bytes, but still, as I mentioned,
gpg
tool manages to decrypt the file without any problems.