NBISweden / LocalEGA

Please go to to https://github.com/EGA-archive/LocalEGA instead
Apache License 2.0
4 stars 1 forks source link

Re-encryption module is not able to properly decrypt some files #297

Closed dtitov closed 6 years ago

dtitov commented 6 years ago

Here's an example file: https://we.tl/ehhFmnDfNs

Basically, it's a 10MB file starting with "hello" (68 65 6c 6c 6f) and the rest is zero-bytes (00 00 ...).

This file can be encrypted and decrypted successfully by the gpg tool. But this file fails to be decrypted during ingestion: Invalid md5 checksum for the original file: /ega/inbox/test/data776277145814638337.raw.enc

I've tried to figure out, why does it happen and it appeared that decrypted file's length doesn't match the original file's length. Original files length: 10485760 bytes. Decrypted file length: 10486021 bytes.

I have no idea where do these extra bytes come from, but obviously, the file is not being decrypted properly.

Used encryption angorithm: AES256. Compression: Uncompressed. I've tried other algorithms (like CAST5 + ZIP), but it also didn't work.

I suspect that the problem is in zero-bytes, but still, as I mentioned, gpg tool manages to decrypt the file without any problems.

blankdots commented 6 years ago

Making use of https://github.com/NBISweden/ega-data-api-dockerized/blob/bootstrap/lega_test.makefile (file that requires a bit of adjustments such as the inclusion of stable_id for publish.py) to replicate the issue was not fruitful (file was ingested correctly):

ingest-swe1_1  | [  crypto  ][DEBUG ] (L197) Finalizing stream of data
ingest-swe1_1  | [  crypto  ][DEBUG ] (L246) Calculated digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1  | [  crypto  ][DEBUG ] (L247) Compared to digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1  | [  crypto  ][ INFO ] (L262) File encrypted
ingest-swe1_1  | [ingestion ][DEBUG ] (L151) Re-encryption completed
ingest-swe1_1  | [ingestion ][DEBUG ] (L154) Reply message: {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': {'file_id': 1, 'user_id': 'john', 'filepath': '/ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b'}, 'status': {'state': 'PROCESSING', 'details': None}}
ingest-swe1_1  | [   amqp   ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': {'file_id': 1, 'user_id': 'john', 'filepath': '/ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b'}, 'status': {'state': 'PROCESSING', 'details': None}} to exchange: lega [routing key: staged]
ingest-swe1_1  | [   amqp   ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [   amqp   ][DEBUG ] (L97) Consuming message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [  vault   ][DEBUG ] (L36) Target: /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [  vault   ][DEBUG ] (L42) Moving /ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b to /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [    db    ][DEBUG ] (L196) Setting final name for file_id 1: /ega/vault/000/000/000/000/000/000/01
ega-vault-swe1 | [    db    ][DEBUG ] (L48) 30 attempts (every 1 seconds)
ega-vault-swe1 | [    db    ][ INFO ] (L41) Initializing a connection to: ega-db-swe1:5432/lega
ega-vault-swe1 | [   amqp   ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': 1, 'status': {'state': 'ARCHIVED', 'message': 'File moved to the vault'}} to exchange: lega [routing key: archived]
ega-vault-swe1 | [   amqp   ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [   amqp   ][DEBUG ] (L97) Consuming message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)
ega-vault-swe1 | [  verify  ][DEBUG ] (L29) Verifying message: {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'internal_data': 1, 'status': {'state': 'ARCHIVED', 'message': 'File moved to the vault'}}
ega-vault-swe1 | [    db    ][DEBUG ] (L48) 30 attempts (every 1 seconds)
ega-vault-swe1 | [    db    ][ INFO ] (L41) Initializing a connection to: ega-db-swe1:5432/lega
ega-vault-swe1 | [   amqp   ][DEBUG ] (L62) Sending {'stable_id': '1245', 'encrypted_integrity': {'checksum': '2573ac7c06312c0762083a52e4264d67', 'algorithm': 'md5'}, 'user': 'john', 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'filepath': 'dir/file.gpg', 'status': {'state': 'COMPLETED', 'details': '1245'}} to exchange: lega [routing key: completed]
ega-vault-swe1 | [   amqp   ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: 4456ea27-0fb9-4dc9-b939-89dc08303fd4)

Next step to replicate it from integration tests.

blankdots commented 6 years ago

Seems to be confirmed when running from test suite:

[  crypto  ][DEBUG ] (L246) Calculated digest: 903a763372900eabc520f7dc35eaed52
ingest-swe1_1  | [  crypto  ][DEBUG ] (L247) Compared to digest: 06e7eeff79f79962d89639cb5aa54d03
ingest-swe1_1  | [  crypto  ][ERROR ] (L255) Invalid md5 checksum for the original file
ingest-swe1_1  | [  crypto  ][WARNING] (L258) Removing /ega/staging/c4304fd1-7c08-5635-90b2-df645b8bc05b
ingest-swe1_1  | [    db    ][ERROR ] (L254) Exception: <class 'lega.utils.exceptions.Checksum'> in /root/.local/lib/python3.6/site-packages/lega/ingest.py on line: 149
ingest-swe1_1  | [    db    ][ERROR ] (L255) Invalid md5 checksum for the original file: /ega/inbox/test/data3266561349930665832.raw.enc
ingest-swe1_1  | [    db    ][DEBUG ] (L153) Setting error for 2: Invalid md5 checksum for the original file
ingest-swe1_1  | [    db    ][ INFO ] (L265) Sending user error to local broker: {'user': 'test', 'filepath': 'data3266561349930665832.raw.enc', 'stable_id': 'EGAFnull', 'encrypted_integrity': {'checksum': 'ba1fa471a7f7a0837a279103be9cc30d', 'algorithm': 'md5'}, 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'status': {'state': 'ERROR', 'message': 'Invalid md5 checksum for the original file'}}
ingest-swe1_1  | [   amqp   ][DEBUG ] (L39) Setting hearbeat to 0
ingest-swe1_1  | [   amqp   ][ INFO ] (L51) Getting a connection to broker
ingest-swe1_1  | [   amqp   ][DEBUG ] (L52) {'host': 'ega-mq-swe1', 'port': 5672, 'virtual_host': '/', 'credentials': <pika.credentials.PlainCredentials object at 0x7fba58ddfda0>, 'connection_attempts': 2, 'heartbeat_interval': 0}
ega-mq-swe1    | 
ega-mq-swe1    | =INFO REPORT==== 15-May-2018::11:37:09 ===
ega-mq-swe1    | accepting AMQP connection <0.698.0> (172.20.0.6:56618 -> 172.20.0.3:5672)
ega-mq-swe1    | 
ega-mq-swe1    | =INFO REPORT==== 15-May-2018::11:37:09 ===
ega-mq-swe1    | connection <0.698.0> (172.20.0.6:56618 -> 172.20.0.3:5672): user 'guest' authenticated and granted access to vhost '/'
ingest-swe1_1  | [   amqp   ][DEBUG ] (L62) Sending {'user': 'test', 'filepath': 'data3266561349930665832.raw.enc', 'stable_id': 'EGAFnull', 'encrypted_integrity': {'checksum': 'ba1fa471a7f7a0837a279103be9cc30d', 'algorithm': 'md5'}, 'unencrypted_integrity': {'checksum': '06e7eeff79f79962d89639cb5aa54d03', 'algorithm': 'md5'}, 'status': {'state': 'ERROR', 'message': 'Invalid md5 checksum for the original file'}} to exchange: cega [routing key: files.error]
ingest-swe1_1  | [   amqp   ][DEBUG ] (L107) Sending ACK for message 1 (Correlation ID: None)
dtitov commented 6 years ago

After deeper investigation with @blankdots we figured out that it happens only with the newest BC or with gpg 2.2.7 (although such files can be decrypted by the same tools). Considering the fact that it's quite uncommon case (having sequence of zero-bytes in the genomic file), let's move it to the backlog for now.

dtitov commented 6 years ago

Actually, the same happens if the file is filled with arbitrary bytes:

So, I'm bringing this one back from the backlog, I guess this issue should be addressed.

dtitov commented 6 years ago

Empirically I detected that it fails even for random-characters file bigger than 100KB, e.g.: (https://gist.github.com/dtitov/c170ae5749697623540cd53100d0c7f3)

Files smaller than 100KBs are fine. Compression is still "Uncompressed", encryption is "AES256".

silverdaz commented 6 years ago

The Crypt4GH branch should solve that issue. I'll close it, after the branch is merged into dev