SecurityInnovation / PGPy

Pretty Good Privacy for Python
BSD 3-Clause "New" or "Revised" License

Decrypting HUGE files #216

Open silverdaz opened 7 years ago

silverdaz commented 7 years ago

Here is my current use-case: huge files (a GB or more) must be PGP-decrypted and immediately re-encrypted. Currently I'm using subprocess to call GnuPG and pipe its output to my Python process for re-encryption.

GnuPG decrypts block by block, so what I receive in the pipe is sent to a generator for re-encryption.

I therefore stay within memory bounds, even for huge files, and decryption and re-encryption run in parallel.
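Roughly, the workaround looks like this (just a sketch: gpg_decrypt_chunks, the flags, and the chunk size are illustrative, and it assumes gpg is on the PATH with the key available in the local keyring):

import subprocess

def gpg_decrypt_chunks(path, chunk_size=1 << 20):
    """Decrypt `path` via the gpg CLI and yield plaintext chunks as they arrive."""
    proc = subprocess.Popen(['gpg', '--batch', '--decrypt', path],
                            stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk  # each chunk goes straight to the re-encryption generator
    finally:
        proc.stdout.close()
        proc.wait()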

As far as I understand, I cannot do that with PGPy, since it uses a bytearray to store the message content internally. Moreover, with huge files to decrypt, it blows up the memory.

It's a show-stopper for me at this point. Would it be possible to turn the bytearray(os.path.getsize(filepath)) into a generator?

We could then have key.decrypt(msg, sink=None): if sink is None, use the old bytearray; otherwise, use the given file-like object or pipe through its read/write API.
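For illustration only, since none of this exists in PGPy today, usage of such a sink parameter could look roughly like this (the file names and passphrase are made up):

import pgpy

# Purely hypothetical usage of the proposed `sink` parameter (not in PGPy's API today):
key, _ = pgpy.PGPKey.from_file('private.key')
msg = pgpy.PGPMessage.from_file('huge_file.pgp')   # would itself need to stream, see above

with key.unlock('passphrase'), open('plaintext.out', 'wb') as sink:
    key.decrypt(msg, sink=sink)      # proposed: decrypted data written to `sink` chunk by chunk
    # key.decrypt(msg)               # sink=None: current behaviour, whole result buffered in a bytearray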

I'm not that new to Python, but the library is fairly complex, so you would have better chances than I at fixing this issue.

Commod0re commented 7 years ago

So, this is essentially a +1 for #139, which also depends on completing #95.

Would it be possible to turn the bytearray(os.path.getsize(filepath)) into a generator?

That is actually similar to what I have in mind for supporting streaming crypto, and I have a rough idea of how to implement it. I just need to make some time to sit down and figure out the details. I want to get this functionality in place for 0.5.0.

silverdaz commented 7 years ago

def chunker(stream, chunk_size=None):
    """Lazy function (generator) to read a stream one chunk at a time."""

    if not chunk_size:
        chunk_size = 1 << 26  # 67 MB (2**26 bytes)
    assert chunk_size >= 16
    yield chunk_size  # the first value yielded is the chunk size itself
    while True:
        data = stream.read(chunk_size)
        if not data:
            return  # no more data
        yield data

and then

from pgpy import PGPKey

def decrypt_engine(key, passphrase):
    '''Generator (coroutine) that receives blocks of ciphertext and yields the decrypted output (using PGPy).'''

    assert isinstance(key, PGPKey)
    print('Starting the (de)cipher engine')  # or log
    with key.unlock(passphrase):  # raises an exception on a wrong passphrase
        cipherchunk = yield
        while True:
            cipherchunk = yield key.decrypt(cipherchunk)

Now the chunker can feed the decrypt_engine...
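Something like this minimal driver, where stream is the opened ciphertext file and sink is whatever receives the plaintext (a file, a pipe, or the re-encryption engine); it assumes key.decrypt could operate on raw ciphertext chunks, which is exactly the streaming support being asked for here:

# Minimal driver sketch wiring the two generators above together.
engine = decrypt_engine(key, passphrase)
next(engine)                    # prime the coroutine up to its first `yield`

chunks = chunker(stream)
chunk_size = next(chunks)       # the first value yielded by chunker is the chunk size
for cipherchunk in chunks:
    sink.write(engine.send(cipherchunk))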

Is that what you have in mind? Does it help?

silverdaz commented 7 years ago

I have not read up on the PGP file format, so I don't know where to cut the stream or what chunks to send to the decrypt_engine. But you probably have that off the top of your head, right?

juhtornr commented 7 years ago

The problem in PGPy is that the entire message content is stored in memory using a bytearray. This simply doesn't work if the file is bigger than the memory you have. One possible solution is to use a memory map and let the kernel handle the memory allocation:

https://docs.python.org/3/library/mmap.html

The needed changes are very minor:

diff --git a/pgpy/types.py b/pgpy/types.py
index b4a3d71..c13c65b 100644
--- a/pgpy/types.py
+++ b/pgpy/types.py
@@ -13,6 +13,7 @@ import os
 import re
 import warnings
 import weakref
+import mmap

 from enum import EnumMeta
 from enum import IntEnum
@@ -185,8 +186,8 @@ class Armorable(six.with_metaclass(abc.ABCMeta)):
     def from_file(cls, filename):
         with open(filename, 'rb') as file:
             obj = cls()
-            data = bytearray(os.path.getsize(filename))
-            file.readinto(data)
+            m = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
+            data = bytearray(m)

         po = obj.parse(data)

With this it's possible to decrypt files as big as you want:

[tornroos:~/elixir/pgp]$ ls -lh big_test_file.gpg 
-rw-r--r--  1 tornroos  staff    32G Sep 18 14:59 big_test_file.gpg
[tornroos:~/elixir/pgp]$ cat decrypt.py 
import pgpy

TEST_FILE = 'big_test_file.gpg'
PRIVATE_KEY = 'private.key'
PASSPHRASE = 'foobar'

key, _ = pgpy.PGPKey.from_file(PRIVATE_KEY)

with key.unlock(PASSPHRASE):
    message = pgpy.PGPMessage.from_file(TEST_FILE)
    decrypted_message = key.decrypt(message).message.decode("utf-8")

[tornroos:~/elixir/pgp]$ /usr/bin/time -l /usr/local/bin/python3 decrypt.py 
      414.64 real        24.66 user       110.94 sys
7013019648  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  19225541  page reclaims
   8393470  page faults
         0  swaps
         1  block input operations
        20  block output operations
         0  messages sent
         0  messages received
         0  signals received
    103799  voluntary context switches
   1795588  involuntary context switches

Without the fix, this sample program simply dies when you run out of memory. Please merge this into the next release of PGPy.

Commod0re commented 7 years ago

I'm not merging that without testing it thoroughly, and I'm not convinced that change alone is actually useful for anyone, because all it does is move the point of running out of memory to sometime during packet parsing, rather than up front when the buffer is allocated.

I'd much rather fail fast in that case, because it's absolutely going to fail anyway if the message is actually that large.

Commod0re commented 7 years ago

The really tricky part of this is not so much "how do we read a file that is too big to fit into memory" but "how do we provide access to an encrypted blob that is too big to fit into memory, and still allow meaningful access to its contents in a way that does not simply result in blowing up memory at a later point rather than an earlier one".

silverdaz commented 6 years ago

Any progress on this issue? Has the streaming part, as discussed above, been implemented?

I would like to avoid writing my own library just for that particular case when there is already a lot of work done here!

Commod0re commented 6 years ago

I promise I'm working on this; my available time for working on PGPy is just limited right now, and I'm trying to wrap up the 0.4.4 bugfix release first.

nbarraille commented 6 years ago

Any news on this?

cedalexandre commented 4 years ago

Did you find a solution?

adiazma commented 3 years ago

Did you find a solution?

edison-orium commented 1 year ago

1 year later...

sgupta-keypath commented 1 year ago

Any progress on this issue?

erickeniuk commented 1 year ago

mmap might make the I/O more efficient and you could maybe squeeze in a larger file, but I'd imagine you'd still run out of memory.

Not sure if GPG allows files to be broken down into chunks and decrypted that way, but that seems like the ideal solution: some sort of automated chunking based on the key used.

kmbeyond commented 4 days ago

Hi friends, I am seeing this same error with 0.6.0 for a PGP-encrypted data file using an asymmetric key (and not just for big files). The PGP file looks good; I am able to decrypt it with the gpg --decrypt command.

Is there a workaround or solution in Python that can be used in an automated job?

This is the error I get for the command:

pgpy.PGPMessage.from_file( '/path/data_file.csv.pgp' )

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/types.py", line 195, in from_blob
    po = obj.parse(bytearray(blob, 'latin-1'))
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/pgp.py", line 1293, in parse
    self |= Packet(data)
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/pgp.py", line 1080, in __or__
    raise NotImplementedError(str(type(other)))
NotImplementedError: <class 'pgpy.packet.types.Opaque'>