silverdaz opened 7 years ago
So, this is essentially a +1 for #139, which also depends on completing #95.
Would it be possible to change the `bytearray(os.path.getsize(filepath))` into a generator?
That is actually similar to what I have in mind for supporting streaming crypto, and I have a rough idea of how to implement it. I just need to make some time to sit down and figure out the details. I want to get this functionality in place for 0.5.0
```python
def chunker(stream, chunk_size=None):
    """Lazy function (generator) to read a stream one chunk at a time."""
    if not chunk_size:
        chunk_size = 1 << 26  # 67 MB, i.e. 2**26 bytes
    assert chunk_size >= 16
    yield chunk_size
    while True:
        data = stream.read(chunk_size)
        if not data:
            return  # no more data
        yield data
```
and then:

```python
def decrypt_engine(key, passphrase):
    '''Generator that takes a block of ciphertext as input and yields the decrypted output (using PGP).'''
    assert isinstance(key, PGPKey)
    print('Starting the (de)cipher engine')  # or log
    with key.unlocked(passphrase):  # raises an exception on a wrong passphrase
        cipherchunk = yield
        while True:
            cipherchunk = yield key.decrypt(cipherchunk)
```
Now the chunker can feed the decrypt_engine...
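To show how the two generators would compose, here is a self-contained sketch of the driving loop. `upper_engine` and `pump` are made-up stand-ins for illustration (upper-casing each chunk instead of PGP-decrypting it); only the priming/`send` pattern is the point:

```python
import io

def chunker(stream, chunk_size=16):
    """Lazy generator that reads a stream one chunk at a time."""
    yield chunk_size          # first yield advertises the chunk size, as above
    while True:
        data = stream.read(chunk_size)
        if not data:
            return
        yield data

def upper_engine():
    """Stand-in for decrypt_engine: upper-cases each chunk it receives."""
    chunk = yield
    while True:
        chunk = yield chunk.upper()

def pump(stream, engine, sink):
    """Drive the pipeline: chunks from `stream` go through `engine` into `sink`."""
    chunks = chunker(stream)
    next(chunks)              # discard the advertised chunk size
    next(engine)              # prime the coroutine up to its first bare `yield`
    for chunk in chunks:
        sink.write(engine.send(chunk))

src = io.BytesIO(b"hello streaming world")
dst = io.BytesIO()
pump(src, upper_engine(), dst)
print(dst.getvalue())  # b'HELLO STREAMING WORLD'
```

With `upper_engine()` replaced by a real `decrypt_engine(key, passphrase)`, the same loop would stream ciphertext through the decrypter chunk by chunk.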
Is that what you have in mind? Does it help?
I have not read the PGP file format, so I don't know where to cut the stream or which chunks to send to the decrypt_engine. But you probably have that off the top of your head, right?
The problem in PGPy is that the entire message content is stored in memory using a bytearray. This simply doesn't work if the file is bigger than the available memory. One possible solution is to use a memory map and let the kernel handle memory allocation:
https://docs.python.org/3/library/mmap.html
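As a minimal sketch of what mmap gives you (`map_file` is just an illustrative helper, and the small temporary file stands in for a multi-GB message):

```python
import mmap
import os
import tempfile

def map_file(path):
    """Map an existing file read-only; length 0 maps the whole file.
    Pages are faulted in on demand, so nothing is copied up front."""
    with open(path, 'rb') as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Small illustration file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'A' * 1024)

m = map_file(path)
print(len(m), m[:4])  # 1024 b'AAAA'
# Note: bytearray(m) would still copy the whole mapping into memory.
m.close()
os.remove(path)
```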
The needed changes are very minor:
```diff
diff --git a/pgpy/types.py b/pgpy/types.py
index b4a3d71..c13c65b 100644
--- a/pgpy/types.py
+++ b/pgpy/types.py
@@ -13,6 +13,7 @@ import os
 import re
 import warnings
 import weakref
+import mmap
 
 from enum import EnumMeta
 from enum import IntEnum
@@ -185,8 +186,8 @@ class Armorable(six.with_metaclass(abc.ABCMeta)):
     def from_file(cls, filename):
         with open(filename, 'rb') as file:
             obj = cls()
-            data = bytearray(os.path.getsize(filename))
-            file.readinto(data)
+            m = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
+            data = bytearray(m)
             po = obj.parse(data)
```
With this it's possible to decrypt files as big as you want:
```console
[tornroos:~/elixir/pgp]$ ls -lh big_test_file.gpg
-rw-r--r--  1 tornroos  staff  32G Sep 18 14:59 big_test_file.gpg
[tornroos:~/elixir/pgp]$ cat decrypt.py
import pgpy
TEST_FILE = 'big_test_file.gpg'
PRIVATE_KEY = 'private.key'
PASSPHRASE = 'foobar'
key, _ = pgpy.PGPKey.from_file(PRIVATE_KEY)
with key.unlock(PASSPHRASE):
    message = pgpy.PGPMessage.from_file(TEST_FILE)
    decrypted_message = key.decrypt(message).message.decode("utf-8")
[tornroos:~/elixir/pgp]$ /usr/bin/time -l /usr/local/bin/python3 decrypt.py
      414.64 real        24.66 user       110.94 sys
7013019648  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  19225541  page reclaims
   8393470  page faults
         0  swaps
         1  block input operations
        20  block output operations
         0  messages sent
         0  messages received
         0  signals received
    103799  voluntary context switches
   1795588  involuntary context switches
```
Without the fix this sample program just dies when you run out of memory. Please merge this into the next release of PGPy.
I'm not merging that without testing it thoroughly, and I'm not convinced that change alone is actually useful for anyone, because all it does is move the point of running out of memory to sometime during packet parsing, rather than up front when the buffer is allocated.
I'd much rather fail fast in that case, because it's absolutely going to fail anyway if the message is actually that large.
The real tricky part of this is not so much "how do we read a file that is too big to fit into memory" but "how do we provide access to an encrypted blob that is too big to fit into memory, and still allow meaningful access to its contents in a way that does not simply result in blowing up memory at a later point rather than an earlier one"
Any progress on this issue? Is the streaming part, as suggested above, implemented?
I would like to avoid writing my own library just for this particular case when so much work has already been done here!
I promise I'm working on this, my available time for working on PGPy is just limited right now and I'm trying to wrap up the 0.4.4 bugfix release first.
Any news on this?
Did you find out a solution?
1 year later..
Any progress on this issue?
mmap might make the I/O more efficient, and you could maybe squeeze in a larger file, but I'd imagine you'd still run out of memory.
Not sure if GPG allows files to be broken into chunks and decrypted that way, but that seems like the ideal solution: some sort of automated chunking based on the key used.
Hi friends, I am seeing this same error with 0.6.0 for a PGP-encrypted data file using an asymmetric key (not just a big file). The PGP file looks fine; I am able to decrypt it with the gpg --decrypt command.
Is there a workaround or solution in Python that can be used in an automated job?
This is the error I get for the command:

```python
pgpy.PGPMessage.from_file('/path/data_file.csv.pgp')
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/types.py", line 195, in from_blob
    po = obj.parse(bytearray(blob, 'latin-1'))
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/pgp.py", line 1293, in parse
    self |= Packet(data)
  File "/xxx/python/venv2/lib/python3.9/site-packages/pgpy/pgp.py", line 1080, in __or__
    raise NotImplementedError(str(type(other)))
NotImplementedError: <class 'pgpy.packet.types.Opaque'>
```
Here is my current use case: huge files (GB or more) must be PGP-decrypted and immediately re-encrypted. Currently, I'm using subprocess to call GnuPG and pipe the output to my Python process for re-encryption.
GnuPG decrypts block by block, so what I receive in the pipe is sent to a generator for re-encryption.
I therefore stay within memory bounds, even for huge files, and decryption and re-encryption run in parallel.
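That pipeline can be sketched with a generator that streams a subprocess's stdout in fixed-size chunks. `stream_stdout` is a made-up helper, and a small Python one-liner stands in here for the actual `gpg --decrypt` invocation so the sketch is self-contained:

```python
import subprocess
import sys

def stream_stdout(cmd, chunk_size=1 << 16):
    """Yield a subprocess's stdout in chunks, never buffering the whole output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()

# Stand-in for ['gpg', '--decrypt', 'big_test_file.gpg']: any command that
# writes to stdout can be streamed the same way into a re-encryption step.
cmd = [sys.executable, '-c', "import sys; sys.stdout.write('x' * 100000)"]
total = sum(len(c) for c in stream_stdout(cmd, chunk_size=4096))
print(total)  # 100000
```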
As far as I understand, I can't do that with PGPy, since it uses a `bytearray` to store the message content internally. Moreover, in the case of huge files to decrypt, it blows up the memory. It's a show-stopper for me at this point. Would it be possible to change the `bytearray(os.path.getsize(filepath))` into a generator? We could then have `key.decrypt(msg, sink=None)`, where `sink` is either `None` or a file-like object or pipe: if `sink` is `None`, use the old bytearray; if not, use the file-like object or pipe with a read/write API. I'm not that new to Python, but the library is fairly complex, so you would have better chances than me at fixing this issue.
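The proposed sink dispatch could look roughly like this. `decrypt_stream` and `_transform` are hypothetical names, with `bytes.upper` standing in for the real per-chunk PGP decryption; this is the shape of the proposal, not actual PGPy API:

```python
import io

def decrypt_stream(source, sink=None, chunk_size=1 << 16, _transform=bytes.upper):
    """Hypothetical sketch: if `sink` is None, accumulate the output in a
    bytearray (the current behaviour); otherwise write each decrypted chunk
    to the file-like `sink` and keep nothing in memory."""
    out = bytearray() if sink is None else None
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        plain = _transform(chunk)   # stand-in for real PGP decryption
        if sink is None:
            out += plain
        else:
            sink.write(plain)
    return bytes(out) if sink is None else None

print(decrypt_stream(io.BytesIO(b'abc')))       # b'ABC' (in-memory path)
dst = io.BytesIO()
decrypt_stream(io.BytesIO(b'abc'), sink=dst)    # streaming path
print(dst.getvalue())                           # b'ABC'
```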