broadstack-au opened this issue 1 year ago
- Use: https://cryptography.io/en/latest/
- It'd be ideal to have a specific mechanism for key rotation
- We will need to apply a size limit to the values; 16 KB should be fine as an initial setting
Add encrypt/decrypt functions to the `Entry` class using cryptography's Fernet:
- The key is stored separately -> passed in as an argument
- Two options, using `*args`:
  - No args -> encrypt all fields
  - Pass in field names as args -> encrypt only the given fields
```python
for field_name, value in self.values.items():
    self.values[field_name] = encrypt_value(self.key, value)
```
This will cause issues with validation, though.
Data is only stored in datatables temporarily, so we can keep unencrypted data in datatables before marshalling:
- Only encrypt when marshalling
- Validate (+ 16 KB size limit) -> encrypt -> marshal
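As a rough illustration, here's a minimal sketch of what an `encrypt_value` helper following that flow could look like. The helper name comes from the loop above and the 16 KB limit from these notes; the error type and encodings are assumptions.

```python
from cryptography.fernet import Fernet

MAX_VALUE_BYTES = 16 * 1024  # initial 16 KB limit suggested above

def encrypt_value(key: bytes, value: str) -> str:
    # Hypothetical helper matching the loop sketched earlier.
    raw = value.encode("utf-8")
    if len(raw) > MAX_VALUE_BYTES:
        raise ValueError("value exceeds the 16 KB size limit")
    # Fernet tokens are already urlsafe-base64 strings.
    return Fernet(key).encrypt(raw).decode("ascii")
```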
Note: Marshmallow structures and Mongo(?) docs would have to change to accept encrypted token (base64url string) as data type
Cryptography has native support for key rotation (`MultiFernet`).
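For reference, a minimal sketch of rotation with `MultiFernet` (key names are illustrative):

```python
from cryptography.fernet import Fernet, MultiFernet

old_key, new_key = Fernet.generate_key(), Fernet.generate_key()
# Encryption always uses the first key; decryption tries each in turn.
mf = MultiFernet([Fernet(new_key), Fernet(old_key)])

token = Fernet(old_key).encrypt(b"some PII")
rotated = mf.rotate(token)  # re-encrypts the old token under new_key
```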
Use `blob` instead of `text` and assume all `blob` fields contain encrypted data.
Unsure about Setuptools and implementing this as an extra; will need to examine the docs.
Not worked with threading/aio before (especially w/ Python), will definitely look into it
> Not worked with threading/aio before (especially w/ Python), will definitely look into it
On consideration - It probably makes more sense to leave that detail up to the implementing app - if they need to offload certain tasks, then they can work out how to do it.
We need to look at an asymmetric solution first. If this lib is used in an isolated multi-tenanted environment, we want to be sure we have a way to keep the data flow one-way (this lib encrypts, the implementing system decrypts).
I think it'll be relatively straightforward to have multiple options for how the encryption is handled.
If we introduce `SchemaFieldFormat.encrypted`, we could map its `_field` to a class which encapsulates the "original" marshmallow field and can deal with the encrypted data.
Something like....
```python
class SchemaFieldFormat:
    ...
    encrypted: typing.Optional[str] = field(default=None)

    def _get_mapped_fieldclass(self) -> typing.Callable:
        _mapped_type = SCHEMAFIELD_MAP[self.type]
        if not self.encrypted:
            return _mapped_type()
        return EncryptedSchemaField(type=_mapped_type, method=self.encrypted)
```
So the validated flow would be
[input] -> EncryptedSchemaField.validate() -> OriginalField.validate()
Serializing the data would be
[value] -> EncryptedSchemaField._serialize() -> something.encrypt(value).base64encode()
meaning each field value would be encrypted, or not, without having to change the storage mechanism.
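To make that concrete, here is a minimal sketch of what such a wrapper could look like. The class name comes from this discussion; the constructor shape, the injected `encrypt` callable, and the encoding details are all assumptions:

```python
import base64
import typing

from marshmallow import fields


class EncryptedSchemaField(fields.Field):
    """Wraps an "original" marshmallow field; validates via the wrapped
    field and only encrypts on serialization."""

    def __init__(self, inner: fields.Field,
                 encrypt: typing.Callable[[bytes], bytes], **kwargs):
        super().__init__(**kwargs)
        self.inner = inner      # the "original" field, e.g. fields.Email()
        self.encrypt = encrypt  # hypothetical injected encryption callable

    def _deserialize(self, value, attr, data, **kwargs):
        # [input] -> EncryptedSchemaField.validate() -> OriginalField.validate()
        return self.inner.deserialize(value, attr, data, **kwargs)

    def _serialize(self, value, attr, obj, **kwargs):
        # [value] -> EncryptedSchemaField._serialize() -> encrypt -> base64
        plaintext = self.inner._serialize(value, attr, obj, **kwargs)
        token = self.encrypt(str(plaintext).encode("utf-8"))
        return base64.urlsafe_b64encode(token).decode("ascii")
```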
Need to work out where the public key/secret would get injected/stored.
Ah I see, so we're only really concerned with data going from some interface to persistent storage for now - in which case an asymmetric solution makes sense. In that case, I assume it would be up to the implementing system to decide how/where to store the private key. The public key would be handled by this lib though, but since it is public, it can be stored anywhere really.
Additionally, if we are not concerned with decryption, then I guess we don't have to worry about determining whether a value pulled from the database is encrypted or not. I assume the implementing system would be able to do that by knowing that certain fields will inherently be encrypted by way of the nature of their data (e.g. we know phone will be encrypted while name will not).
I realise now that I had a misunderstanding and there is no need to alter the storage mechanism as formatting only really affects data validation.
I am however a little confused about the `EncryptedSchemaField` class you suggested. To my understanding, the process of serializing data involves using the `export()` function of `Entry` and collating each export into a `List`. Then use `schema_to_marshmallow()` to create a Marshmallow schema based off of the entry's schemata. Then use the `dump()` method of the Marshmallow schema to serialize the list of exported entries.

If this is the case, and also assuming we only encrypt the data when it is serialized, would this mean encryption should be handled by `Entry`'s `export()` method?
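In code terms, my understanding of that pipeline is roughly the following (function names are from this discussion; the exact signatures are guesses):

```python
# Export each entry and collate the exports into a list.
exported = [entry.export() for entry in entries]

# Build a Marshmallow schema from the entry's schemata.
PersonSchema = schema_to_marshmallow(schema)

# Serialize the collated exports with the generated schema.
serialized = PersonSchema(many=True).dump(exported)
```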
> Ah I see, so we're only really concerned with data going from some interface to persistent storage for now - in which case an asymmetric solution makes sense. In that case, I assume it would be up to the implementing system to decide how/where to store the private key. The public key would be handled by this lib though, but since it is public, it can be stored anywhere really.
Yeah, I think that's about right. So long as we're providing the mechanism, and understand that the field becomes useless after serialization (because we won't know what's in it any more), then the rest is up to "the other thing" to deal with.
They want to change the field value? Sure - give us a new value to replace it with. In this context, we just need to treat the field appropriately (validate it on entry, and just honour the existing value from that point).
> I am however a little confused about the `EncryptedSchemaField` class you suggested. To my understanding, the process of serializing data involves using the `export()` function of `Entry` and collating each export into a `List`. Then use `schema_to_marshmallow()` to create a Marshmallow schema based off of the entry's schemata. Then use the `dump()` method of the Marshmallow schema to serialize the list of exported entries.
>
> If this is the case, and also assuming we only encrypt the data when it is serialized, would this mean encryption should be handled by `Entry`'s `export()` method?
Fair Q. The usage examples only show the entry using the schema as a validation tool, so there's no real scope for the field itself to change the values. If we change the way `Schema.process_values` works, so that it actually processes the values and returns the `dump()` output back, then the `EncryptedSchemaField` could do its work and `Entry` can remain dumb (which I'd prefer - it's just a storage class really).

To clarify a key goal - we need to ensure the value is encrypted for any particular entry value (i.e. regardless of the storage mechanism). Which means we need to encrypt for `dump()`, yes, but I'd also like the system to accommodate loading the field back in, identifying it as an encrypted value, and just leaving it as is (if possible? IDK? Can possibly just be lazy and say "if b64encoded, already encrypted").
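If we do go the lazy route, the check could be as simple as the sketch below. This is a hypothetical helper; strict base64 validation still can't distinguish encrypted data from any other base64 payload, so it's a heuristic at best.

```python
import base64
import binascii


def looks_encrypted(value: str) -> bool:
    # Lazy heuristic: "if b64encoded, already encrypted".
    try:
        base64.b64decode(value, altchars=b"-_", validate=True)
        return True
    except (binascii.Error, ValueError):
        return False
```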
The purpose of the `EncryptedSchemaField` would then be twofold: an `EncryptedSchemaField` that extends `fields.Email` would allow the input data to be validated as submitted with minimal effort, whilst also ensuring we can guarantee that encryption will happen.

Ah ok, this is becoming much clearer now - that seems quite logical.
To clarify, each `Entry` would hold a set of values, each aligning with a particular `SchemaField`/`EncryptedSchemaField`, which would be responsible for validating each value, instead of using the Marshmallow schema for validation. The schema would be able to validate that an `Entry` adheres to schema-specific rules (i.e. required fields and whatnot). I assume since `Schema.process_values` will handle `dump()`-ing, `SchemaField`/`EncryptedSchemaField` would simply be where the value is pulled from, where `SchemaField` will return the value as-is and `EncryptedSchemaField` would return the value encrypted.
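A hypothetical shape for that division of labour (the class names come from this thread; the methods and attributes are assumed):

```python
import typing


class SchemaField:
    def __init__(self, value: typing.Any = None):
        self._value = value

    def export(self) -> typing.Any:
        # Plain field: return the value as-is.
        return self._value


class EncryptedSchemaField(SchemaField):
    def __init__(self, value: typing.Any = None,
                 encrypt: typing.Callable[[str], str] = None):
        super().__init__(value)
        self._encrypt = encrypt  # hypothetical injected encryption callable

    def export(self) -> str:
        # Encrypted field: return the ciphertext instead.
        return self._encrypt(self._value)
```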
At least this makes sense to me, and I imagine it should be fairly simple to implement (lol).
> At least this makes sense to me, and I imagine it should be fairly simple to implement (lol).
Hahaha... now you've doomed yourself 😄
Next steps - now that you have a better idea of the what and how, can you write it up here in plain English as a set of steps that you'd follow to implement it, as if it was a test like this?
Something along the lines of this (but with more words / steps / specificity / etc)
```python
# add field to <model> to store <key thingy>
# tell the field it's holding "email, encrypted"
# check validation
# perform serialisation and make sure we get an encrypted value
# ...etc
```
> Hahaha... now you've doomed yourself 😄
😬
Apologies for the delays, uni's back in full swing again so I've been a bit preoccupied - anyway, here's the implementation plan:
```python
# import cryptography, rsa
# add encrypted flag to SchemaField
# add variable to hold the encrypted value
# add method to SchemaField to produce ciphertext from _value using the public key upon initialisation/set if encrypted == true
# add step in initialisation to check if value is encrypted (is b64encoded?) --> set encryptedvalue variable
# modify export to return the encrypted value if encrypted == true
# for each value in the dict, set the value of the respective field and export to get data to be serialised (encrypted fields will return ciphertext)
# collate the exported values into a new dict
# use the Marshmallow schema to serialise the new dict
# create persons schema
# add field for name
# add field for email with encrypted=true
# create persons table and add persons schema - not necessarily needed for this test but we'll do it anyway for completeness
# create a person entry with invalid name and email
# validate the entry values
# check the two errors are present after validation
# modify entry to have valid values
# validate entry values again
# check there are no errors after validation
# process the person entry (Schema.process_values() - will serialise)
# check email field contains ciphertext and not the plain-text email
# load the serialised data from earlier into a SchemaField object
# verify the encryptedvalue variable contains the encrypted ciphertext
```
- (haven't decided exactly how the public key will be stored - maybe just add a field to Schema? idk)
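Since the plan leans on an RSA public key, here is a minimal sketch of producing ciphertext from a value with the cryptography library. This is an assumption about the approach rather than a settled design; note also that plain RSA-OAEP can only encrypt payloads smaller than the key size, so larger values would need a hybrid scheme.

```python
import base64

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Stand-in key pair; in practice the implementing system holds the private key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Encrypt a field value with the public key (one-way from this lib's view).
ciphertext = public_key.encrypt(
    b"person@example.com",
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None,
    ),
)
token = base64.urlsafe_b64encode(ciphertext).decode("ascii")  # b64encoded value
```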
Probably a combo of table and schema. The table works to corral the entries into an organised structure, the schema provides the structure for the entry. Tables can have more than one schema linked in, as can entries.
So, based on your example, maybe this would work...
```python
table.encryption = {
    "type": "rsa",
    "key": "blahblahpublickeyblahblah",
}
table.schemas = [
    Schema(fields=[{
        "name": "PII",
        "encrypted": True,
    }])
]
```
...then...
```python
entry: Entry = table.new_entry()  # helper func so the entry conforms to table and doesn't need individual schema defs
entry.values = {
    "PII": "This is my address",
}
table.process_entry(entry)
```
and the table can push its key into the schema for encryption purposes, so that
```python
class Table:
    ...
    def process_entry(self, entry: Entry):
        _merged_schema = self.get_merged_schemata()
        _merged_schema.validate(entry.values)
        _merged_schema.process_values(entry.values)  # magic happens here
```
results in `entry.values['PII']` being encrypted by the field or the schema - TBD.
That would keep it flexible enough that the implementor could keep the key in table, and/or use it directly with the schema.
Goal

To help ensure we can store PII in datatables and preserve the standard schema/value validation processes, we need a mechanism that can either encrypt the values within this lib, or continue to hand that responsibility to the implementing application.

Currently, encryption is managed externally (via `blob` fields). The biggest downside is the offloading of validation to the application providing/receiving the data.

Considerations

- We must not own/control the means to decrypt the data (this lib encrypts; the implementing system decrypts)
- We should provide options for how the encryption is handled
- The system should know which fields contain encrypted data
- Encryption is expensive - consider whether including ?threading/aio/other? would be sensible
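If an implementing app did want to offload that cost (per the earlier comment, this lib probably shouldn't do it itself), one possible shape is sketched below. Whether threads actually help depends on whether the crypto backend releases the GIL, so treat this as a sketch rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor

from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())
values = {"name": b"Jane Citizen", "phone": b"0400 000 000"}

# Encrypt each field value on a worker thread.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fernet.encrypt, value)
               for name, value in values.items()}
    encrypted = {name: future.result() for name, future in futures.items()}
```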