Provide a mechanism to encrypt/decrypt the Entry values

broadstack-au commented 1 year ago

To help ensure we can store PII in datatables and preserve the standard schema/value validation processes, we need a mechanism that can either;

encrypt/decrypt specific fields based on a provided key & algorithm
encrypt/decrypt all fields based on a provided key & algorithm

Currently, encryption is managed externally, (via blob fields). Biggest downside is the offloading of validation to the application providing/receiving the data.

We must not own/control

The key
The mechanism itself

We should provide options to

Encrypt all values
Encrypt some values

The system should know

Which of the above options is used and act accordingly

Goal

Allow native interactions with schemas and entries whilst ensuring the data is decrypted / encrypted appropriately when marshalled/unmarshalling

Considerations

~~Encryption is expensive - consider whether including ?threading/aio/other? would be sensible.~~
Implement as extras in the build

broadstack-au commented 1 year ago

Use: https://cryptography.io/en/latest/ It'd be ideal to have a specific mechanism for key rotation Will need to apply a size limit to the values, 16Kb should be fine as an initial setting

aaam1t commented 1 year ago

Plan for encryption/decryption mechanism

Add encrypt/decrypt functions to Entry class using Cryptography - Fernet
- Key is stored separately -> passed in as argument
- Two options, using *args:
  - No args -> all fields
  - Pass in fields as args -> encrypt given fields
  - for field_name, value in self.values.items(): self.values[field_name] = encrypt_value(self.key, value)
- Will cause issues with validation though
  - Data is only stored in datatables temporarily, so we can keep unencrypted data in datatables before marshalling
  - Only encrypt when marshalling
  - Validate (+ 16k size limit) -> encrypt -> marshal
  - Note: Marshmallow structures and Mongo(?) docs would have to change to accept encrypted token (base64url string) as data type
Cryptography has native support for key rotation (Multifernet)
- Built-in key rotation method
- Upon encryption, keep track of which fields have been encrypted -> rotate these fields
  - Either add another field to record which are encrypted
  - Or store as blob instead of text and assume all blob fields contain encrypted data
    - Could cause issues with storing other types of data

Blockers/points requiring further info

Unsure about Setuptools and implementing as an extra, will need to examine docs
Not worked with threading/aio before (especially w/ Python), will definitely look into it
- From what i've read, async writes are generally not recommended -> probably shying away from aio for now

broadstack-au commented 1 year ago

Not worked with threading/aio before (especially w/ Python), will definitely look into it

On consideration - It probably makes more sense to leave that detail up to the implementing app - if they need to offload certain tasks, then they can work out how to do it.

broadstack-au commented 1 year ago

We need to look at an asymmetric solution first. If this lib is used in an isolated multi-tenanted environment, we want to be sure we have a way to keep the data flow one-way (this lib encrypts, the implementing system decrypts).

I think it'll be relatively straight forward to have multiple options for how the encryption is handled. If we introduce SchemaFieldFormat.encrypted() we could map its _field to a class which encapsulates the "original" marshmallow field and can deal with the encrypted data.

Something like....

class SchemaFieldFormat:
...
  encrypted: typing.Optional[str] = field(default=False)

  _get_mapped_fieldclass(self) -> typing.Callable:
    _mapped_type = SCHEMAFIELD_MAP[self.type]
    if not self.encrypted:
      return _mapped_type()
    return EncryptedSchemaField(type=_mapped_type, method=self.encrypted)

So the validated flow would be [input] -> EncryptedSchemaField.validate() -> OriginalField.validate() Serializing the data would be [value] -> EncryptedSchemaField._serialize() -> something.encrypt(value).base64encode()

meaning each field value would encrypted, or not, without having to change the storage mechanism.

Need to work out where the public key/secret would get injected/stored.

aaam1t commented 1 year ago

Ah I see, so we're only really concerned with data going from some interface to persistent storage for now - in which case an asymmetric solution makes sense. In that case, I assume it would be up to the implementing system to decide how/where to store the private key. The public key would be handled by this lib though, but since it is public, it can be stored anywhere really.

Additionally, if we are not concerned with decryption, than I guess we don't have to worry about determining whether a value pulled from the database is encrypted or not. I assume the implementing system would be able to do that by knowing that certain fields will inherently be encrypted by way of the nature of their data (e.g. we know phone will be encrypted while name will not).

I realise now that I had a misunderstanding and there is no need to alter the storage mechanism as formatting only really affects data validation.

I am however a little confused about the EncryptedSchemaField class you suggested. To my understanding, the process of serializing data involves using the export() function of Entry and collating each export into a List. Then use schema_to_marshmallow() to create a Marshmallow schema based off of the entry's schemata. Then use the dump() method of the Marshmallow schema to serialize the list of exported entries.

If this is the case, and also assuming we only encrypt the data when it is serialized, would this mean encryption should be handled by Entry's export() method?

broadstack-au commented 1 year ago

Ah I see, so we're only really concerned with data going from some interface to persistent storage for now - in which case an asymmetric solution makes sense. In that case, I assume it would be up to the implementing system to decide how/where to store the private key. The public key would be handled by this lib though, but since it is public, it can be stored anywhere really.

Yeah, I think that's about right. So long as we're providing the mechanism, and understand that the field becomes useless after serialization (because we won't know what's in it any more), then the rest is up to "the other thing" to deal with.

They want to change the field value? Sure - give us a new value to replace it with. In this context, we just need to treat the field appropriately (validate it on entry, and just honour the existing value from that point).

I am however a little confused about the EncryptedSchemaField class you suggested. To my understanding, the process of serializing data involves using the export() function of Entry and collating each export into a List. Then use schema_to_marshmallow() to create a Marshmallow schema based off of the entry's schemata. Then use the dump() method of the Marshmallow schema to serialize the list of exported entries.

If this is the case, and also assuming we only encrypt the data when it is serialized, would this mean encryption should be handled by Entry's export() method?

Fair Q. The usage examples only show the entry using the schema as a validation tool, so there's no real scope for the field itself to change the values. If we change the way Schema.process_values works, so that it actually processes the values and returns the dump() output back, then the EncryptedSchemaField could do its work and then Entry can remain dumb (which I'd prefer - it's just a storage class really).

To clarify a key goal - We need to ensure the value is encrypted for any particular entry value (i.e. regardless of the storage mechanism). Which means we need to encrypt for dump(), yes, but I'd also like the system to accomodate loading the field back in, identifying it as an encrypted value and just leave it as is (if possible? IDK? Can possibly just be lazy and say "if b64encoded, already encrypted").

The purpose of the EncryptedSchemaField would then be;

if the input value is already encrypted, we can trust that prior validation has already occurred, and we just pass the value through as is (it becomes a literal pass-through field, data in / data out).
If it's not, we need to encrypt it, but first we need to validate the input. This is the point of having a field type that extends an existing field. An EncryptedSchemaField that extends fields.Email would allow the input data to be validated as submitted with minimal effort, whilst also ensuring we can guarantee that encryption will happen.

aaam1t commented 1 year ago

Ah ok, this is becoming much clearer now - that seems quite logical.

To clarify, each Entry would hold a set of values, each aligning with a particular SchemaField/EncryptedSchemaField, which would be responsible for validating each value, instead of using the Marshmallow schema for validation. The schema would be able to validate that an Entry adheres to schema-specific rules (i.e. required fields and whatnot). I assume since Schema.process_values will handle dump()-ing, SchemaField/EncryptedSchemaField would be used to simply pull the value from, where SchemaField will return the value as-is and EncryptedSchemaField would return the value encrypted.

At least this makes sense to me, and I imagine it should be fairly simple to implement (lol).

broadstack-au commented 1 year ago

At least this makes sense to me, and I imagine it should be fairly simple to implement (lol).

Hahaha... now you've doomed yourself 😄

Next steps - now that you have a better idea of the what and how, can you write it up here in plain english as a set of steps that you'd follow to implement it as if it was a test like this?

Something along the lines of this (but with more words / steps / specificity / etc)

 # add field to <model> to store <key thingy>
 # tell the field it's holding "email, encrypted"
 # check validation
 # perform serialisation and make sure we get encrypted value
 # ...etc

aaam1t commented 1 year ago

Hahaha... now you've doomed yourself 😄

😬

Apologies for the delays, uni's back in full swing again so I've been a bit preoccupied - anyways here's the implementation plan:

schema.py:

# import cryptography, rsa
# add encrypted flag to SchemaField
# add variable to hold encrypted value
# add method to SchemaField produce ciphertext from _value using public key upon initialisation/set if encrypted == true
# add step in initialisation to check if value is encrypted (is b64encoded?) --> set encryptedvalue variable
# modify export to return encrypted value if encrypted == true

this could also be done with a separate EncryptedSchemaField class instead of adding flags to the existing SchemaField, but I felt this was easier
we also have to modify Schema.process_values()

# for each value in the dict set the value of respective field and export to get data to be serialised (encrypted will return ciphertext)
# collate the exported values into a new dict
# use the Marshmallow schema to serialise the new dict

(haven't decided exactly how the public key will be stored - maybe just add a field to Schema? idk)
since each SchemaField in a Schema has a SchemaFieldFormat attribute which in turn has a Marshmallow field attribute, validation should already be sorted by the Marshmallow field

test:

# create persons schema
# add field for name
# add field field for email with encrypted=true
# create persons table and add persons schema - not necessarily needed for this test but we'll do it anyway for completeness

# create a person entry with invalid name and email
# validate the entry vales
# check the two errors are present after validation

# modify entry to have valid values
# validate entry values again
# check there are no errors after validation

# process the person entry (Schema.process_values() - will serialise)
# check email field contains ciphertext and not plain-text email

and we can also check that it works the other way

# load the serialised data from earlier into a SchemaField object
# verify the encryptedvalue variable contains the encrypted ciphertext

broadstack-au commented 1 year ago

(haven't decided exactly how the public key will be stored - maybe just add a field to Schema? idk)

Probably a combo of table and schema. The table works to corral the entries into an organised structure, the schema provides the structure for the entry. Tables can have more than one schema linked in, as can entries.

So, based on your example, maybe this would work...

table.encryption = {
    type: "rsa", key: "blahblahpublickeyblahblah"}
}
table.schemas = [
  Schema(fields: [{
    name: "PII",
    encrypted: true
  }])
]

...then...

entry: Entry = table.new_entry() # helper func so the entry conforms to table and doesn't need individual schema defs
entry.values = {
  "PII": "This is my address"
}
table.process_entry(entry)

and the table can push its key into the schema for encryption purposes, so that

class Table:
...
  def process_entry(entry: Entry):
    _merged_schema = self.get_merged_schemata()
    _merged_schema.validate(entry.values)
    _merged_schema.process_values(entry.values) # magic happens here

results in entry.values['pii'] being encrypted by the field or schema - tbd.

That would keep it flexible enough that the implementor could keep the key in table, and/or use it directly with the schema.

broadstack-com-au / bstk-datatables