Open Yancey1989 opened 7 years ago
This is a security issue. Here are several things to consider:
reader()
. Maybe denying internet access from nodes/pods is a good idea.
reader()
to save the original data directly to the storage space. We can avoid this in the following ways:
reader()
It seems that even with network control in place, the following potential leakage path remains:
How could we prevent data leakage along the above pipeline?
As mentioned above in No.2:
Encrypt public datasets and store the key in a secure place on the cloud that can only be read by reader()
More details:
Users must use a special reader to pass public datasets to the trainer:
...
trainer.train(reader=paddle.datasets.public.sample.train())
# Users can also use part of the feature columns or filters to get a reader:
trainer.train(reader=paddle.datasets.public.sample.train(fields=[3,4,5], filter=some_func))
...
This reader returns encrypted data, which is decrypted by DataProviderConverter or on the C++ side. We need to implement an encryption tool to encrypt the data and upload it to the cloud, and then implement the decryption functions. The decryption step must not be accessible to users.
Store the encryption key in Kubernetes secret storage and keep it confidential.
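To make the proposal above more concrete, here is a minimal sketch of what such a reader could look like. Everything in it is an assumption for illustration: `make_public_reader`, `load_key`, and `toy_xor` are hypothetical names, not part of Paddle, and the XOR transform is only a stand-in for a real cipher (it is not secure). The key is read from a file because Kubernetes can mount a secret into a pod as a file. In the actual design the decryption would live in DataProviderConverter or on the C++ side rather than in user-visible Python.

```python
import json
import os
import tempfile
from itertools import cycle


def load_key(path):
    """Read raw key bytes from a file, e.g. a Kubernetes secret mounted as a volume."""
    with open(path, "rb") as f:
        return f.read()


def toy_xor(key, data):
    """Toy stand-in for a real cipher (NOT secure). XOR is its own inverse."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def make_public_reader(encrypted_records, key, fields=None, filter=None):
    """Return a reader: a zero-argument callable that yields decrypted samples.

    fields: optional list of column indices to keep.
    filter: optional predicate applied to the full sample before projection
            (named to mirror the keyword used in the issue; it shadows the builtin).
    """
    def reader():
        for blob in encrypted_records:
            sample = json.loads(toy_xor(key, blob).decode("utf-8"))
            if filter is not None and not filter(sample):
                continue
            yield [sample[i] for i in fields] if fields is not None else sample
    return reader


# Demo: a temporary file plays the role of the mounted secret (hypothetical path).
with tempfile.TemporaryDirectory() as secret_dir:
    key_path = os.path.join(secret_dir, "dataset-key")
    with open(key_path, "wb") as f:
        f.write(b"not-a-real-key")
    key = load_key(key_path)

rows = [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]
encrypted = [toy_xor(key, json.dumps(r).encode("utf-8")) for r in rows]

reader = make_public_reader(encrypted, key, fields=[3, 4, 5],
                            filter=lambda s: s[0] > 5)
result = list(reader())
print(result)
```

The reader only ever hands the selected columns of the matching rows to the caller, which mirrors the `fields` and `filter` arguments in the example above.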
Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.
Good point. If we allow the user to use a custom Paddle binary, he/she can always print the decrypted data. @typhoonzero and I discussed this question yesterday; disallowing custom Paddle binaries and custom runtime Docker images may be a good choice.
PaddleCloud provides some public datasets for developers.
How to Use
We can install a cluster_dataset Python package in the runtime Docker image, and use it as:
How to block data leakage
Because developers can upload a trainer Python package to PaddleCloud, I think the most effective way to block data leakage is to block all connections from Kubernetes nodes to the external Internet.
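One possible way to express this at the pod level is a Kubernetes NetworkPolicy that selects every pod in a namespace and allows no egress. The sketch below builds such a manifest as a Python dict and serializes it to JSON, which `kubectl apply -f` accepts. The policy name and the namespace `paddlecloud-jobs` are hypothetical. Note the assumptions: NetworkPolicy applies to pods rather than nodes, it is only enforced when the cluster's network plugin supports it, and denying all egress also blocks DNS lookups from those pods.

```python
import json


def deny_all_egress_policy(namespace):
    """Build a NetworkPolicy manifest that denies all egress for every pod in `namespace`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "deny-all-egress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects all pods in the namespace.
            "podSelector": {},
            # Listing "Egress" without any egress rules means no outbound traffic is allowed.
            "policyTypes": ["Egress"],
        },
    }


# "paddlecloud-jobs" is a hypothetical namespace used for illustration only.
manifest = deny_all_egress_policy("paddlecloud-jobs")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Blocking traffic at the node level, as the comment suggests, would additionally need firewall or cloud network rules outside Kubernetes.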