PaddlePaddle / PaddleCloud

PaddlePaddle Docker images and K8s operators for PaddleOCR/Detection developers to use on public/private cloud.
Apache License 2.0

How to use the public dataset #125

Open Yancey1989 opened 7 years ago

Yancey1989 commented 7 years ago

PaddleCloud provides some public datasets for developers.

How to use it

We can install a cluster_dataset Python package in the runtime Docker image and use it like this:

from paddle.cloud.cluster_dataset import mnist
...

trainer.train(reader=mnist.train(), ...)
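
For illustration, here is a minimal sketch of what such a cluster_dataset reader module could look like, following the paddle.v2 reader convention (a creator function returning a generator). The mount path /pfs/public/dataset/mnist and the pickled file layout are assumptions, not the real package layout.

import os
import pickle

# Hypothetical mount point of the public dataset inside the runtime Docker image.
DATA_DIR = "/pfs/public/dataset/mnist"

def train():
    """Return a reader that yields (image, label) samples from the public dataset."""
    def reader():
        train_dir = os.path.join(DATA_DIR, "train")
        # Assume each file holds a pickled list of (image, label) pairs.
        for fname in sorted(os.listdir(train_dir)):
            with open(os.path.join(train_dir, fname), "rb") as f:
                for image, label in pickle.load(f):
                    yield image, label
    return reader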

How to prevent data leakage

Because developers can upload a trainer Python package to PaddleCloud, I think the most effective way to prevent data leakage is to block all connections from the Kubernetes nodes to the external internet.
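
As a concrete sketch of that idea, the cluster admin could install a default-deny egress NetworkPolicy for trainer pods, created here with the kubernetes Python client; the namespace, pod label, and internal CIDR below are assumptions, and enforcement requires a network plugin that supports NetworkPolicy (e.g. Calico).

from kubernetes import client, config

config.load_kube_config()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="deny-external-egress"),
    spec=client.V1NetworkPolicySpec(
        # Select every trainer pod in the namespace (label is hypothetical).
        pod_selector=client.V1LabelSelector(match_labels={"app": "paddle-trainer"}),
        policy_types=["Egress"],
        # Only allow egress to the in-cluster network; everything else is dropped.
        egress=[client.V1NetworkPolicyEgressRule(to=[
            client.V1NetworkPolicyPeer(ip_block=client.V1IPBlock(cidr="10.0.0.0/8")),
        ])],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="paddlecloud", body=policy)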

typhoonzero commented 7 years ago

This is a security issue; here are several things to consider:

  1. Can nodes/pods access the internet? Users may need to download dependencies or public data, but they would also be able to upload public datasets to other places by injecting code into reader(). Maybe no internet access from nodes/pods is a good idea.
  2. Users can save trained output models to their own cloud storage space and then download the models. This is another possible vulnerability: users can inject code into reader() to save the original data directly to that storage space. We can mitigate this in the following ways:
    1. Validate the user-uploaded program to find such injected code (see the sketch after this list).
    2. Encrypt public datasets and store the key in a secure place on the cloud which can only be read by reader().
  3. We need to know whether such an attack occurred. This may be really hard; we could start by inspecting network bandwidth to infer unusual traffic.
  4. About network policies: we don't have network policies currently, so one user can sniff around the network or connect to any open port in the whole cluster. This may not lead to data leakage directly, but it is still a problem.
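
For point 2.1, a minimal sketch of such a validation pass, assuming we statically scan the uploaded trainer source with Python's ast module and reject obviously suspicious network/upload calls. The blacklist below is only illustrative, and a static check like this is easy to bypass, so it can only be a first filter.

import ast

# Illustrative blacklist; a real validator would need a much more careful policy.
SUSPICIOUS_CALLS = {"urlopen", "socket", "connect", "put_object", "upload"}

def find_suspicious_calls(source):
    """Return the names of call expressions that look like network or upload APIs."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = getattr(node.func, "attr", None) or getattr(node.func, "id", None)
            if name in SUSPICIOUS_CALLS:
                hits.append(name)
    return hits

# Example: reject the submission if anything suspicious is found.
with open("trainer.py") as f:
    assert not find_suspicious_calls(f.read()), "uploaded trainer looks suspicious"
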
wangkuiyi commented 7 years ago

It seems that even after adding network controls, the following potential leakage path remains:

  1. User programs should be able to read the data,
  2. User programs should be able to write data to CephFS, and
  3. Users are allowed to upload and download data from CephFS.

How could we prevent data leakage along the above pipeline?

typhoonzero commented 7 years ago

As mentioned in item 2 above:

Encrypt public datasets and store the key in a secure place on cloud which can only be read by reader()

More details:

Users must use a special reader to pass public datasets to the trainer:

...
trainer.train(reader=paddle.datasets.public.sample.train())
# Users can also use part of the feature columns or filters to get a reader:
trainer.train(reader=paddle.datasets.public.sample.train(fields=[3,4,5], filter=some_func))
...

This reader returns encrypted data, which is decrypted by DataProviderConverter or on the C++ side. We need to implement an encryption tool to encrypt the data and upload it to the cloud, and then implement decryption functions that cannot be accessed by users.
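
A minimal sketch of this encrypt-on-upload / decrypt-in-reader flow, using symmetric encryption from the cryptography package. In the real design the decryption would live in DataProviderConverter or the C++ side rather than in user-visible Python, and the key path and file layout below are assumptions.

from cryptography.fernet import Fernet

# Hypothetical mount point of the Kubernetes Secret holding the dataset key.
KEY_PATH = "/var/run/secrets/paddlecloud/dataset.key"

def encrypt_dataset(plain_path, encrypted_path, key):
    """Offline admin tool: encrypt a dataset shard before uploading it to the cloud."""
    fernet = Fernet(key)
    with open(plain_path, "rb") as src, open(encrypted_path, "wb") as dst:
        dst.write(fernet.encrypt(src.read()))

def train(encrypted_path):
    """Reader creator: yields decrypted records; the key is only readable inside the pod."""
    def reader():
        with open(KEY_PATH, "rb") as f:
            fernet = Fernet(f.read())
        with open(encrypted_path, "rb") as f:
            for record in fernet.decrypt(f.read()).splitlines():
                yield record
    return reader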

Store the encryption key as a Kubernetes Secret and keep it confidential.
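
For example, the key could be generated and stored with the kubernetes Python client roughly as follows (the Secret name and namespace are assumptions), and then mounted read-only into trainer pods at the path used by the reader above:

from cryptography.fernet import Fernet
from kubernetes import client, config

config.load_kube_config()

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="public-dataset-key"),
    # string_data lets us pass the key without base64-encoding it ourselves.
    string_data={"dataset.key": Fernet.generate_key().decode("ascii")},
)
client.CoreV1Api().create_namespaced_secret(namespace="paddlecloud", body=secret)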

helinwang commented 7 years ago

Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.

Yancey1989 commented 7 years ago

Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.

Good point. If we allow users to use a custom Paddle binary, they can always print the decrypted data. @typhoonzero and I discussed this question yesterday; maybe forbidding custom Paddle binaries and custom runtime Docker images is a good choice.