OpenMined / SyferText

A privacy preserving NLP framework
Apache License 2.0
198 stars 49 forks

pipeline local caching #187

Closed AlanAboudib closed 3 years ago

AlanAboudib commented 4 years ago

Feature Description

When a pipeline is loaded from PyGrid, we should be able to save the states of the pipe components to which the local worker has access.

Saving can be done with a call to nlp.save(destination = 'local') or nlp.save(), which uses destination = 'local' as the default.

The reason we introduce the destination argument is that I think we will need other data lake storage types in the future, such as s3, to back up pipelines.

When loading a model, the call would be `nlp.load(pipeline_name = 'syfertext_sentiment', cache = Union[None, 'local', 's3'])`, where cache takes one of None, 'local', or 's3'.

For this issue, only the implementations of 'local' and None are required.
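
A minimal sketch of how the destination and cache arguments described above could behave, written as standalone functions rather than as methods of the real nlp object. The function bodies, the JSON file layout, and the cache_dir parameter are assumptions for illustration only; nothing here is SyferText's actual API.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Only 'local' is in scope for this issue; 's3' is a possible future destination.
SUPPORTED_DESTINATIONS = {"local"}


def save(name: str, payload: dict, cache_dir: Path, destination: str = "local") -> Path:
    """Write a pipeline payload to the chosen destination (hypothetical helper)."""
    if destination not in SUPPORTED_DESTINATIONS:
        raise NotImplementedError(f"destination {destination!r} is not implemented")
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.json"
    path.write_text(json.dumps(payload))
    return path


def load(pipeline_name: str, cache_dir: Path, cache: Optional[str] = None) -> Optional[dict]:
    """Return the cached payload, or None when caching is off or nothing is cached."""
    if cache is None:
        return None  # the caller would fall back to loading from PyGrid
    if cache not in SUPPORTED_DESTINATIONS:
        raise NotImplementedError(f"cache {cache!r} is not implemented")
    path = cache_dir / f"{pipeline_name}.json"
    return json.loads(path.read_text()) if path.exists() else None
```

Returning None on a cache miss keeps the fallback to PyGrid in the caller's hands, so the cache stays an optimization rather than a requirement.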

Notice that, for this to work properly, each pipeline should have a version and a timestamp in its tags. If the cached version and timestamp do not correspond to the PyGrid ones, the cache is not used.
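
The version and timestamp rule can be sketched as a small comparison. The PipelineTags class and the cache_is_valid function are hypothetical names, and representing the timestamp as a string is an assumption; the issue does not specify how tags are stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineTags:
    """Hypothetical container for the two tags the cache check relies on."""
    version: str
    timestamp: str  # assumed to be a string such as an ISO 8601 date


def cache_is_valid(cached: Optional[PipelineTags], remote: PipelineTags) -> bool:
    """Use the cache only when both version and timestamp match the PyGrid copy."""
    if cached is None:
        return False
    return cached.version == remote.version and cached.timestamp == remote.timestamp
```

Under this sketch, any mismatch in either field means the cached copy is ignored and the pipeline is loaded from PyGrid again.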

Is your feature request related to a problem?

Pipeline loading might be time consuming. It would be impractical to load the same pipeline several times during testing.

Additional Context

Merge to the syfertext_0.1.0 branch.

hershd23 commented 4 years ago

Hi Alan, I had already started some work on this, but was later notified by Nilansh that the PyGrid nodes weren't working properly, so I dropped it temporarily. I'd like to take this one again.

AlanAboudib commented 4 years ago

It is yours @hershd23

hershd23 commented 4 years ago

@AlanAboudib, @Nilanshrajput I have already tried pickle and dill, and they both throw an error saying that the pipeline object is of a type they can't save. Any fixes?

Error :- Can't pickle <class 'syft.frameworks.torch.hook.hook.TorchHook._hook_worker_methods.<locals>.Torch'>: it's not found as syft.frameworks.torch.hook.hook.TorchHook._hook_worker_methods.<locals>.Torch
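
The `<locals>` part of that message suggests the class is defined inside a function, which the standard pickle module cannot handle because it stores classes by their importable name. The snippet below reproduces that general limitation with a toy class; make_local_class and try_pickle are illustrative names, not PySyft code, and the exact exception type and message vary across Python versions.

```python
import pickle


def make_local_class():
    # A class defined inside a function has '<locals>' in its qualified name,
    # so pickle cannot look it up again by name when saving or loading.
    class Torch:
        pass

    return Torch


LocalTorch = make_local_class()


def try_pickle(obj):
    """Return the exception raised while pickling obj, or None on success."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError) as exc:
        return exc
```

Calling try_pickle on LocalTorch or on an instance of it returns an exception, while a plain dict pickles without trouble. Any object that holds a reference to such a class fails in the same way.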

Nilanshrajput commented 4 years ago

@hershd23 Check the simplify function: save what that function returns and, while loading, use its detail function for each object that you are trying to save.

AlanAboudib commented 4 years ago

@hershd23 The Pipeline object does not contain the states of the individual pipes. You should cache it in the same way it is deployed: the simplified Pipeline object plus the simplified State objects belonging to that pipeline. If the local worker does not have the necessary permissions to download a state (due to the restricted access property of the State object), then that state is not cached.
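
A rough sketch of that caching layout, under several assumptions: the simplified objects are plain Python containers that pickle can store, states are fetched one at a time through a caller-supplied function, and a denied download surfaces as an exception. AccessDenied, cache_pipeline, and fetch_state are hypothetical names, not part of SyferText or PySyft.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterable, List


class AccessDenied(Exception):
    """Hypothetical error raised when the local worker may not download a state."""


def cache_pipeline(
    simple_pipeline: Any,
    state_names: Iterable[str],
    fetch_state: Callable[[str], Any],
    cache_dir: Path,
) -> List[str]:
    """Store the simplified pipeline plus every simplified state that is accessible.

    Returns the names of the states that were actually cached.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "pipeline.pkl").write_bytes(pickle.dumps(simple_pipeline))

    cached = []
    for name in state_names:
        try:
            simple_state = fetch_state(name)  # assumed to return the simplified State
        except AccessDenied:
            continue  # restricted states are skipped, not cached
        (cache_dir / f"{name}.pkl").write_bytes(pickle.dumps(simple_state))
        cached.append(name)
    return cached
```

Keeping one file per state mirrors the idea that the pipeline and its states are deployed as separate objects, and lets a restricted state be left out without affecting the rest of the cache.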