[GSoC] Project: Building a Biomedical Language Model from Multiple Private Datasets

AlanAboudib commented 4 years ago

Title: Building a language model from private biomedical datasets using SyferText
Mentor: Alan Aboudib
Level: Intermediate

Introduction

SyferText is an NLP library being built to preserve the privacy of both datasets and models. To enforce these constraints, SyferText leverages PySyft to perform federated, encrypted training and prediction.

SyferText's main mission is to provide a framework that enables blind preprocessing of text. For example, suppose a client holds a private text dataset whose plain text you are not authorized to access. SyferText makes it possible to create special modules, such as tokenizers, that can be sent to the machine where the private dataset resides and tokenize each sentence there, without anyone violating the data access constraints. In principle, SyferText would allow sending virtually any module that needs to preprocess plain text blindly and prepare it for encrypted training.
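The idea of blind preprocessing can be illustrated with a toy simulation. This is only a sketch in plain Python and does not use the SyferText or PySyft APIs: `ToyWorker`, `SimpleTokenizer`, the handle format, and the example sentences are all hypothetical. The point is that the tokenizer travels to the data, and the caller receives only an opaque handle rather than the tokens.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in for a module that can be shipped to a data owner."""

    def __call__(self, text):
        # Lowercase, then keep runs of letters/digits, allowing inner hyphens
        return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class ToyWorker:
    """Simulates a remote machine holding private text (not a PySyft worker)."""

    def __init__(self, name, private_texts):
        self.name = name
        self._private_texts = list(private_texts)
        self._store = {}  # results stay on the worker, keyed by handle

    def run(self, module):
        """Apply `module` to every private text locally; return only a handle."""
        handle = f"{self.name}/obj-{len(self._store)}"
        self._store[handle] = [module(text) for text in self._private_texts]
        return handle

    def count(self, handle):
        # Assumption: the owner agrees to reveal the number of examples
        return len(self._store[handle])


hospital = ToyWorker(
    "hospital_a",
    ["Aspirin inhibits COX-1.", "BRCA1 mutations raise cancer risk."],
)
handle = hospital.run(SimpleTokenizer())
print(handle, hospital.count(handle))
```

In a real deployment the stored tokens would be encoded and encrypted before any training step touches them; the toy worker above only models the access pattern.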

Once the text representation is encrypted, another advantage of SyferText comes into play: training batches that mix encrypted examples from multiple private remote datasets can be created to train a single model. This way, multiple private datasets can be abstracted as a single, bigger dataset, which has the potential to produce better models. Such models would then be encapsulated in a SyferText model pipeline that can be dumped and reused.
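Treating several datasets as one can be sketched as follows. This is a plain-Python illustration, not SyferText code: `mixed_batches` and the owner names are hypothetical, and the strings stand in for examples that would already be encrypted in practice.

```python
import random


def mixed_batches(datasets, batch_size, seed=0):
    """Yield batches whose examples are drawn from all datasets combined.

    `datasets` maps an owner name to a list of examples.
    """
    rng = random.Random(seed)
    # Index every example by (owner, position) so all sources form one pool
    index = [
        (owner, i)
        for owner, examples in datasets.items()
        for i in range(len(examples))
    ]
    rng.shuffle(index)
    for start in range(0, len(index), batch_size):
        chunk = index[start:start + batch_size]
        yield [(owner, datasets[owner][i]) for owner, i in chunk]


datasets = {
    "clinic_a": ["a0", "a1", "a2"],
    "clinic_b": ["b0", "b1", "b2", "b3", "b4"],
}
batches = list(mixed_batches(datasets, batch_size=4))
for batch in batches:
    print(batch)
```

Each example appears exactly once per pass, and a single batch may contain examples from both owners, which is what lets one model train on the combined data.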

GSoC Task

Usually, when a new text dataset is encountered, the first step is to make sure a language model for that dataset is available. A minimal language model would include an embedding vector for each word in the dataset. Many NLP libraries provide off-the-shelf language models that can be used without the hassle of creating one's own. However, for some specific tasks, one might need to build a language model from scratch because no suitable model is available. But what if the dataset is private? Even worse, what if the language model is to be created for multiple private datasets belonging to different clients? SyferText to the Rescue!
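To make "a minimal language model" concrete, here is a small sketch of a vocabulary plus one vector per word. The function names, the `<unk>` token, and the random initialization are assumptions made for illustration; real embeddings would be learned by training rather than sampled at random.

```python
import random
from collections import Counter


def build_minimal_language_model(tokenized_sentences, dim=8, min_count=1, seed=0):
    """Return a vocabulary and one (randomly initialized) vector per word."""
    rng = random.Random(seed)
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Reserve index 0 for out-of-vocabulary words
    vocab = ["<unk>"] + sorted(t for t, c in counts.items() if c >= min_count)
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    return {"token_to_id": token_to_id, "vectors": vectors}


def lookup(model, token):
    """Fetch the vector for `token`, falling back to the <unk> vector."""
    unk_id = model["token_to_id"]["<unk>"]
    return model["vectors"][model["token_to_id"].get(token, unk_id)]


sentences = [["protein", "binds", "receptor"], ["receptor", "signals", "cell"]]
model = build_minimal_language_model(sentences, dim=8)
print(len(model["token_to_id"]), len(lookup(model, "protein")))
```

Because the model is just a dictionary of built-in types, it could be serialized (for example with `json.dumps`) and shipped as a package.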

The minimal goal of this project is to use SyferText to create a language model for biomedical text data. We are going to use a publicly available dataset to simulate multiple private datasets living on different PySyft workers. Then, SyferText and PySyft will be used to preprocess those datasets and prepare them for training a model that creates word embeddings. The resulting embeddings should then be packaged into a language model that we will make available to all SyferText users.
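The simulation setup can be sketched in plain Python. Nothing here is PySyft or SyferText API: `shard_corpus`, `skipgram_pairs`, the worker names, and the toy sentences are hypothetical. The proposal does not say which embedding objective will be used; skip-gram pairs are shown only as one common choice.

```python
def shard_corpus(sentences, worker_names):
    """Split a public corpus round-robin to mimic separate private datasets."""
    shards = {name: [] for name in worker_names}
    for i, sentence in enumerate(sentences):
        shards[worker_names[i % len(worker_names)]].append(sentence)
    return shards


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


corpus = [
    "insulin regulates glucose",
    "statins lower cholesterol",
    "antibodies bind antigens",
]
shards = shard_corpus(corpus, ["bob", "alice"])

# Each simulated owner derives its training pairs locally from its own shard
local_pairs = {
    name: [p for s in texts for p in skipgram_pairs(s.split(), window=1)]
    for name, texts in shards.items()
}
print({name: len(pairs) for name, pairs in local_pairs.items()})
```

In the actual project the shards would live on PySyft workers and the pairs would be encoded and encrypted before leaving their owner.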

While performing this task, missing features of SyferText will be discovered and built, and interesting follow-up tasks will be identified, such as including a pipeline of specialized modules (parsers, taggers, or named entity recognizers) in the packaged model.

Working on these tasks is not limited by the GSoC deadline, so if you are interested in a long-term contribution to SyferText, this is a good starting point.

Required skills:

Knowledge of deep learning in general and NLP in particular.
Familiarity with PySyft and the privacy-preserving concepts and tools it provides.
Python programming skills.
Familiarity with deep learning frameworks in general and PyTorch in particular.
Experience in working with biomedical datasets is a bonus.

Apply

If you are excited about this project and would like to contribute if it is selected for GSoC, please do the following:

Add a comment below to express your interest.
Let me know more about you by filling out this form https://forms.gle/4vrijKhMoQ52LadU8

Useful Materials

1- SyferText tutorials (including remote tokenization) https://github.com/OpenMined/SyferText/tree/master/tutorials

2- Example of biomedical language models for spaCy: https://allenai.github.io/scispacy/

ratmcu commented 4 years ago

Hello Alan, I am finishing my Master's thesis on privacy-sensitive data, and I created an end-to-end automation for training an NER model, from dataset creation to training. My last milestone is remote execution, using PySyft, of a BERT model that I trained centrally, and I have almost finished it. I would like to extend my experience by working on this SyferText-based LM.

praveenjoshi01 commented 4 years ago

Hello Alan,

The project sounds really interesting. One more idea that could be aligned with the project: we could build a domain-specific (healthcare, for now) knowledge base (ontology) from publicly available data, which could help us give more weight to certain aspects of the data when building the language model. I think it would really help any classification or identification task once we create and encrypt it.

I am interested in carrying this task forward and working on it over the longer run :).

Nilanshrajput commented 4 years ago

Hello Alan, SyferText looks very interesting, and building a language model for private medical data sounds really good. I would love to contribute to building new features for SyferText along the way.

AlanAboudib commented 4 years ago

Hi @Nilanshrajput, @praveenjoshi01 and @ratmcu, if you are interested in applying to this project, you need to create a proposal and submit it on the GSoC website.

Note that if this project is not selected for GSoC, or if it is selected but you are not, this does not mean that you cannot work on it. Working on a different dataset or subproblem would always be welcome.

praveenjoshi01 commented 4 years ago

@AlanAboudib - I think it would be great if we could have a Skype meeting where we can discuss the proposal and the direction of the project.

sachin-101 commented 4 years ago

Hi @AlanAboudib, I would love to contribute to this project and help build SyferText.

"Working on these tasks is not limited by the GSoC deadline, so if you are interested in a long-term contribution to SyferText, this is a good starting point."

Excited for long-term contribution.

sachin-101 commented 4 years ago

@AlanAboudib @praveenjoshi01 We can have a Slack channel for discussion.

AlanAboudib commented 4 years ago

I created the channel #gsoc_syfertext_lang_model
