deeppavlov / deeppavlov-gsoc-ideas


Multilinguality demo #11

Closed oserikov closed 3 years ago

oserikov commented 3 years ago

difficulty: challenging mentor: @oserikov requirements: python, NLP

Speakers of different languages do use DeepPavlov. By design, we are independent of any particular language. But since our main models are built and tested only for English and Russian, multilinguality remains a raw prototype. Building a complete pipeline for new languages would be a great idea.

potato-patata commented 3 years ago

We can use LASER embeddings (by Facebook). LASER maps input text into a language-agnostic vector space, where translations of the same input point to the same area. That is to say, phrases with the same meaning in any language map to the same region of the latent space. (https://towardsdatascience.com/multilingual-sentence-models-in-nlp-476f1f246d2f).
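A minimal sketch of that "same meaning, same area" property, using numpy and synthetic stand-in vectors rather than real LASER embeddings (real ones are 1024-dimensional and would come from a package such as `laserembeddings`):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic LASER-style sentence vectors (real LASER uses 1024 dims).
# A translation of a sentence should land close to the original in the
# shared latent space; an unrelated sentence should not.
rng = np.random.default_rng(0)
en_vec = rng.normal(size=1024)
fr_vec = en_vec + rng.normal(scale=0.05, size=1024)  # stand-in for a translation
other_vec = rng.normal(size=1024)                    # stand-in for an unrelated sentence

print(cosine_similarity(en_vec, fr_vec))     # high, close to 1.0
print(cosine_similarity(en_vec, other_vec))  # near 0.0
```

With real LASER embeddings the comparison works the same way: embed sentences from different languages and compare them with cosine similarity in the shared space.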

oserikov commented 3 years ago

Hey! LASER is a great idea! I will post some further details here in the next ~24 hrs.

arya2910 commented 3 years ago

@oserikov My name is Arya Gupta and I am a 2nd-year computer science undergraduate, specializing in artificial intelligence, at Medicaps University, Indore, India. I am really interested in this project and looking forward to contributing to it under GSoC. It would be of great assistance if you could suggest how to get started with this project.

oserikov commented 3 years ago

@potato-patata , I had some thoughts on the LASER models. First, classical LASER is shipped as a Python package, which means it would require some coding to integrate LASER into the DeepPavlov codebase. On the other hand, this holds for lots of approaches, so it is probably worth taking a look at how embedding techniques are implemented in DP.
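To illustrate the kind of glue code such an integration involves, here is a hedged sketch of wrapping a LASER-like embedder behind the batch-in/batch-out calling convention DeepPavlov components follow. The `fake_embed` stub is a hypothetical stand-in for the real embedding call; an actual integration would subclass DeepPavlov's `Component` class and register it, which is not shown here:

```python
import numpy as np

class LaserEmbedderSketch:
    """Hypothetical wrapper exposing a LASER-like embedder as a
    batch-in/batch-out callable, mirroring DeepPavlov's component style.
    A real version would subclass DeepPavlov's Component base class and
    delegate to the actual LASER model instead of a stub."""

    def __init__(self, embed_fn, lang: str = "en"):
        self.embed_fn = embed_fn  # stand-in for the real LASER model
        self.lang = lang

    def __call__(self, batch):
        # Components receive a batch of sentences and return a batch of vectors.
        return [self.embed_fn(sentence, self.lang) for sentence in batch]

def fake_embed(sentence: str, lang: str) -> np.ndarray:
    """Stub embedding function: deterministic per (sentence, lang) within a run."""
    rng = np.random.default_rng(abs(hash((sentence, lang))) % (2**32))
    return rng.normal(size=1024)

embedder = LaserEmbedderSketch(fake_embed)
vectors = embedder(["hello world", "bonjour le monde"])
print(len(vectors), vectors[0].shape)  # 2 (1024,)
```

The point of the sketch is the interface, not the embedding itself: once the model sits behind a batch-in/batch-out callable, it can be slotted into a pipeline like any other embedding component.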

oserikov commented 3 years ago

@arya2910 hey! First things first, you should probably learn how DeepPavlov pipelines work. Multilinguality techniques mostly rely on word and sentence embedding techniques, so it's probably a good idea to pay attention to DP embedding components. Another direction here is to dive deeper into the ASR|TTS stuff.
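To get a feel for how DP pipelines are wired, here is a hedged sketch of the JSON-style config structure DeepPavlov uses (shown as a Python dict): a "chainer" declares the pipeline's inputs and outputs and a "pipe" of components, each consuming and producing named variables. The component names below are hypothetical placeholders, not real registry entries:

```python
# A sketch of DeepPavlov's config structure: the "chainer" lists components
# ("pipe") and the data flowing between them ("in"/"out").
# Component names below are hypothetical placeholders, not real registered components.
config = {
    "chainer": {
        "in": ["x"],  # raw input sentences
        "pipe": [
            {
                "class_name": "sentence_embedder_stub",  # e.g. a LASER wrapper
                "in": ["x"],
                "out": ["x_emb"],
            },
            {
                "class_name": "classifier_stub",  # downstream task model
                "in": ["x_emb"],
                "out": ["y_pred"],
            },
        ],
        "out": ["y_pred"],
    }
}

pipe = config["chainer"]["pipe"]
print([step["class_name"] for step in pipe])
```

Swapping in a multilingual embedding component at the first step, while leaving the downstream task components unchanged, is essentially what "building a complete pipeline for new languages" amounts to.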

potato-patata commented 3 years ago

@oserikov Thank you for the reply, I will have a look at this and get back to you.

potato-patata commented 3 years ago

@oserikov is there any way we can reduce the dependencies for config files when running on a local machine?

tathagata-raha commented 3 years ago

Hi @oserikov , I contacted you on Telegram regarding multilingual models for Indian languages. I proposed that IndicBERT could be used to build the multilingual models for Indian languages.

arya2910 commented 3 years ago

hey @danielkornev, why is this issue closed? Are we not going to work on this project?

danielkornev commented 3 years ago

(duplicate of response in TG group)

Hi Arya!

This is the first year of our participation in GSoC. While we'd be thrilled to have contributions done across all of the areas of our projects, we feel it'd be better to focus on things that are closer to the core of our projects.

We appreciate the interest in multilingualism, and, in a way, we support it with multilingual BERT; however, we'd like to focus attention on more important issues.

However, if you want to work on a project like this outside of GSoC, we certainly welcome you to become one of our contributors: make a PR and get your contribution shipped as part of our library!