abhijithneilabraham / tableQA

AI Tool for querying natural language on tabular data.
GNU General Public License v3.0

About train data of clf.py #53

Open svjack opened 3 years ago

svjack commented 3 years ago

Hi, I reviewed the code and have two questions. The first is about wikidata.csv, the data used to train the classifier in clf.py. The samples are mostly English, with a small amount from other languages such as Japanese, and some samples are upper-cased. This data is used to train a classifier over the SQL-meaning format of the question, yet the TensorFlow Hub model is trained on English only, and the input questions always seem to be in English. Why do you include some multilingual samples and upper-case transformations? It looks as if you want the classifier to handle multilingual input, and to tolerate upper case the way SQL input does. If that is the goal, why not use a multilingual embedding from TF Hub, extend the samples with some NMT translations, and apply case transformations as augmentation methods?
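(For reference, a minimal sketch of what I mean by a multilingual TF Hub embedding. The module URL is the public multilingual Universal Sentence Encoder; the surrounding code is hypothetical and not the project's actual clf.py:)

```python
# Sketch: swap the English-only TF Hub encoder for the multilingual USE.
# Assumes tensorflow, tensorflow_hub and tensorflow_text are installed;
# tensorflow_text must be imported so the module's custom ops are registered.
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops)

embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
)

# The same encoder maps questions in different languages into one space,
# so a single classifier head could be trained on mixed-language samples.
questions = [
    "How many rows have a value greater than 10?",  # English
    "10より大きい値を持つ行はいくつありますか？",          # Japanese
]
vectors = embed(questions)  # tensor of shape (2, 512)
```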

My second question: since the project is built mainly on pre-trained models and can be used without training, the lexical parsing of the input question (to identify intent) relies mostly on custom, pre-assigned keywords defined in the adapt methods of the subclasses of ColumnType (such as Number and Date), so the project mainly targets questions that are lexically simple. If I want to use it on questions in other languages (such as Chinese or Japanese), it seems I could use a simple NMT model to translate them into English and then use your models, without replacing the keywords defined in those adapt methods (because the input questions are lexically simple, the translated questions should be well formed). As we all know, the schema or column names of a database table or pandas DataFrame are usually in English, while the table content may be in another language. In that situation I have to choose how to handle the non-English content: if I also translate the content into English, that seems to work; if I don't, the qa function defined in your nlp.py would have to use a multilingual SQuAD transformer (some RoBERTa model).
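(A rough sketch of the translate-then-query idea. The Agent import mirrors the project's README, but the constructor argument and the wrapper function are my assumptions, not existing tableQA behaviour; EasyNMT is just one possible translator:)

```python
# Sketch: translate a non-English question to English, then reuse the
# existing English-only tableQA pipeline unchanged.
from easynmt import EasyNMT
from tableqa.agent import Agent  # import path as in the project README

translator = EasyNMT("opus-mt")
agent = Agent("data.csv")  # constructor argument is an assumption here

def query_multilingual(question, source_lang):
    # Translate the question into English; since the questions are
    # lexically simple, the opus-mt output should be well formed.
    english_q = translator.translate(
        question, source_lang=source_lang, target_lang="en"
    )
    return agent.query_db(english_q)

print(query_multilingual("10より大きい値を持つ行はいくつありますか？", "ja"))
```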

All I want is to extend this project from English-only tableQA to multilingual tableQA. Since the input questions and table data are lexically simple, will this feature be supported in the project in the future?

abhijithneilabraham commented 3 years ago

Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatization, etc. are done mainly for English, so all of that would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatization, tokenization, etc.

svjack commented 3 years ago

Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatization, etc. are done mainly for English, so all of that would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatization, tokenization, etc.

I tried the opus-mt model from EasyNMT (https://github.com/UKPLab/EasyNMT) to translate your wikidata.csv into other languages; intuitively, the translations look well formed.
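(Roughly what I did, modulo column names; the "question" column is an assumption about wikidata.csv:)

```python
# Sketch: batch-translate the question column of wikidata.csv with EasyNMT.
import pandas as pd
from easynmt import EasyNMT

model = EasyNMT("opus-mt")
df = pd.read_csv("wikidata.csv")  # column name "question" is an assumption

# translate() accepts a list and batches it internally; the classifier
# labels stay untouched, only the question text is translated.
df["question_zh"] = model.translate(
    df["question"].tolist(), source_lang="en", target_lang="zh"
)
df.to_csv("wikidata_zh.csv", index=False)
```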

abhijithneilabraham commented 3 years ago

That's okay for wikidata.csv, but what about the QA models?

svjack commented 3 years ago

That's okay for wikidata.csv, but what about the QA models?

https://huggingface.co/deepset/xlm-roberta-large-squad2
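For example, loaded through the transformers pipeline (the question/context strings below are just placeholders):

```python
# Sketch: multilingual extractive QA with deepset/xlm-roberta-large-squad2.
from transformers import pipeline

qa = pipeline(
    "question-answering", model="deepset/xlm-roberta-large-squad2"
)

# The model is fine-tuned on SQuAD 2.0 but, being XLM-R based, transfers
# to other languages, so the table content would not need translating.
result = qa(
    question="东京的人口是多少？",
    context="东京是日本的首都，人口约为1400万。",
)
print(result["answer"])
```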

abhijithneilabraham commented 3 years ago

Interesting. Let's give this a shot; would you be able to contribute while we're at this task? It would also be good to have some fluency in the respective language.

svjack commented 3 years ago

I will try to adapt it to Chinese first.

abhijithneilabraham commented 3 years ago

Cool. Feel free to ask questions as you go. Also, as you already know, the multilingual support should be optional, something like agent.query_db(lang="chinese").
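(A minimal sketch of how that optional parameter might look; the lang keyword, the ISO mapping, and the _query_db_english helper are a proposal, not the current query_db signature:)

```python
# Sketch: make multilingual support opt-in via an optional lang argument
# on Agent.query_db; the existing English path is untouched by default.
from easynmt import EasyNMT

ISO = {"chinese": "zh", "japanese": "ja"}  # extend as languages are added

class Agent:
    _translator = None  # lazy-loaded so English-only users pay no cost

    def query_db(self, question, lang="english"):
        if lang != "english":
            if Agent._translator is None:
                Agent._translator = EasyNMT("opus-mt")
            question = Agent._translator.translate(
                question, source_lang=ISO[lang], target_lang="en"
            )
        # Hypothetical name for the existing English-only code path.
        return self._query_db_english(question)
```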