jacob-hein closed 3 years ago
Here are three options to do non-English QA in ktrain:
SimpleQA constructor.
Use the ktrain.text.translation module to translate the corpus from your language to English and submit the questions in English.
I've only tried the third option.
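For illustration, the translation option could be sketched roughly as below, with Danish as the example source language. The helper names translate_corpus and make_danish_to_english are hypothetical, and the model name Helsinki-NLP/opus-mt-da-en is an assumption about which translation model to load; check both against your installed ktrain version before relying on this.

```python
def translate_corpus(docs, translate_fn):
    """Translate every non-empty document with the supplied callable."""
    return [translate_fn(doc) for doc in docs if doc.strip()]


def make_danish_to_english():
    """Return a callable that translates Danish text to English.

    Assumption: ktrain.text.translation exposes a Translator class that
    accepts a Hugging Face model name and has a translate() method.
    """
    # Imported lazily so translate_corpus stays usable without ktrain.
    from ktrain.text.translation import Translator

    # Assumed model name for Danish -> English.
    translator = Translator(model_name="Helsinki-NLP/opus-mt-da-en")
    return translator.translate
```

The translated documents would then be indexed as usual, and the questions submitted in English.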
I used your module to create an index of N Danish news articles and applied my fine-tuned Danish BERT QA model to them.
And ktrain works surprisingly well for just N=500.
However, I'm finding that the call text.SimpleQA.index_from_folder(folder_path=ARTICLES_DIR, index_dir=INDEXDIR) takes quite some time to run (~8 minutes) for just 500 news articles, each of them saved as a .txt file in the folder_path.
Ideally, I would want to create an index of roughly N=100,000 articles, but I fear index_from_folder() is somewhat of a bottleneck for this objective. These articles take up about 200 MB when stored in .json format.
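To put the concern in numbers, here is a back-of-the-envelope extrapolation from the 8 minutes measured for 500 articles. It assumes indexing time scales linearly with the number of articles, which may not hold in practice; extrapolate_hours is a hypothetical helper.

```python
def extrapolate_hours(minutes_measured, n_measured, n_target):
    """Linearly extrapolate indexing time (in hours) to a larger corpus."""
    seconds_per_article = minutes_measured * 60 / n_measured
    return seconds_per_article * n_target / 3600


# 8 minutes for 500 articles, scaled to 100,000 articles.
hours = extrapolate_hours(8, 500, 100_000)
print(f"Estimated indexing time: {hours:.1f} hours")
```

At the observed rate of roughly one second per article, that comes to about 26.7 hours.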
Is this a realistic objective for the purpose of this module?
If yes, how can I optimize the index creation with this many articles?
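One direction I could try, sketched below, is to load the articles straight from the .json file and index them from a list with several processes. This assumes that the .json file is a list of objects with a "text" field (a hypothetical layout), and that the installed ktrain version's index_from_list accepts the commit_every, multisegment, procs and breakup_docs arguments.

```python
import json


def load_articles(json_path, text_key="text"):
    """Read article bodies from a JSON file holding a list of objects.

    Assumption: each object stores the article body under text_key.
    """
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    return [r[text_key] for r in records if r.get(text_key)]


def build_index(docs, index_dir, procs=4):
    """Index a list of documents using several processes."""
    # Imported lazily so load_articles works without ktrain installed.
    from ktrain import text

    text.SimpleQA.initialize_index(index_dir)
    text.SimpleQA.index_from_list(
        docs,
        index_dir,
        commit_every=len(docs),  # commit once at the end
        multisegment=True,
        procs=procs,
        breakup_docs=True,
    )
```

Whether this actually removes the bottleneck would need to be measured on the real corpus.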
It should not take that long for just 500 news articles.
How long does index_from_folder take when you run the tutorial exactly as shown?
index_from_folder? Are you using the same parameters as in the tutorial?
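To get a comparable number, the call can be wrapped in a simple timer. This is a minimal sketch: timed and time_indexing are hypothetical helpers, and the index_from_folder arguments are the ones quoted earlier in this thread.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of a code block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.1f} s")


def time_indexing(articles_dir, index_dir):
    """Time a single index_from_folder run on a fresh index."""
    # Imported lazily so the timer itself has no ktrain dependency.
    from ktrain import text

    text.SimpleQA.initialize_index(index_dir)
    with timed("index_from_folder"):
        text.SimpleQA.index_from_folder(folder_path=articles_dir, index_dir=index_dir)
```

Running the same function once on the tutorial data and once on your own folder makes the two timings directly comparable.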
How adaptable is ktrain for, say, SQuAD-like QA with a BERT model pretrained in a language other than English?