amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
Apache License 2.0
1.23k stars 269 forks

Other languages #329

Closed jacob-hein closed 3 years ago

jacob-hein commented 3 years ago

How adaptable is ktrain for, say, SQuAD-like QA with a BERT model pretrained in a language other than English?

amaiya commented 3 years ago

Here are three options to do non-English QA in ktrain:

I've only tried the third option.

jacob-hein commented 3 years ago

I used your module to create an index of N Danish news articles and applied my fine-tuned Danish BERT QA model to them.

And ktrain works surprisingly well for just N=500.

However, I'm finding that the call text.SimpleQA.index_from_folder(folder_path=ARTICLES_DIR, index_dir=INDEXDIR) takes quite some time to run (~8 minutes) for just 500 news articles, each saved as a .txt file in folder_path.

Ideally, I would want to create an index of roughly N=100,000 articles, but I fear index_from_folder() is a bottleneck for this objective. These articles take up about 200 MB when stored in JSON format.
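
To put the concern in numbers, here is a rough back-of-the-envelope sketch. It assumes indexing time grows linearly with the number of articles, which is only an assumption and not something measured here; the helper name extrapolate_minutes is made up for illustration.

```python
def extrapolate_minutes(measured_minutes, measured_docs, target_docs):
    """Scale a measured indexing time to a larger corpus.

    Assumes time grows linearly with document count (an assumption,
    not a measured property of index_from_folder).
    """
    return measured_minutes * target_docs / measured_docs


# Figures from this thread: ~8 minutes for 500 articles, target of 100,000.
estimate = extrapolate_minutes(8, 500, 100_000)
print(f"{estimate:.0f} minutes (~{estimate / 60:.1f} hours)")
# prints "1600 minutes (~26.7 hours)"
```

If the linear assumption holds, the full corpus would take more than a day at the current speed.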

Is this a realistic objective for the purpose of this module?

If yes, how can I optimize the index creation with this many articles?

amaiya commented 3 years ago

It should not take that long for just 500 news articles.

  1. Did you follow the tutorial? How long does the call to index_from_folder take when you run the tutorial exactly as shown?
  2. What is the exact command you're using for index_from_folder? Are you using the same parameters as in the tutorial?
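
For comparing timings between the tutorial run and your own run, a minimal timing sketch like the following may help. The timed helper is hypothetical and not part of ktrain; the ktrain call is left commented out so the snippet runs without ktrain installed, and it only reuses the call and variable names quoted earlier in this thread.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn with the given arguments; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs on its own.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")

# With ktrain installed, the call from this thread could be timed the same way
# (ARTICLES_DIR and INDEXDIR are the poster's own variables):
# from ktrain import text
# _, seconds = timed(text.SimpleQA.index_from_folder,
#                    folder_path=ARTICLES_DIR, index_dir=INDEXDIR)
```

Running this once on the tutorial data and once on the 500 articles gives two numbers that can be compared directly.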