IndoNLP / indonlu

The first-ever vast natural language processing benchmark for Indonesian Language. We provide multiple downstream tasks, pre-trained IndoBERT models, and a starter code! (AACL-IJCNLP 2020)
https://indobenchmark.com
Apache License 2.0
554 stars, 193 forks

Build spell checker in Bahasa Indonesia #18

Closed Balurc closed 3 years ago

Balurc commented 3 years ago

Hi, you guys are doing amazing work!

I am building a spell checker for Bahasa Indonesia. Currently I am combining the original fastText Indonesian (id) language model with Norvig's spell-check algorithm. The results are OK, but I think they could be improved with a larger and cleaner language model.

I tried your FastText (Indo4B) model, but so far it produces the same results as the previous one. Misspelled words such as "anaak" and "indonesa" still show up.
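For readers following along, here is a minimal sketch of a Norvig-style corrector in Python. It is not the exact code used in this issue: it assumes the vocabulary is available as a word-to-frequency table (the thread does not say how the fastText model is plugged in), and the toy counts are made up for illustration. It also shows why a noisy vocabulary hurts: if a misspelling such as "anaak" is itself in the table, the corrector treats it as a known word and returns it unchanged.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words, counts):
    """Subset of words that appear in the frequency table."""
    return {w for w in words if w in counts}

def correct(word, counts):
    """Most frequent known candidate within two edits, else the word itself."""
    candidates = (known([word], counts)
                  or known(edits1(word), counts)
                  or known((e2 for e1 in edits1(word) for e2 in edits1(e1)), counts)
                  or {word})
    return max(candidates, key=lambda w: counts.get(w, 0))

# Toy frequency tables (hypothetical values, for illustration only).
clean_counts = Counter({"anak": 50, "indonesia": 40, "enak": 5})
noisy_counts = Counter({"anak": 50, "indonesia": 40, "anaak": 1})

print(correct("anaak", clean_counts))     # corrected, because "anaak" is unknown
print(correct("indonesa", clean_counts))
print(correct("anaak", noisy_counts))     # left as-is, because the typo is "known"
```

With a table like `noisy_counts`, one option is to drop entries below a minimum frequency before using the vocabulary, at the cost of also losing rare but valid words.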

Any ideas on how I can do this task better? I am a newbie here, by the way :) Any advice is welcome.

Could you please point me to a dataset/corpus of yours that covers formal Indonesian?

Thanks a lot!

gentaiscool commented 3 years ago

Hi @Balurc, exciting work! And happy new year!

I would like to know how you used the fastText model in Norvig's spell-check algorithm. Could you provide more details?

Regarding the dataset, it is available at https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_wot_uncased_blanklines.tar.xz and contains both formal and informal sentences. We combined all of them into a single dataset, so it would be best to filter them yourself.
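One possible way to do that filtering, sketched in Python under stated assumptions: keep only sentences in which most tokens appear in a trusted word list. The `lexicon` set, the 0.9 threshold, and the function names are all hypothetical; the thread does not prescribe a filtering method, and whitespace tokenization is a simplification.

```python
def lexicon_coverage(sentence, lexicon):
    """Fraction of whitespace-separated tokens found in the trusted lexicon."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0  # blank lines get zero coverage and are dropped by the filter
    return sum(token in lexicon for token in tokens) / len(tokens)

def filter_formal(lines, lexicon, threshold=0.9):
    """Yield lines whose lexicon coverage is at least threshold (assumed cutoff)."""
    for line in lines:
        if lexicon_coverage(line, lexicon) >= threshold:
            yield line

# Toy lexicon and sentences (hypothetical, for illustration only).
lexicon = {"saya", "sedang", "makan", "nasi"}
lines = ["saya sedang makan nasi", "gue lg mkn nasi", ""]

kept = list(filter_formal(lines, lexicon))
print(kept)
```

In practice the lexicon would need to come from a reliable source of formal vocabulary, and the threshold would need tuning so that sentences with names or loanwords are not discarded too aggressively.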

Let me know if you need any more help! All the best with your research!

gentaiscool commented 3 years ago

I will close this issue since there is no more active conversation. You are welcome to reopen this issue anytime.