Hamza5 / Pipeline-diacritizer

Automatic Arabic diacritics restoration tool.
MIT License

How do I apply the tashkeela dataset to train the system? #1

Closed BarnabasSzabolcs closed 4 years ago

BarnabasSzabolcs commented 4 years ago

Hi! Your code looks like very nice work, something that has been thought through thoroughly. And it looks like you've built special preprocessing functionality for the Tashkeela dataset. However, I'm not sure how to feed the Tashkeela dataset into your algorithm for training.

Can you please help me? Thank you, Barney

Hamza5 commented 4 years ago

Hi,

The training is done using the preprocessed version of the Tashkeela dataset, not the original.

Basically, you can just train using this command using the default parameters:

$ pipeline_diacritizer train --train-data tashkeela_train.txt --val tashkeela_val.txt

The weights will be updated on every epoch. It will take a very long time to complete one epoch because the dataset is huge; you can remove some lines from it to make it smaller.
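Since one epoch takes so long, trimming the training file is the quickest way to get a faster feedback loop. Here is a minimal sketch of that idea; the file names and the helper function are placeholders, not part of the repository:

```python
# Sketch: keep only the first N lines of the (huge) preprocessed training
# file so an epoch finishes faster. File names are placeholders.
import os
import tempfile

def take_first_lines(src_path, dst_path, max_lines):
    """Copy only the first max_lines lines of src_path into dst_path."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for i, line in enumerate(src):
            if i >= max_lines:
                break
            dst.write(line)

# Demo on a small throwaway file standing in for tashkeela_train.txt.
tmp = tempfile.mkdtemp()
full = os.path.join(tmp, 'tashkeela_train.txt')
small = os.path.join(tmp, 'tashkeela_train_small.txt')
with open(full, 'w', encoding='utf-8') as f:
    f.write('\n'.join(f'line {i}' for i in range(100)) + '\n')

take_first_lines(full, small, max_lines=10)
with open(small, encoding='utf-8') as f:
    kept = f.readlines()
print(len(kept))  # 10
```

You would then pass the trimmed file to `--train-data` instead of the full one.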

The number of epochs is 15 by default, but you can change this by setting the parameter --iterations to a different value, for example --iterations 30 for 30 iterations.

The weight files are stored in a subdirectory called Tashkeela_params in the current directory. You can change this by passing a different path to the parameter --weights-dir.

There is another, less important parameter called --early-stop, which has a value of 3 by default. It is used to stop the training before all the iterations are completed, in case the model has already reached its optimal weights. If this is not convenient, you can set it to a value higher than 3.
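Putting the options above together, a full invocation might look like the following; the file names and the weights directory are placeholders, and only the flags mentioned above are used:

```shell
# Train for up to 30 epochs, stop early after 3 epochs without improvement,
# and store weight files under my_weights/ instead of Tashkeela_params/.
pipeline_diacritizer train \
    --train-data tashkeela_train.txt \
    --val tashkeela_val.txt \
    --iterations 30 \
    --early-stop 3 \
    --weights-dir my_weights
```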

BarnabasSzabolcs commented 4 years ago

Thank you, Hamza, for the description! I'll download the preprocessed version and try the commands as you've described.

Let me suggest that it may be well worth uploading a trained version of the system (or even selling it as SaaS). A good Arabic diacritics restoration algorithm is hard to come by, and even Google does not offer one for sale. I have also yet to find a dialectal pronunciation algorithm, which is what I'm most interested in. I even tried to work my way back from online Latin-script Arabic to Arabic script, but the online Latin system is incredibly inconsistent... as if Arabic dialects didn't really have a clear concept of vowels, which I'm sure they do.

Hamza5 commented 4 years ago

You are welcome. I am planning to upload a trained version, but currently I can't because I left the partial weights on a server that has been turned off for 3 months, so I am waiting for the administration to turn it back on so I can access my data again. As for the service, I don't think I will do that, because this code is really slow and resource-intensive. However, I have to keep it because it was created as a companion to a scientific article that will be published soon. I am planning to rewrite the code and make another repository for a better version.

macriluke commented 4 years ago

Many thanks for providing the preprocessed data. Approximately how long can training be expected to take to complete 15 epochs?

Hamza5 commented 4 years ago

> Many thanks for providing the preprocessed data. Approximately how long can training be expected to take to complete 15 epochs?

My code isn't well optimized, and the Tashkeela dataset is really huge, so it will take a very long time. On the machine I was using during the research, which has 32 CPU cores and 4 GTX 1080 Ti GPUs, it took approximately 24 hours per epoch! Maybe you can reduce the data if you want to see quicker results.
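For a rough sense of scale, the numbers above imply the following back-of-envelope estimate (assuming comparable hardware; actual times will vary):

```python
# ~24 hours per epoch on 4x GTX 1080 Ti, 15 epochs by default.
hours_per_epoch = 24
epochs = 15
total_hours = hours_per_epoch * epochs
total_days = total_hours / 24
print(total_hours, total_days)  # 360 15.0
```

So the default 15 epochs would take on the order of two weeks of wall-clock time on that setup, which is why trimming the dataset is worth considering.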