Rerun training data with punctuation and see if there is a difference.

ayya-vimala commented 3 years ago

New training data can be found here: https://github.com/BuddhaNexus/buddhanexus-utils/tree/master/plien

For applying this to the real Pali texts I had a few thoughts. First of all, the Pali texts in the inputfiles folder of the segmented-pali repo have the original segmentation, which is not the one we use on BuddhaNexus because many segments are too long. I have therefore created two more folders:

The folder inputfiles holds the files with their original segmentation, including longer segments with multiple sentences.
The folder inputfiles_cut_segments_for_Aijato holds the files with longer segments cut according to ., ?, ; and : so as to make them usable for the Aijato automated pali --> english translations.
The folder inputfiles_cut_segments_for_NN holds the files where longer segments are cut down even further, breaking on , as well. For the calculation of matches, smaller segments are preferable.
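As a rough illustration of the cutting described for the two new folders, here is a minimal Python sketch. It is not the script that produced those folders. The function name `cut_segment` and the two constants are made up for this example, and because the thread does not say how long a segment must be before it gets cut, the sketch simply cuts every segment it is given.

```python
import re

# Break characters as named in the comment above: ". ? ; :" for the Aijato
# folder, and additionally "," for the NN folder. The constant names are
# hypothetical.
SENTENCE_BREAKS = ".?;:"
NN_BREAKS = SENTENCE_BREAKS + ","


def cut_segment(segment, breaks=SENTENCE_BREAKS):
    """Split a segment after each break character, keeping the punctuation."""
    # Split on whitespace that directly follows one of the break characters.
    pattern = "(?<=[" + re.escape(breaks) + "])\\s+"
    return [part for part in re.split(pattern, segment.strip()) if part]


if __name__ == "__main__":
    text = "a b. c d, e f; g."
    print(cut_segment(text))             # sentence-level cut
    print(cut_segment(text, NN_BREAKS))  # finer cut that also breaks on ","
```

Keeping the punctuation attached to the left-hand piece means the cut pieces can be joined with spaces to give back the original segment.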

So I suggest using the Aijato folder for the pali --> english translation and, if needed, the NN folder for calculating matches.

Another thing I realized is that despite the frequent occurrences of some passages, for example in the Anguttara Nikaya, the system still cannot translate them properly. Maybe this is because duplicate entries have been removed. So how about keeping the duplicate entries in?

sebastian-nehrdich commented 3 years ago

Yes, a few notes on this from my side:

For neural machine translation, shorter sentence pairs (pli-eng) are always better than longer ones. The system cannot deal very well with sentences longer than 200 characters.
Regarding duplicates, we can try this. Another thing we can do is data augmentation (i.e. randomly combining different sentences to create new examples); I will check it out.
Rerunning with punctuation included should be trivial.
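To make the data augmentation idea concrete, here is a minimal Python sketch of randomly combining existing sentence pairs into new examples. It is only an illustration and not what was actually run. The function `augment_pairs`, its parameters, and the choice to reuse the 200-character figure as a cap on the combined Pali side are all assumptions made for this example.

```python
import random


def augment_pairs(pairs, n_new, max_chars=200, seed=0):
    """Create new (pali, english) pairs by joining two random existing pairs.

    `pairs` is a list of (pali, english) tuples with at least two entries.
    Combined pairs whose Pali side exceeds `max_chars` are discarded.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    augmented = []
    attempts = 0
    # Bound the number of attempts so the loop ends even if most
    # combinations turn out to be too long.
    while len(augmented) < n_new and attempts < 10 * n_new:
        attempts += 1
        (pli_a, eng_a), (pli_b, eng_b) = rng.sample(pairs, 2)
        pli = pli_a + " " + pli_b
        if len(pli) <= max_chars:
            augmented.append((pli, eng_a + " " + eng_b))
    return augmented
```

The two sides are joined in the same order, so each new Pali sentence stays aligned with its English counterpart.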

ayya-vimala commented 3 years ago

I have just uploaded new training data with punctuation but without numbers in it. I had a look at the segments there, but as the English and the Pali don't always match with regard to punctuation, I did not cut the segments into smaller ones. Quite often there is a longer Pali sentence which Bhante Sujato has translated into several shorter sentences. I can run something to see how many are longer than 200 characters.

ayya-vimala commented 3 years ago

There are 2162 matches across 8 files of pali sentences longer than 250 characters in the training data. Maybe we need to discuss what to do with that.
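For reference, a count like this could be reproduced with a short Python sketch along the following lines. It is not the script that produced the numbers above. It assumes the training data is plain UTF-8 text with one Pali sentence per line in `*.txt` files, which the thread does not confirm, and the function name `count_long_lines` is made up for this example.

```python
from pathlib import Path


def count_long_lines(folder, limit=250, pattern="*.txt"):
    """Return {file name: number of lines longer than `limit` characters}.

    Files without any long line are left out of the result.
    """
    counts = {}
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            n = sum(1 for line in handle if len(line.rstrip("\n")) > limit)
        if n:
            counts[path.name] = n
    return counts
```

Summing the values of the returned dictionary gives the total number of long sentences, and its length gives the number of files that contain them.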

ayya-vimala commented 3 years ago

I've now broken up the training data into sentences where segments contained several sentences (where possible), so there are a couple of thousand more training sentences now and they are on average much shorter. The training data also lacks all numbers now.

The data that can be used once the training of the network for English is complete is in segmented-pali/inputdata_cut_segments_for_Aijato. It is cut on real sentences.

Then for the data that can be used for calculating matches (segmented-pali/inputdata_cut_segments_for_NN) there are hardly any segments over 200 characters any more. Where there are, this is because of concatenated words or a lot of rubbish like numbers.

BuddhaNexus / buddhanexus-frontend

Rerun training data with punctuation and see if there is a difference. #368