Closed ayya-vimala closed 3 years ago
Yes, a few notes on this from my side:
I have just uploaded new training data with punctuation but without numbers. I had a look at the segments, but because the English and the Pali don't always match in punctuation, I did not split the segments into smaller ones. Quite often there is a longer Pali sentence that Bhante Sujato has translated as several shorter sentences. I can run something to see how many are longer than 200 characters.
There are 2162 matches across 8 files of Pali sentences longer than 250 characters in the training data. Maybe we need to discuss what to do with those.
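A count like the one above could be reproduced with a small script. This is only a sketch: the function names, the `*.txt` file pattern, and the one-segment-per-line layout are assumptions, not necessarily how the training files are actually organised.

```python
from pathlib import Path


def count_long_segments(lines, max_chars=250):
    """Count segments whose length (ignoring surrounding whitespace) exceeds max_chars."""
    return sum(1 for line in lines if len(line.strip()) > max_chars)


def count_long_segments_in_dir(folder, pattern="*.txt", max_chars=250):
    """Return {file name: count of long segments} for every matching file in folder.

    Assumes one segment per line; the file pattern is a placeholder.
    """
    counts = {}
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = count_long_segments(handle, max_chars)
    return counts
```

Summing the values of the returned dictionary gives the total number of over-long segments, and the number of non-zero entries gives the number of affected files.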
I've now broken up the training data into sentences wherever segments contained several sentences (where possible), so there are now a couple of thousand more training sentences and they are on average much shorter. The training data also no longer contains any numbers.
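As an illustration of what "where possible" can mean, here is a minimal sketch that splits a Pali/English pair only when both sides contain the same number of sentences. The splitting rule (break after `.`, `?` or `!` followed by whitespace) and the function names are assumptions for illustration, not the script that was actually used.

```python
import re

# Assumed rule: a sentence ends at ".", "?" or "!" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation on each one."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def split_pair(pali, english):
    """Split an aligned pair into sentence pairs only if both sides have equally many sentences.

    Otherwise the pair is returned unchanged, since the alignment would be a guess.
    """
    pali_parts = split_sentences(pali)
    english_parts = split_sentences(english)
    if len(pali_parts) == len(english_parts) and len(pali_parts) > 1:
        return list(zip(pali_parts, english_parts))
    return [(pali, english)]
```

A pair where one long Pali sentence corresponds to several English sentences stays as one segment under this rule, which matches the mismatch in punctuation described earlier.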
The data that can be used after the training of the network in English is complete is in segmented-pali/inputdata_cut_segments_for_Aijato. This is cut on real sentences.
For the data that can be used for calculating matches (segmented-pali/inputdata_cut_segments_for_NN), there are hardly any segments over 200 characters any more. Where there are, it is because of concatenated words or a lot of rubbish such as numbers.
New training data can be found here: https://github.com/BuddhaNexus/buddhanexus-utils/tree/master/plien
For applying this to the real Pali texts I had a few thoughts. First of all, the Pali texts in the segmented-pali repo folder inputfiles use the original segmentation, which is not the one we use on BuddhaNexus because many segments are too long. I have therefore created two more folders. I suggest using the Aijato folder for the Pali-to-English application and, if needed, the NN folder for calculating matches.
Another thing I realized is that, despite the frequent occurrences of some passages like Anguttara Nikaya, the system still cannot translate them properly. Maybe this is because duplicate entries have been removed. So how about keeping the duplicate entries in?
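To make the choice concrete, here is a small sketch of a preparation step with a switch for keeping or dropping duplicate pairs. The function name and the `keep_duplicates` flag are hypothetical, and the pairs are assumed to be (Pali, English) tuples; the existing pipeline may handle this differently.

```python
from collections import Counter


def prepare_pairs(pairs, keep_duplicates=True):
    """Return the training pairs, optionally dropping exact duplicates.

    With keep_duplicates=True frequent passages stay in the data as often as
    they occur; with False only the first occurrence of each pair is kept,
    in the original order.
    """
    pairs = list(pairs)
    if keep_duplicates:
        return pairs
    return list(dict.fromkeys(pairs))


def most_repeated(pairs, top=10):
    """List the pairs that occur more than once, most frequent first."""
    counts = Counter(pairs)
    return [(pair, n) for pair, n in counts.most_common(top) if n > 1]
```

Running `most_repeated` on the data before deduplication would show which passages lose the most weight when duplicates are removed.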