averkij / lingtrain-aligner

Lingtrain Aligner — ML powered library for the accurate texts alignment.
GNU General Public License v3.0
123 stars 9 forks

Add text splitting into small parts #3

Open AigizK opened 3 years ago

AigizK commented 3 years ago

The current version ignores the H1-H5 headers added by the user. But when a book is translated, the text of chapter 1 stays the text of chapter 1 in the other language. You can use this fact to split a big text into small parts.
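A minimal sketch of the header idea, not the library's actual behavior. It assumes the headers are written Markdown-style (`#` to `#####` at the start of a line), which may differ from the markup the aligner really uses; `split_by_headings` and `pair_parts` are hypothetical names.

```python
import re

# Assumption: headings look like "# Title" .. "##### Title" (H1-H5).
HEADING = re.compile(r"^#{1,5}\s+\S")


def split_by_headings(lines):
    """Group lines into parts, starting a new part at every heading line."""
    parts = []
    current = []
    for line in lines:
        # A heading closes the part collected so far and opens a new one.
        if HEADING.match(line) and current:
            parts.append(current)
            current = []
        current.append(line)
    if current:
        parts.append(current)
    return parts


def pair_parts(source_lines, target_lines):
    """Pair source and target parts by order: chapter 1 with chapter 1, etc."""
    src = split_by_headings(source_lines)
    tgt = split_by_headings(target_lines)
    if len(src) != len(tgt):
        raise ValueError("heading counts differ; cannot pair parts by order")
    return list(zip(src, tgt))
```

Each pair could then be aligned on its own, so an alignment error in one chapter cannot spill into the next.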

Next idea: try to split a big text into small blocks automatically. Select a few sentences from the original text (for example, 10 sentences) and, in a loop, try to find the matching block in the translated text.

You can use the following pseudocode:

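# Window of 10 sentences from the original text, starting at index 100
left_start = 100
left_array = original_sentences[left_start:left_start + 10]

# Score every candidate window of 10 translated sentences around that position
scores = {}
for i in range(50, 150):
    right_array_candidate = translated_sentences[i:i + 10]
    # cosine_similarity is a placeholder helper: higher means closer in meaning
    scores[i] = sum(cosine_similarity(left_array, right_array_candidate))

# The start of the best-scoring window is the matching position in the translation
right_start = max(scores, key=scores.get)

left_text_split_index = left_start
right_text_split_index = right_start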
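
The pseudocode above could be made runnable roughly like this. It is only a sketch: it assumes each sentence has already been turned into an embedding vector by some multilingual model (that step is not shown), and `cosine`, `window_score` and `find_split` are hypothetical helper names, not part of lingtrain-aligner.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def window_score(left, right):
    """Sum of similarities between sentences at the same offset in both windows."""
    return sum(cosine(u, v) for u, v in zip(left, right))


def find_split(src_vecs, tgt_vecs, src_start, window=10,
               search_from=0, search_to=None):
    """Return (best_target_index, score) for the source window at src_start.

    Candidate windows start at search_from <= i < search_to in the translation.
    Returns (None, -inf) when there is no candidate window to try.
    """
    left = src_vecs[src_start:src_start + window]
    last_start = len(tgt_vecs) - window
    if search_to is None or search_to > last_start + 1:
        search_to = last_start + 1
    best_index, best_score = None, float("-inf")
    for i in range(max(search_from, 0), search_to):
        score = window_score(left, tgt_vecs[i:i + window])
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

With real data, `src_start` and the returned index would be the split points for the original and the translated text. This version compares sentences only position by position inside the window, so it will likely work best when the two windows contain no merged or split sentences.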