refactor and add exploration assignment

Assignments

Drop all duplicates.
How many characters are in each line, on average, and in total for the dataset (th and zh)?
How many words are in each line, on average, and in total for the dataset (th tokenized with pythainlp.tokenize; check your pythainlp version)?
How many words are in each line, on average, and in total for the dataset (zh; try the tokenizers jieba, pkuseg, or any others you find interesting)?
What is the zh-to-th word ratio for each line and on average for the dataset? For example, (我吃飯, ฉันกินข้าว) has 3 zh words and 3 th words, so the ratio is $3/3=1$.
Find the similarity score for each sentence pair, and the average for the dataset, using the multilingual universal sentence encoder.
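
As a starting point, the counting tasks above could be sketched roughly as follows. This is only an illustration, not the repository's code: the function names (drop_duplicates, length_stats, explore) and the result keys are made up for this example, and the dataset is assumed to be a list of (zh, th) string pairs. Tokenizers are passed in as plain callables so that pythainlp.tokenize.word_tokenize, jieba.lcut, or any other tokenizer can be swapped in. The similarity-score task is not covered here.

```python
from statistics import mean


def drop_duplicates(pairs):
    """Remove repeated (zh, th) pairs, keeping the first occurrence of each."""
    return list(dict.fromkeys(pairs))


def length_stats(lengths):
    """Summarise a list of per-line counts as per-line / average / total."""
    return {
        "per_line": lengths,
        "average": mean(lengths) if lengths else None,
        "total": sum(lengths),
    }


def explore(pairs, zh_tokenize, th_tokenize):
    """Compute character counts, word counts, and zh-to-th word ratios.

    zh_tokenize and th_tokenize are assumed to be callables that take a
    string and return a list of tokens, for example jieba.lcut and
    pythainlp.tokenize.word_tokenize.
    """
    pairs = drop_duplicates(pairs)

    # len() counts Unicode code points, so Thai vowel and tone marks
    # are each counted as one character.
    zh_chars = [len(zh) for zh, _ in pairs]
    th_chars = [len(th) for _, th in pairs]

    zh_words = [len(zh_tokenize(zh)) for zh, _ in pairs]
    th_words = [len(th_tokenize(th)) for _, th in pairs]

    # Skip lines whose th side has no tokens to avoid dividing by zero.
    ratios = [z / t for z, t in zip(zh_words, th_words) if t > 0]

    return {
        "n_pairs": len(pairs),
        "zh_chars": length_stats(zh_chars),
        "th_chars": length_stats(th_chars),
        "zh_words": length_stats(zh_words),
        "th_words": length_stats(th_words),
        "zh_th_word_ratio": {
            "per_line": ratios,
            "average": mean(ratios) if ratios else None,
        },
    }


if __name__ == "__main__":
    # Toy stand-in tokenizers so the sketch runs without extra packages;
    # replace them with real tokenizers for the actual assignment.
    toy_zh = list  # one token per character
    toy_th = lambda text: ["ฉัน", "กิน", "ข้าว"] if text == "ฉันกินข้าว" else [text]

    data = [("我吃飯", "ฉันกินข้าว"), ("我吃飯", "ฉันกินข้าว")]
    print(explore(data, toy_zh, toy_th))
```

Passing tokenizers as arguments keeps the statistics code identical across jieba, pkuseg, and pythainlp, so results from different tokenizers can be compared directly.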

LalitaDeelert / lalita-mt-zhth