LalitaDeelert / lalita-mt-zhth

Apache License 2.0

Refactor and add exploration assignment #1

cstorm125 closed this issue 3 years ago

cstorm125 commented 3 years ago

Assignments

  1. Drop all duplicates.
  2. How many characters are in each line, on average, and in total for the dataset (th and zh)?
  3. How many words are in each line, on average, and in total for the dataset (th tokenized by pythainlp.tokenize; check your pythainlp version)?
  4. How many words are in each line, on average, and in total for the dataset (zh; try the tokenizers jieba, pkuseg, or any others you find interesting)?
  5. zh-to-th word ratio for each line and on average for the dataset; for example, (我吃飯, ฉันกินข้าว) has 3 zh words and 3 th words, so the ratio is $3/3=1$.
  6. Find the similarity score for each sentence pair, and the average for the dataset, using the multilingual universal sentence encoder.
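
The steps above could be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function names (`drop_duplicates`, `length_stats`, `explore`, `cosine_similarity`) are hypothetical, the dataset is assumed to be a list of (zh, th) string pairs, and the default tokenizer is a character-level placeholder. In practice you would pass real tokenizers such as `pythainlp.tokenize.word_tokenize` for th and `jieba.lcut` for zh, and feed `cosine_similarity` with embeddings from the multilingual universal sentence encoder (not loaded here).

```python
import math
from statistics import mean


def drop_duplicates(pairs):
    """Drop exact duplicate (zh, th) pairs, keeping the first occurrence in order."""
    return list(dict.fromkeys(pairs))


def length_stats(lengths):
    """Return the per-line values together with their average and total."""
    return {"per_line": lengths, "average": mean(lengths), "total": sum(lengths)}


def char_tokenize(text):
    """Placeholder tokenizer (assumption): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def explore(pairs, zh_tokenize=char_tokenize, th_tokenize=char_tokenize):
    """Compute character counts, word counts, and zh-to-th word ratios."""
    pairs = drop_duplicates(pairs)
    zh_words = [len(zh_tokenize(zh)) for zh, _ in pairs]
    th_words = [len(th_tokenize(th)) for _, th in pairs]
    # Ratio is undefined when the th side has no words; mark those lines as None.
    ratios = [z / t if t else None for z, t in zip(zh_words, th_words)]
    valid = [r for r in ratios if r is not None]
    return {
        "pairs": pairs,
        "zh_chars": length_stats([len(zh) for zh, _ in pairs]),
        "th_chars": length_stats([len(th) for _, th in pairs]),
        "zh_words": length_stats(zh_words),
        "th_words": length_stats(th_words),
        "ratio": {"per_line": ratios, "average": mean(valid) if valid else None},
    }


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (sequences of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

To use real tokenizers, something like `explore(pairs, zh_tokenize=jieba.lcut, th_tokenize=pythainlp.tokenize.word_tokenize)` should work, assuming both packages are installed. Note that Python's `len` counts Unicode code points, so Thai vowel and tone marks each count as one character.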