LalitaDeelert/lalita-mt-zhth
Apache License 2.0
Refactor and add exploration assignment #1
Closed. cstorm125 closed this issue 3 years ago.
cstorm125 commented 3 years ago:
Assignments
- Drop all duplicates.
- How many characters in each line / average / total for dataset (`th` and `zh`)?
- How many words in each line / average / total for dataset (`th` tokenized by `pythainlp.tokenize`; check your pythainlp version)?
- How many words in each line / average / total for dataset (try `zh` tokenizers `jieba`, `pkuseg`, or any other ones you find interesting)?
- zh-to-th word ratio in each line / average for dataset; for example, `(我吃飯, ฉันกินข้าว)` has 3 `zh` words and 3 `th` words, so the ratio is $3/3=1$.
- Find similarity score for each sentence pair and average for dataset using multilingual universal sentence encoder.
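
One possible starting point for the counting tasks above is sketched below. It is only an illustration, not the repository's code: the function names (`drop_duplicates`, `length_stats`, `explore`), the `(zh, th)` pair layout, and the toy tokenizers are all assumptions made for this example. The tokenizers are passed in as plain callables so that `pythainlp.tokenize.word_tokenize` or `jieba.lcut` could be swapped in, assuming those packages are installed.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]                   # assumed layout: (zh sentence, th sentence)
Tokenizer = Callable[[str], List[str]]   # any function mapping text to a list of words


def drop_duplicates(pairs: Iterable[Pair]) -> List[Pair]:
    """Remove exact duplicate pairs while keeping first-seen order."""
    return list(dict.fromkeys(pairs))


def length_stats(lengths: List[float]) -> Dict[str, object]:
    """Summarise per-line values as per-line list, average, and total."""
    return {
        "per_line": lengths,
        "average": mean(lengths) if lengths else 0.0,
        "total": sum(lengths),
    }


def explore(pairs: Iterable[Pair], zh_tokenize: Tokenizer, th_tokenize: Tokenizer) -> Dict[str, object]:
    """Compute character counts, word counts, and zh-to-th word ratios."""
    unique = drop_duplicates(pairs)
    zh_words = [len(zh_tokenize(zh)) for zh, _ in unique]
    th_words = [len(th_tokenize(th)) for _, th in unique]
    # Skip lines with zero th words to avoid division by zero.
    ratios = [z / t for z, t in zip(zh_words, th_words) if t > 0]
    return {
        "n_pairs": len(unique),
        # len() counts Unicode code points, so Thai vowel and tone marks
        # are counted as separate characters.
        "zh_chars": length_stats([len(zh) for zh, _ in unique]),
        "th_chars": length_stats([len(th) for _, th in unique]),
        "zh_words": length_stats(zh_words),
        "th_words": length_stats(th_words),
        "zh_th_ratio": {"per_line": ratios, "average": mean(ratios) if ratios else 0.0},
    }


# Toy tokenizers so the sketch runs without extra packages (hypothetical).
# In practice one might pass, for example:
#   from pythainlp.tokenize import word_tokenize   -> th_tokenize=word_tokenize
#   import jieba                                   -> zh_tokenize=jieba.lcut
TOY_TH_SEGMENTS = {"ฉันกินข้าว": ["ฉัน", "กิน", "ข้าว"]}


def toy_zh_tokenize(text: str) -> List[str]:
    # Treats every character as one word; only sensible for this tiny example.
    return list(text)


def toy_th_tokenize(text: str) -> List[str]:
    # Looks up a hand-segmented sentence; falls back to the whole string.
    return TOY_TH_SEGMENTS.get(text, [text])


# The pair from the issue, repeated once to exercise de-duplication.
sample = [("我吃飯", "ฉันกินข้าว"), ("我吃飯", "ฉันกินข้าว")]
report = explore(sample, toy_zh_tokenize, toy_th_tokenize)
print(report["n_pairs"], report["zh_words"]["total"], report["th_words"]["total"])
print(report["zh_th_ratio"]["average"])
```

Keeping the tokenizer as a parameter makes it straightforward to rerun the same statistics with `jieba`, `pkuseg`, or a different pythainlp engine and compare the resulting word counts and ratios.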