SUSYUSTC / MathTranslate

translate scientific papers in latex, especially arxiv papers
https://github.com/SUSYUSTC/MathTranslate
Apache License 2.0
1.04k stars, 69 forks

do not translate words from a given vocabulary #72

Open rotcx opened 9 months ago

rotcx commented 9 months ago

e.g., do not translate LLM to 法学硕士 ("Master of Laws"). Leave it as LLM.

e.g., do not translate Transformer to 变压器 (an electrical transformer). Leave it as Transformer.

rotcx commented 9 months ago

If we cannot set such a do-not-translate vocabulary for the translators (Google, Tencent, ...), the only remedy is to replace the (wrongly) translated words with the original English words after translation.

rotcx commented 9 months ago

An implementation could be (where text_final holds the translated text):

    from functools import reduce
    replace_dict = {"法学硕士": "LLM", "变压器": "Transformer", "代币":"token"}
    text_final = reduce(lambda text, kv: text.replace(*kv), replace_dict.items(), text_final)
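
The chained `replace` calls above rescan the text once per entry, and a later entry can rewrite the output of an earlier one. A single-pass variant might look like the sketch below; `restore_terms` is a hypothetical helper name, not part of MathTranslate.

```python
import re


def restore_terms(text, replace_dict):
    """Replace every mistranslated term in one pass over the text."""
    if not replace_dict:
        return text
    # Longest keys first, so a longer term wins over a shorter one it contains.
    keys = sorted(replace_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up once; replaced text is never rescanned.
    return pattern.sub(lambda m: replace_dict[m.group(0)], text)


replace_dict = {"法学硕士": "LLM", "变压器": "Transformer", "代币": "token"}
print(restore_terms("法学硕士使用变压器处理代币", replace_dict))
```

Like the original, this still replaces a term everywhere, including places where 变压器 really does mean an electrical transformer.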

rotcx commented 9 months ago

Another (downstream) way is to post-process the translated main.tex file:

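    #!/bin/bash
    # Restore mistranslated terms in main.tex; the result is written to stdout.

    declare -A replace_dict=(["法学硕士"]="LLM" ["变压器"]="Transformer" ["代币"]="token")

    # IFS= and -r keep leading whitespace and backslashes intact;
    # the || test also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        for key in "${!replace_dict[@]}"; do
            line=${line//"$key"/${replace_dict[$key]}}
        done
        printf '%s\n' "$line"
    done < main.tex

rotcx commented 9 months ago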

Iterate over all .tex files in the directory dir and process each one (since in general we cannot know which .tex file is the main one):

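    #!/bin/bash
    # Restore mistranslated terms in every .tex file under dir, rewriting each file in place.

    declare -A replace_dict=(["法学硕士"]="LLM" ["变压器"]="Transformer" ["代币"]="token")

    # -print0 / read -d '' keep file names with spaces or newlines intact.
    find dir -name "*.tex" -print0 | while IFS= read -r -d '' file; do
        tmp=$(mktemp)
        while IFS= read -r line || [[ -n $line ]]; do
            for key in "${!replace_dict[@]}"; do
                line=${line//"$key"/${replace_dict[$key]}}
            done
            printf '%s\n' "$line"
        done < "$file" > "$tmp"
        # Copy back through a redirect so the original file permissions are kept.
        cat "$tmp" > "$file" && rm -f "$tmp"
    done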
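
The same directory-wide pass can also be sketched in Python, which avoids the line-by-line shell loop. This is only an illustration: `fix_tex_files` is a hypothetical name, and the sketch assumes the translated files are UTF-8 encoded.

```python
from pathlib import Path
from typing import Dict, List

REPLACE_DICT = {"法学硕士": "LLM", "变压器": "Transformer", "代币": "token"}


def fix_tex_files(root: str, replace_dict: Dict[str, str]) -> List[Path]:
    """Apply the replacements to every .tex file under root; return the files that changed."""
    changed = []
    for path in sorted(Path(root).rglob("*.tex")):
        original = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        fixed = original
        for wrong, right in replace_dict.items():
            fixed = fixed.replace(wrong, right)
        # Only rewrite files whose content actually changed.
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `fix_tex_files("dir", REPLACE_DICT)` rewrites the affected files in place and returns their paths, so it is easy to see which files were touched.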

sherrylixuecheng commented 8 months ago

Thank you for reporting this issue. Since MathTranslate is a general translation tool rather than one aimed only at CS or DL, we think it is better to leave the behavior as it is for now. We could consider adding a "user dictionary" feature, in which users manually define the terms that should stay untranslated. The only thing the user would need to do is load a vocabulary list. This is similar to your solution here, but more systematic and user-friendly. @SUSYUSTC
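
One way such a user dictionary could work is to mask the listed terms with placeholders before translation and restore them afterwards. The sketch below is an assumption about the design, not MathTranslate's actual API: all function names and the `XVOCABX…X` placeholder format are hypothetical, and it only works if the translation engine passes the placeholders through unchanged.

```python
import re
from typing import Callable, Dict, List, Tuple


def load_vocabulary(lines: List[str]) -> List[str]:
    """Parse a user dictionary: one protected term per line, '#' starts a comment."""
    terms = []
    for raw in lines:
        term = raw.split("#", 1)[0].strip()
        if term:
            terms.append(term)
    # Longest first, so a longer term is matched before a shorter one it contains.
    return sorted(set(terms), key=len, reverse=True)


def mask_terms(text: str, terms: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each protected term with a numbered placeholder (hypothetical format)."""
    if not terms:
        return text, {}
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    mapping: Dict[str, str] = {}

    def substitute(match):
        placeholder = f"XVOCABX{len(mapping)}X"
        mapping[placeholder] = match.group(0)
        return placeholder

    return pattern.sub(substitute, text), mapping


def unmask_terms(text: str, mapping: Dict[str, str]) -> str:
    """Put the original terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def translate_protected(text: str, terms: List[str],
                        translate: Callable[[str], str]) -> str:
    """Translate text while keeping the listed terms untouched."""
    masked, mapping = mask_terms(text, terms)
    return unmask_terms(translate(masked), mapping)
```

Here `translate` stands for whatever engine call is in use; in practice the vocabulary lines would come from a user-supplied text file.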