XapaJIaMnu / translateLocally

Fast and secure translation on your local machine, powered by marian and Bergamot.
MIT License

avoid translation of Named Entities #127

Open opme opened 1 year ago

opme commented 1 year ago

I noticed that named entities like company names are getting translated.

I was thinking of running a preprocessing model, such as one from spaCy (spacy.io), to flag all the named entities. I then want to avoid translating those.

I am wondering if there is an official way to prevent translation of parts of the text sent to translateLocally.

For example: China Nonferrous Gold Limited -> Kiina Non Iron Gold Limited (Finnish, from the Opus-mt student model)

Using the student models is the best solution for translating large amounts of text with limited computing power. I am playing around with translating a site I am building into many languages, but just 80k paragraphs was going to take months on a single computer. Here I can do it in one night.

jelmervdl commented 1 year ago

It's an issue we're aware of, but don't have a solution for yet. We're thinking along the same lines though!

Our plan was to add support for placeholders: a placeholder in the input sentence would be copied as-is into the output sentence (but in the proper position). We could then replace some or all named entities, URLs, email addresses, etc. with placeholders and put them back in after translation. The problem with this approach is that the model has to be trained with placeholder support, so this won't work with our current models.

What you could try is to use the support for HTML translation that's in bergamot-translator (the library backing translateLocally). I just pushed a commit to the main branch to make that accessible from the command line. With that version, you should be able to do something like:

echo "The train leaves for <span>London St. Pancras</span> at quarter past six." | ./translateLocally -m eng-fin-tiny --html
Juna lähtee <span>Lontoo St. Pancras</span> neljännestä yli kuusi.

HTML support is not really meant for this, but it might get you at least halfway. You can add <span id="1"> etc. around the named entities. They will still be translated, but at least you know where they are and you can put the originals back in. If you'd rather hide them from the translation engine, you can insert an empty <span id="1"></span> in their place, but that might confuse the translation model even more (because it's not trained on sentences with missing words).
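The wrap-and-restore idea above could be sketched roughly as follows. This is a minimal illustration, not part of translateLocally: `wrap_entities` and `restore_entities` are hypothetical helper names, the entity offsets are assumed to come from some NER step (for example spaCy's `ent.start_char` / `ent.end_char`), and it assumes each span id appears exactly once in the translated output.

```python
import html
import re

# Matches the id-carrying spans produced by wrap_entities (hypothetical helper).
SPAN_RE = re.compile(r'<span id="(\d+)">(.*?)</span>', re.S)


def wrap_entities(text, spans):
    """Wrap each (start, end) character span in <span id="N">...</span>.

    Returns the HTML-escaped, marked-up text and a dict mapping each id
    (as a string) to the original entity text. Spans must not overlap.
    """
    pieces = []
    originals = {}
    pos = 0
    for i, (start, end) in enumerate(sorted(spans), start=1):
        # Escape plain text so "&" and "<" survive HTML-mode translation.
        pieces.append(html.escape(text[pos:start], quote=False))
        originals[str(i)] = text[start:end]
        entity = html.escape(text[start:end], quote=False)
        pieces.append('<span id="%d">%s</span>' % (i, entity))
        pos = end
    pieces.append(html.escape(text[pos:], quote=False))
    return "".join(pieces), originals


def restore_entities(translated, originals):
    """Replace every known span with its original entity text and drop the tags."""
    pieces = []
    pos = 0
    for match in SPAN_RE.finditer(translated):
        pieces.append(html.unescape(translated[pos:match.start()]))
        # Unknown ids keep their translated content, minus the tags.
        fallback = html.unescape(match.group(2))
        pieces.append(originals.get(match.group(1), fallback))
        pos = match.end()
    pieces.append(html.unescape(translated[pos:]))
    return "".join(pieces)
```

The marked-up string from `wrap_entities` is what would be piped to `./translateLocally -m <model> --html`; `restore_entities` is then applied to the output. Note that the restored entity is inserted unchanged, so any case ending the target language would normally require is lost.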

opme commented 1 year ago

Thank you. It is working with the HTML support in all languages except Estonian. The HTML support looks to be broken in the Estonian model. I'm doing the preprocessing with a spaCy model that is able to detect full names. I then add the spans, and after the translation I use a regex to put the originals back.

I'm also seeing what looks like a memory leak, though it can be worked around by restarting the subprocess every 1000 iterations.
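One way to get the same effect as the periodic restart is to start a fresh process for every batch of inputs. The following is only a sketch under assumptions: `run_in_batches` is a hypothetical helper, and it assumes the command reads one input per line on stdin and writes exactly one output line per input line on stdout, which should be verified for the translateLocally build in use.

```python
import subprocess


def run_in_batches(lines, command, batch_size=1000):
    """Feed `lines` to `command` in batches, one fresh process per batch.

    Each process exits after its batch, so any memory it leaked is released.
    Assumes one output line per input line (inputs must not contain newlines).
    """
    results = []
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]
        proc = subprocess.run(
            command,
            input="\n".join(batch) + "\n",
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=True,  # raise if the process exits with an error
        )
        output = proc.stdout.splitlines()
        if len(output) != len(batch):
            raise RuntimeError(
                "expected %d output lines, got %d" % (len(batch), len(output))
            )
        results.extend(output)
    return results
```

With the flags shown earlier in this thread, the command would be something like `["./translateLocally", "-m", "eng-fin-tiny", "--html"]`. Model loading time is paid once per batch, so a larger `batch_size` trades memory growth for fewer restarts.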

I am still working on the scripts and will post an example when it is stable.

    import spacy
    from spacy.matcher import Matcher
    from transformers import pipeline

    # load the models used to handle named entities
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
    token_classifier = pipeline(
        "token-classification", model=model_checkpoint, aggregation_strategy="simple"
    )

Example of the Estonian issue. Hmm.

 echo "<span id=\"1\">Atrium Mortgage Investment Corporation</span>, a non-bank lender, provides financing solutions to the real estate communities in Ontario, Alberta, and British Columbia. It offers various types of mortgage loans for residential, multi-residential, and commercial real properties" | ./translateLocally -m en-et-tiny --html

<span id="1">Panka</span> mittekuuluva laenuandja <span id="1">Atrium Mortgage Investment Corporation</span> pakub rahastamislahendusi Ontario, Alberta ja Briti Columbia kinnisvarakogukondadele. Ta pakub erinevat tüüpi eluasemelaenu elamu-, multiresidentide ja ärikinnisvarale

jelmervdl commented 1 year ago

I think you're seeing the results of using alignment scores for inserting HTML, and why it isn't ideal for your use case. What it basically does is look up, for each output token, which source token aligns best with it according to some alignment model.
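As a toy illustration of that per-output-token lookup (this is not bergamot-translator's actual code; the function names and the alignment scores are made up for the example), each output token inherits the tag of its best-aligned source token, and nothing forces a tag to stay on one contiguous run of output tokens:

```python
def project_tags(source_tags, alignment):
    """Give each target token the tag of its best-aligned source token.

    alignment[t][s] is the (made-up) score that target token t aligns
    with source token s. A tag of None means "not inside any span".
    """
    projected = []
    for row in alignment:
        best_source = max(range(len(row)), key=row.__getitem__)
        projected.append(source_tags[best_source])
    return projected


def group_runs(tags):
    """Collapse consecutive equal tags into (tag, start, end) runs."""
    runs = []
    for i, tag in enumerate(tags):
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1], i + 1)
        else:
            runs.append((tag, i, i + 1))
    return runs


# Source: 5 tokens, the first two belong to span 1.
source_tags = [1, 1, None, None, None]
# Target: 4 tokens; token 0 is (wrongly) aligned to the first source token.
alignment = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
    [0.7, 0.2, 0.0, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.0, 0.1],
]
runs = group_runs(project_tags(source_tags, alignment))
# Span 1 now covers two separate runs: target token 0 and target tokens 2-3.
```

In this sketch a single misaligned output token is enough to make span 1 appear twice in the reconstructed markup.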

There's no guarantee in there that there's a 1-to-1 mapping, and the HTML reconstruction is allowed to duplicate elements if it thinks that a span in the input sentence got split up in the translated sentence. You might want to do some post-processing to decide which of the spans is the actual named entity.
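One possible post-processing heuristic, sketched here as an assumption rather than a recommended method: when several spans share an id, treat the one whose content is most similar to the original entity as the real one, replace it with the original text, and simply unwrap the others. `resolve_duplicate_spans` is a hypothetical name, and string similarity will not help when the entity was translated beyond recognition.

```python
import re
from difflib import SequenceMatcher

SPAN_RE = re.compile(r'<span id="(\d+)">(.*?)</span>', re.S)


def resolve_duplicate_spans(translated, originals):
    """Keep, per id, the span most similar to the original entity.

    `originals` maps span id (string) to the untranslated entity text.
    The best-matching span is replaced by the original; all other spans
    lose their tags but keep their translated content.
    """
    best = {}  # span id -> (similarity, match start offset)
    for match in SPAN_RE.finditer(translated):
        span_id, content = match.group(1), match.group(2)
        original = originals.get(span_id)
        if original is None:
            continue
        score = SequenceMatcher(None, original.lower(), content.lower()).ratio()
        # Strict ">" means the first occurrence wins on ties.
        if span_id not in best or score > best[span_id][0]:
            best[span_id] = (score, match.start())

    def replace(match):
        span_id = match.group(1)
        if span_id in best and best[span_id][1] == match.start():
            return originals[span_id]
        return match.group(2)

    return SPAN_RE.sub(replace, translated)
```

Applied to the Estonian output above with `{"1": "Atrium Mortgage Investment Corporation"}`, this keeps the second span as the entity and unwraps the stray span around "Panka".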