How to exclude number and URL from vocabulary in translation?

ttpro1995 commented 6 years ago

I would like to treat URLs and numbers (e.g. phone numbers) as special tokens in the vocabulary. During translation, each number or URL should be copied from the source into the corresponding position in the target sentence.

How should I do that? Has this feature been implemented yet? If not, where should I start?

Example:
Input: I have 2 cat and 3 dog. Please call me at 823123532 if you see them.
Expected output: Tôi có 2 con mèo và 3 con chó. Vui lòng gọi cho tôi theo số 823123532 nếu bạn thấy chúng.

My situation now is that the translation model tries to put numbers and URLs into the vocabulary, so I sometimes see weird numbers/URLs in the target sentence.

pltrdy commented 6 years ago

I guess the simplest way to achieve this would be to replace such numbers in your dataset with special tags, e.g. @number0, @number1, etc.

Doing this will put those tags in the vocabulary, and the model will be able to learn how to use them properly. The model will then include the special tags in the translation, and you can replace them yourself as a post-processing step. You could also just replace all numbers with the same token, e.g. # or 0.

Example (with tags):

Initial Input: I have 2 cat and 3 dog. Please call me at 823123532 if you see them.
Pre-processed input: I have @number0 cat and @number1 dog. Please call me at @number2 if you see them. (You must keep track of the replacements, e.g. in a file: @number0: 2; @number1: 3; @number2: 823123532)
Expected output: Tôi có @number0 con mèo và @number1 con chó. Vui lòng gọi cho tôi theo số @number2 nếu bạn thấy chúng.
Post-processing output: Tôi có 2 con mèo và 3 con chó. Vui lòng gọi cho tôi theo số 823123532 nếu bạn thấy chúng.
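
A minimal sketch of the tagging and restoring steps in this example, assuming a "number" is simply a run of digits. The helper names `tag_numbers` and `restore_numbers` are hypothetical and are not part of OpenNMT-py; here the replacements are kept in a dict rather than a file.

```python
import re

# Assumption: a number is any run of digits. Adapt the pattern for
# decimals, signs, or phone-number formats as needed.
NUMBER_RE = re.compile(r"\d+")

def tag_numbers(sentence):
    """Replace each number with @number0, @number1, ... and return the mapping."""
    mapping = {}

    def _replace(match):
        tag = "@number%d" % len(mapping)
        mapping[tag] = match.group(0)
        return tag

    return NUMBER_RE.sub(_replace, sentence), mapping

def restore_numbers(sentence, mapping):
    """Replace each known tag with its original number; unknown tags are left as is."""
    return re.sub(r"@number\d+",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  sentence)

tagged, mapping = tag_numbers(
    "I have 2 cat and 3 dog. Please call me at 823123532 if you see them.")
print(tagged)
print(mapping)
```

Because each tag carries its own index, the restore step still works if the model reorders the numbers in the translation.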

I've been using similar tags, which in my case correspond to NER tags in the context of summarization, as in Nallapati 2016. Note that Nallapati replaces all numbers with zeros.
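
For illustration, one possible reading of "replacing numbers with zeros" is to map every digit to 0. This is only a sketch under that assumption (the helper name `zero_out_digits` is hypothetical, and it is not checked against the paper's actual preprocessing). Unlike the tag approach, the original values cannot be restored afterwards.

```python
import re

def zero_out_digits(sentence):
    """Replace every digit character with 0, keeping the number's length."""
    return re.sub(r"\d", "0", sentence)

print(zero_out_digits("Please call me at 823123532."))
```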

Hope it helps

ttpro1995 commented 6 years ago

If I use @number1 and @number2, will OpenNMT-py interpret them as two different vocabulary entries? And will it then assume that the first numbers of different sentences are related, because they are all represented as @number1? (I wonder if the @ makes OpenNMT-py ignore the token or something...)

I also wonder whether replacing all numbers with the same num token helps or hurts.

What I did so far:

Replace numbers and URLs with _num_ and _url_ before feeding the text into the model, while saving the list of original tokens. Then put them back into the output sentence in the same order as in the list. I assume that the order of the original number/URL tokens is preserved (it should be in my dataset).

Please review whether there are any problems with my solution.

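import re

# Assumption: word_tokenize is NLTK's tokenizer.
from nltk.tokenize import word_tokenize

def preprocess(text):
    """
    Replace URLs and numbers with _url_ and _num_.
    :param text: raw input sentence
    :return: (tokenized text, list of numbers, list of URLs)
    """
    text = text.lower()

    num_re = r'[-+]?\d*\.\d+|\d+'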
    WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

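    # Mask URLs first so that digits inside a URL are not picked up as numbers.
    _url_list = re.findall(WEB_URL_REGEX, text)
    result = re.sub(WEB_URL_REGEX, " _url_ ", text)

    # Collect and mask numbers in the URL-free text.
    _num_list = re.findall(num_re, result)
    result2 = re.sub(num_re, " _num_ ", result)
    toks = word_tokenize(result2)

    return " ".join(toks), _num_list, _url_list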

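def postprocess(text, _num_list, _url_list):
    """
    Replace _url_ and _num_ with the original URLs and numbers, in order.
    Note: both lists are consumed (modified in place).
    :param text: translated sentence containing placeholders
    :param _num_list: numbers returned by preprocess
    :param _url_list: URLs returned by preprocess
    :return: text with placeholders restored
    """

    text_tok = text.split(" ")
    for i in range(len(text_tok)):
        if text_tok[i] == "_num_":
            try:  # make sure an empty list does not crash
                num = _num_list.pop(0)  # pop first element
                text_tok[i] = str(num)
            except IndexError:
                pass  # more placeholders than saved numbers: leave as is
        elif text_tok[i] == "_url_":
            try:
                url = _url_list.pop(0)  # pop first element
                text_tok[i] = str(url)
            except IndexError:
                pass  # more placeholders than saved URLs: leave as is

    return " ".join(text_tok)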

srush commented 6 years ago

This is the correct thing to do.

If you want to get even more fancy, you can use -attn_debug to output the attention and use that in case the numbers get reordered.
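
A rough sketch of that idea, assuming the attention has already been loaded into a matrix `attn[t][s]` (weight of target token t on source token s). Parsing the actual -attn_debug output is not shown, and the helper name `restore_with_attention` is hypothetical.

```python
def restore_with_attention(src_toks, tgt_toks, attn, originals, placeholder="_num_"):
    """
    src_toks:  preprocessed source tokens (containing placeholders)
    tgt_toks:  translated tokens (containing placeholders)
    attn:      attn[t][s] = attention weight of target token t on source token s
    originals: original values, in source order
    """
    # Pair each source placeholder position with its original value.
    src_positions = [i for i, tok in enumerate(src_toks) if tok == placeholder]
    value_at = dict(zip(src_positions, originals))

    out = list(tgt_toks)
    for t, tok in enumerate(tgt_toks):
        if tok != placeholder or not value_at:
            continue
        # Pick the source placeholder this target token attends to most.
        best = max(value_at, key=lambda s: attn[t][s])
        out[t] = value_at[best]
    return out

src = ["call", "_num_", "or", "_num_"]
tgt = ["_num_", "ou", "_num_"]
attn = [[0.1, 0.1, 0.1, 0.7],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.7, 0.1, 0.1]]
print(restore_with_attention(src, tgt, attn, ["111", "222"]))
```

This greedy version can assign the same source value to two target placeholders; removing a value once it is used would be a simple variation.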

OpenNMT / OpenNMT-py

How to exclude number and URL from vocabulary in translation? #661