cbaziotis/ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter: 330 million English tweets).

Warning regarding using TextPreProcessor as a preprocessing step for torchtext.data.Field() #7

Closed: davidalbertonogueira closed 6 years ago

davidalbertonogueira commented 6 years ago

As can be seen in the code sample below, we get different results depending on whether TextPreProcessor is applied inside the Field preprocessing pipeline or directly on the whole sentence.

Inside the Field preprocessing pipeline, text_processor is called at the token level rather than the sentence level, so expressions like "October 10th" are not correctly converted to <date>: text_processor receives the two separate tokens "October" and "10th", and the latter is broken into "1 0 th" by that call.
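For context, the Field preprocessing step behaves roughly like the sketch below (a paraphrase for illustration only; field_preprocess is a made-up name, not the actual torchtext source):

def field_preprocess(raw_string, tokenize, preprocessing):
    # approximation of the torchtext behavior: tokenize first
    # (whitespace split by default), then map the preprocessing
    # Pipeline over the resulting token list, one token at a time
    tokens = tokenize(raw_string)
    # "October 10th" therefore arrives as "October" and then "10th"
    return [preprocessing(tok) for tok in tokens]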

from torchtext import data
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              "emphasis", "censored"},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer, or pass your own;
    # the tokenizer should take a string as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...

>>> def custom_processing(x, text_processor):
...    text = " ".join(text_processor.pre_process_doc(x))
...    return text

>>> text = "That Mexico vs USA commercial with trump gets your blood boiling. Race war October 10th. Imagine that parking lot. Gaddamnnnnnn VIOLENCE!!!"

>>> text_to_process = data.Field(preprocessing=data.Pipeline(lambda x: custom_processing(x, text_processor)))
>>> Dataset_input = [data.Example.fromlist(data=[text], fields=[('text', text_to_process)])]
>>> Dataset_input[0]

<torchtext.data.example.Example object at 0x000001ACFA71A748>

>>> Dataset_input[0].text

['that', 'mexico', 'vs', '<allcaps> usa </allcaps>', 'commercial', 'with', 'trump', 'gets', 'your', 'blood', 'boiling .', 'race', 'war', 'october', '1 0 th .', 'imagine', 'that', 'parking', 'lot .', 'gaddamn <elongated>', '<allcaps> violence </allcaps> ! <repeated>']

>>> processed_text = " ".join(text_processor.pre_process_doc(text))
>>> processed_text

'that mexico vs <allcaps> usa </allcaps> commercial with trump gets your blood boiling . race war <date> . imagine that parking lot . gaddamn <elongated> <allcaps> violence </allcaps> ! <repeated>'

>>> print(" ".join(text_processor.pre_process_doc("10th")))

1 0 th
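
A possible workaround (a sketch, not tested across torchtext versions): pass the ekphrasis pipeline as the Field's tokenize argument instead of as preprocessing. tokenize receives the whole raw string, and pre_process_doc already returns a list of tokens, so the full sentence reaches ekphrasis before any whitespace split:

# workaround sketch: let ekphrasis tokenize the full sentence itself,
# so "October 10th" is seen intact and can become <date>
text_to_process = data.Field(tokenize=text_processor.pre_process_doc)

With this, the Example's text should match the sentence-level output shown above ('... race war <date> . ...').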
davidalbertonogueira commented 6 years ago

ref https://github.com/pytorch/text/issues/388