Closed: ElieKadoche closed this issue 4 years ago.
It seems that preprocessing the dataset could be an important step.
What I propose is to create a pre_processing function in utils/datasets.py that does the job, and to integrate it into the model template model_template.py.
Or we could integrate it into the data creation script.
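Just to make the idea concrete, here is a rough sketch of what I have in mind (the function name and signature are only placeholders, not a final API):

# utils/datasets.py -- hypothetical sketch, names are placeholders
def pre_processing(texts):
    """Clean a list of raw strings before feeding them to any model."""
    cleaned = []
    for text in texts:
        text = text.lower()            # e.g. lowercase everything
        text = " ".join(text.split())  # collapse repeated whitespace
        cleaned.append(text)
    return cleaned

# model_template.py -- hypothetical usage
# from utils.datasets import pre_processing
# x_train = pre_processing(x_train)
# x_test = pre_processing(x_test)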
Hi!
Thanks for the remark! I wonder whether it is better to use the same cleaning file for all models (which would make our comparison more transparent), or to define a different cleaning file for each model?
In the first case, as in our classroom lab work, I suggest using this GitHub repository to clean up the data: https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
import re


def clean_str(string, tolower=True):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    if tolower:
        string = string.lower()
    return string.strip()


def loadTexts(filename, limit=-1):
    """
    Text loader for IMDB.
    If limit is set to -1 the whole dataset is loaded; otherwise limit is the number of lines to keep.
    """
    dataset = []
    cpt = 0   # lines kept
    skip = 0  # lines discarded because they are empty after cleaning
    with open(filename) as f:
        for line in f:
            cleanline = clean_str(line).split()
            if not cleanline:
                skip += 1
                continue
            dataset.append(cleanline)
            cpt += 1
            if limit > 0 and cpt >= limit:
                break
    print("Loaded", cpt, "lines from", filename, "/", skip, "lines discarded")
    return dataset
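For example, on one of the IMDB files it would be used roughly like this (the file path is just a placeholder):

# Hypothetical usage; the path is a placeholder.
train_pos = loadTexts("data/imdb/train/pos.txt", limit=1000)
print(train_pos[0])  # a list of cleaned tokens
print(clean_str("It's a (really) good movie!"))
# -> "it 's a ( really ) good movie !"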
I totally agree, we should define the same cleaning script for all models. This is the reason why it should be done in the data creation script.
I personally do not like to copy / paste code from other projects. We can use them as inspiration and cite them, but simply taking code like that is not very elegant. Or we could find a specific library designed for such a job. In any case, we should justify all the choices we make.
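For instance, NLTK ships a tokenizer aimed at this kind of noisy text; something along these lines could be a starting point (an illustration only, not a decision):

# Illustration only: one possible library-based approach (NLTK).
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False,  # lowercase everything
                           strip_handles=True,   # drop @mentions
                           reduce_len=True)      # shorten very long character runs
tokens = tokenizer.tokenize("@switchfoot - Awww, that's a bummer!")
# mentions removed, text lowercased, punctuation split into separate tokens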
Feel free to create a pull request explaining your motivation and your implementation choices.
Hi!
Thanks for the remark! I pushed a clean_dataset.py file on the model branch!
Looking at the Twitter dataset, I think it is worth removing punctuation, mentions (@), hashtags (#), HTML code, URLs, extra spaces, etc., and lowercasing the words.
Here is an example before and after my proposed cleaning of the dataset:
x_test[0]: "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
x_test_clean[0]: 'awww that s a bummer you shoulda got david carr of third day to do it d'
x_train[110]: "RT @designplay Goodby, Silverstein's new site: http://www.goodbysilverstein.com/ I enjoy it. nice find!"
x_train_clean[110]: 'rt goodby silverstein s new site i enjoy it nice find'
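The cleaning itself boils down to a handful of regular expressions; roughly something like this (a simplified sketch, not the exact content of clean_dataset.py):

import re

def clean_tweet(text):
    """Simplified sketch of the cleaning steps described above."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-zA-Z ]", " ", text)             # punctuation and digits
    text = re.sub(r"\s+", " ", text)                    # extra spaces
    return text.lower().strip()

# clean_tweet("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.")
# -> 'awww that s a bummer'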
You CANNOT push to a branch that does not belong to you. You need to create a new one. Please stop doing that.
But the function seems to work well, thank you. I will make a few small modifications to integrate it better into the project.
Remarks: stop pushing to an existing branch, you need to create a new one. Your code also needs to respect the PEP 8 specification.
But overall your code seems good, thank you!
Since it is a plain for loop, it takes quite some time on the training dataset, but it seems to work well. There were several mistakes in your script; you can check the commit to see them.
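If the running time ever becomes a real problem, one option (an idea only, not what the commit does) would be to clean the lines in parallel with the standard library:

# Idea only: parallel cleaning, not what clean_dataset.py currently does.
from multiprocessing import Pool

def clean_all(lines, workers=4):
    with Pool(workers) as pool:
        return pool.map(clean_str, lines)  # or any other line-level cleaning function

# x_train_clean = clean_all(x_train)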
I consider this issue closed, solved in #5. It can be re-opened anytime if someone sees a problem with the clean_dataset.py script.
The dataset seems to contain a lot of noise, such as "@" mentions, "#" hashtags, or even links. Should we clean that, for both the training and testing datasets, before running any model?