azhe825 / CSC510

Course Project for CSC 510, 2016 spring
Preprocessed Data #24

Closed: amritbhanu closed this issue 8 years ago

amritbhanu commented 8 years ago

@imaginationsuper Could you please collapse the extra whitespace in dataset.txt so that each run becomes a single space? That would make the file easier for our code to parse.

amritbhanu commented 8 years ago

This code will do the job.

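import re
# Collapse every run of whitespace into a single space, then append a newline.
re.sub(r'\s+', ' ', "abc   xyz     lmn") + "\n"
# result: abc xyz lmn (followed by the newline)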
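
For whoever applies this to the whole file, here is a rough sketch of the same substitution run over several lines. The helper names `collapse_whitespace` and `normalize_lines` are made up for illustration, and trimming the ends of each line and dropping blank lines are my assumptions, not something the dataset format requires.

```python
import re


def collapse_whitespace(line):
    """Collapse each run of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', line).strip()


def normalize_lines(lines):
    """Collapse whitespace in every line and drop lines that end up empty.

    Dropping empty lines is an assumption; keep them if they carry meaning.
    """
    cleaned = (collapse_whitespace(line) for line in lines)
    return [line for line in cleaned if line]


# Hypothetical sample input standing in for lines read from dataset.txt.
sample = ["abc   xyz     lmn\n", "\n", "  foo\tbar  \n"]
print(normalize_lines(sample))
```

Reading the file with `open(...)`, passing its lines through `normalize_lines`, and writing the result back joined by "\n" should be enough, assuming one record per line.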

jerry-shijieli commented 8 years ago

Code updated. Use emailParserX.py to generate dataset.txt and dataConversionF.py to convert it to word vectors.

jerry-shijieli commented 8 years ago

The 'L' suffix on numbers in wordVectors.txt marks the long int data type (Python 2 prints long integers with a trailing 'L'). These values are produced by the scikit-learn tokenizer.
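
If the suffix gets in the way when reading the file from other code, something like the sketch below should strip it. The function name `strip_long_suffix` and the sample string are hypothetical; the actual layout of wordVectors.txt may differ, so treat this only as a starting point.

```python
import re

# Matches a whole integer token followed by the Python 2 long suffix,
# e.g. "12L", but not an 'L' that is part of a word.
_LONG_SUFFIX = re.compile(r'\b(\d+)L\b')


def strip_long_suffix(text):
    """Remove the trailing 'L' from integer tokens, leaving other text alone."""
    return _LONG_SUFFIX.sub(r'\1', text)


# Hypothetical line in the style of a printed Python 2 list of longs.
print(strip_long_suffix("[3L, 0L, 12L]"))
```

A regex with word boundaries is used rather than a plain replace of 'L' so that words containing the letter are not damaged.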