Closed choldgraf closed 8 years ago
@juanshishido when you create the "co-occurrence" matrix, how big is it? If it's big or takes a while to make, then we should have a script that creates this data in a separate HDF5 file or something. Otherwise we can just generate it on the fly.
@choldgraf The matrix is (611110, 165508) on UCB_dept_merge.csv. This could use more/better text cleaning. For example, keeping forward slashes only when they pertain to actual fractions. I'm currently just:
- Removing NaNs
- Changing all text to lowercase
- Keeping only alphanumeric characters, '.', and '%'
- Removing periods at the start of the string
- Replacing multiple whitespace characters with a single space
Here is the code:
import re
import numpy as np

for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True) \
                     .apply(lambda x: x.lower()) \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%]+', ' ', x)) \
                     .apply(lambda x: re.sub(r'^\.+', '', x)) \
                     .apply(lambda x: re.sub(r'^/', '', x)) \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
(Looks like I am removing single forward slashes, too, though this is probably already taken care of by re.sub(r'[^A-Za-z0-9.%]+', ' ', x).)
Maybe we can use a stemmer, though I'm not sure it would be appropriate here. This would cut down on the number of features.
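If we did go the stemming route, a minimal sketch with NLTK's PorterStemmer (assuming nltk is installed; the tokens here are made up) would look like:

```python
# Stemming collapses inflected forms to a common stem, which would
# shrink the vocabulary / feature count.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['purchasing', 'purchases', 'purchased']
stemmed = [stemmer.stem(t) for t in tokens]
# All three forms collapse to a single stem.
```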
While it doesn't take too long to run, it's probably a good idea to save the matrix in a file. I'll look into HDF5 (I've only read from that format once).
By the way, there are two LDA modules (probably more) in Python.
Sounds good - looks like we can use sklearn then for now, and see what kinds of output / running time we get.
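A minimal sketch of the sklearn route (the corpus and topic count below are illustrative, not from the repo):

```python
# Vectorize a few toy PO descriptions, then fit LDA and get
# per-document topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['office supplies paper toner',
          'laboratory reagents chemicals',
          'software license renewal']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a topic distribution summing to 1
```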
Regarding file formats, for now we can just try to save it in pandas HDF5 format. The call is just df.to_hdf(file_path, data_key), I think.
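The round trip would look something like this (a sketch; requires PyTables, and the path and key names here are illustrative):

```python
# Save a DataFrame to HDF5 and read it back.
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'word': ['paper', 'toner'], 'count': [10, 3]})

path = os.path.join(tempfile.mkdtemp(), 'cooccurrence.h5')
df.to_hdf(path, key='data')            # write
df2 = pd.read_hdf(path, key='data')    # read back the same frame
```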
Also - it looks like you can accomplish a lot of your modifications in pandas using the DataFrame.str methods. These give you a lot of string operation abilities that are similar to what you're doing with .apply and regular expressions.
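For example, part of the cleaning above could be written with .str methods like this (a sketch on a toy Series, not the actual data):

```python
# Chain vectorized string operations instead of .apply(lambda ...).
import pandas as pd

s = pd.Series(['  Foo./Bar  ', None, 'BAZ%  qux'])
cleaned = (s.fillna('')
            .str.lower()
            .str.replace(r'[^a-z0-9.%]+', ' ', regex=True)  # after lowercasing
            .str.replace(r'^\.+', '', regex=True)
            .str.replace(r'\s+', ' ', regex=True)
            .str.strip())
```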
Thanks, @choldgraf. I've battled through regex to get something that's pretty solid.
import re
import numpy as np

for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True) \
                     .apply(lambda x: x.lower()) \
                     .apply(lambda x: re.sub(r'(http\S*|www\S*)', '', x)) \
                     .apply(lambda x: re.sub(r'((?<=\D)/|/(?=\D))', ' ', x)) \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%/]+', ' ', x)) \
                     .apply(lambda x: re.sub(r'\.+', '', x)) \
                     .apply(lambda x: re.sub(r'(?<=\s)\w(?=\s)|(?<=\s)\d(?=\s)', '', x)) \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
Same replacements as before, but now also removing URLs, replacing slashes between non-digits with spaces (keeping them in fractions), removing periods, and dropping stranded single characters.
@choldgraf I've uploaded three files to Drive, all based on a random sample of 20,000: the 100 topic definitions (with the 10 most frequent words each), the top 5 and bottom 5 topics, and the top 100 words (with frequencies).
I planned on putting the top 100 words into a word cloud, but I'm having trouble with the module. It can't access something called PIL and its associated fonts.
Nice! A couple thoughts: for the PIL issue, try installing it with conda or pip.
...did you try that? Thanks, @choldgraf!
In the CountVectorizer, I use the English stop words with:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer, remove stop words
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english')
But it definitely seems like it's not doing what we need. I will use the words in the link you provided and rerun. Thanks for that!
Yeah - I think I saw quite a few words in there. Maybe the CountVectorizer is just not initialized properly? Either way, a quick words = [word for word in words if word not in drop_list] should suffice, more or less.
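Another option is to fold the drop list into the vectorizer itself by extending the built-in English stop words (a sketch; the drop_list contents and corpus are illustrative):

```python
# Combine sklearn's built-in English stop words with a custom drop list.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

drop_list = ['ucb', 'dept']
stop_words = list(ENGLISH_STOP_WORDS) + drop_list

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(['the ucb dept ordered toner'])
# Stop words and drop-list words never enter the vocabulary.
```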
And let me know if you've got a word cloud working. This is the only project that I don't have a picture for :)
I'm going to just use the top 100 words on the list since it seems to have a lot of words that look like they might be meaningful.
I'll work on the word cloud after rerunning the LDA. I'll do my best to get it working.
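Pulling the top words with their frequencies (the input a word cloud would need) is straightforward with the standard library; a sketch, with tokens standing in for the cleaned corpus:

```python
# Count token frequencies and take the most common ones.
from collections import Counter

tokens = ['paper', 'toner', 'paper', 'lab', 'paper', 'toner']
top = Counter(tokens).most_common(100)  # [(word, count), ...], highest first
```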
Here is my update.
I'm now using nltk.corpus.stopwords for the stop word list.
I have the data prepped to feed to the stm library in R. Planning to run that tomorrow (Wednesday).
Hey @choldgraf,
Let me know what you think about this. Should I push the topic definitions files to the results folder?
Thanks.
Nice - that looks great. Do you have the outputs from latest clustering for browsing? You can just put the output in a text file and put it in the results folder and I'll take a look if you like.
Have you played around with varying the number of clusters to see if that makes a big difference?
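One way to compare cluster counts is a simple sweep over the number of topics, scoring each fit by perplexity (a sketch on a toy corpus; lower perplexity is roughly better):

```python
# Fit LDA for each candidate topic count and record perplexity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['office paper toner', 'lab reagents chemicals',
          'software license renewal', 'office chairs desks']
X = CountVectorizer().fit_transform(corpus)

scores = {}
for k in range(5, 16):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)
```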
I've just put those files in the results folder. There is one for each of the cluster counts I tried, 5 through 15. Not seeing any software topic at first glance.
Can you put these in the google drive results folder rather than committing them? Just delete any outdated results or put it in a date-specific folder.
Oops. Just put them on Drive.
Hmm - so it looks like a word can live in multiple categories then, no? I see "office" in the first two groups...
Yes, that's true. Just looked at the LDA page on Wikipedia and it says that a "word may occur in several topics with a different probability."
I'm going to try stm and see how it goes.
OK, it could be that you can get a listing of the probability values themselves, then just assign each word to the topic with the highest probability.
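That assignment step is just an argmax over a topics-by-words probability matrix; a sketch with made-up values:

```python
# Assign each word to the topic where it has the highest probability.
import numpy as np

# rows: topics, columns: words
topic_word = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
words = ['office', 'software', 'lab']

best_topic = topic_word.argmax(axis=0)  # topic index per word
assignment = dict(zip(words, best_topic))
```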
This project will use classification methods to categorize / cluster POs based on what's in the description field, supplier name, etc.
Project lead is @juanshishido with assistance from @kaiweitan