Closed choldgraf closed 8 years ago
@juanshishido when you create the "co-occurrence" matrix, how big is it? If it's big or takes a while to make, then we should have a script that creates this data in a separate HDF5 file or something. Otherwise we can just generate it on the fly.
@choldgraf The matrix is (611110, 165508) on UCB_dept_merge.csv. This could use more/better text cleaning. For example, keeping forward slashes only when they pertain to actual fractions. I'm currently just:
- Removing NaNs
- Changing all text to lowercase
- Keeping only alphanumeric characters, '.', and '%'
- Removing periods at the start of the string
- Replacing multiple whitespace characters with a single space
Here is the code:
import re
import numpy as np

for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True) \
                     .apply(lambda x: x.lower()) \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%]+', ' ', x)) \
                     .apply(lambda x: re.sub(r'^\.+', '', x)) \
                     .apply(lambda x: re.sub(r'^/', '', x)) \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
(Looks like I am removing single forward slashes, too, though this is probably already taken care of by re.sub(r'[^A-Za-z0-9.%]+', ' ', x).)
Maybe we can use a stemmer, though I'm not sure it would be appropriate here. This would cut down on the number of features.
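If we did go the stemming route, a minimal sketch with NLTK's PorterStemmer (assuming nltk is installed; the tokens here are made up) would look like:

```python
# Stemming collapses inflected forms to a common stem, which would
# shrink the vocabulary / feature count.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['purchasing', 'purchases', 'purchased']
stemmed = [stemmer.stem(t) for t in tokens]
# All three forms collapse to a single stem.
```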
While it doesn't take too long to run, it's probably a good idea to save the matrix in a file. I'll look into HDF5 (I've only read from that format once).
By the way, there are two LDA modules (probably more) in Python.
Sounds good - looks like we can use sklearn then for now, and see what kinds of output / running time we get.
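A minimal sketch of the sklearn route (the corpus and topic count below are illustrative, not from the repo):

```python
# Vectorize a few toy PO descriptions, then fit LDA and get
# per-document topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['office supplies paper toner',
          'laboratory reagents chemicals',
          'software license renewal']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a topic distribution summing to 1
```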
Regarding file formats, for now we can just try to save it in pandas HDF5 format. The call is just df.to_hdf(file_path, data_key), I think.
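The round trip would look something like this (a sketch; requires PyTables, and the path and key names here are illustrative):

```python
# Save a DataFrame to HDF5 and read it back.
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'word': ['paper', 'toner'], 'count': [10, 3]})

path = os.path.join(tempfile.mkdtemp(), 'cooccurrence.h5')
df.to_hdf(path, key='data')            # write
df2 = pd.read_hdf(path, key='data')    # read back the same frame
```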
Also - it looks like you can accomplish a lot of your modifications in pandas using the DataFrame.str methods. These give you a lot of string operation abilities that are similar to what you're doing with .apply and regular expressions.
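For example, part of the cleaning above could be written with .str methods like this (a sketch on a toy Series, not the actual data):

```python
# Chain vectorized string operations instead of .apply(lambda ...).
import pandas as pd

s = pd.Series(['  Foo./Bar  ', None, 'BAZ%  qux'])
cleaned = (s.fillna('')
            .str.lower()
            .str.replace(r'[^a-z0-9.%]+', ' ', regex=True)  # after lowercasing
            .str.replace(r'^\.+', '', regex=True)
            .str.replace(r'\s+', ' ', regex=True)
            .str.strip())
```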
Thanks, @choldgraf. I've battled through regex to get something that's pretty solid.
import re
import numpy as np

for col in cols:
    df[col] = df[col].replace(np.nan, '', regex=True) \
                     .apply(lambda x: x.lower()) \
                     .apply(lambda x: re.sub(r'(http\S*|www\S*)', '', x)) \
                     .apply(lambda x: re.sub(r'((?<=\D)/|/(?=\D))', ' ', x)) \
                     .apply(lambda x: re.sub(r'[^A-Za-z0-9.%/]+', ' ', x)) \
                     .apply(lambda x: re.sub(r'\.+', '', x)) \
                     .apply(lambda x: re.sub(r'(?<=\s)\w(?=\s)|(?<=\s)\d(?=\s)', '', x)) \
                     .apply(lambda x: re.sub(r'\s+', ' ', x).strip())
Same replacements as before, but now also removing URLs, replacing slashes between non-digits with spaces (keeping them in fractions), removing periods, and dropping stranded single characters.
@choldgraf I've uploaded three files to Drive, all based on a random sample of 20,000: the 100 topic definitions (with the 10 most frequent words each), the top 5 and bottom 5 topics, and the top 100 words (with frequencies).
I planned on putting the top 100 words into a word cloud, but I'm having trouble with the module. It can't access something called PIL and its associated fonts.
Nice! A couple thoughts: for the PIL issue, try installing it with conda or pip.
...did you try that? Thanks, @choldgraf!
In the CountVectorizer, I use the English stop words with:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer, remove stop words
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english')
But it definitely seems like it's not doing what we need. I will use the words in the link you provided and rerun. Thanks for that!
Yeah - I think I saw quite a few words in there. Maybe the CountVectorizer is just not initialized properly? Either way, a quick words = [word for word in words if word not in drop_list] should suffice, more or less.
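Another option is to fold the drop list into the vectorizer itself by extending the built-in English stop words (a sketch; the drop_list contents and corpus are illustrative):

```python
# Combine sklearn's built-in English stop words with a custom drop list.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

drop_list = ['ucb', 'dept']
stop_words = list(ENGLISH_STOP_WORDS) + drop_list

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(['the ucb dept ordered toner'])
# Stop words and drop-list words never enter the vocabulary.
```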
And let me know if you've got a word cloud working. This is the only project that I don't have a picture for :)
I'm going to just use the top 100 words on the list since it seems to have a lot of words that look like they might be meaningful.
I'll work on the word cloud after rerunning the LDA. I'll do my best to get it working.
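Pulling the top words with their frequencies (the input a word cloud would need) is straightforward with the standard library; a sketch, with tokens standing in for the cleaned corpus:

```python
# Count token frequencies and take the most common ones.
from collections import Counter

tokens = ['paper', 'toner', 'paper', 'lab', 'paper', 'toner']
top = Counter(tokens).most_common(100)  # [(word, count), ...], highest first
```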
Here is my update.
I'm now using nltk.corpus.stopwords for the stop word list.
I have the data prepped to feed to the stm library in R. Planning to run that tomorrow (Wednesday).
Hey @choldgraf,
Let me know what you think about this. Should I push the topic definitions files to the results folder?
Thanks.
Nice - that looks great. Do you have the outputs from latest clustering for browsing? You can just put the output in a text file and put it in the results folder and I'll take a look if you like.
Have you played around with varying the number of clusters to see if that makes a big difference?
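One way to compare cluster counts is a simple sweep over the number of topics, scoring each fit by perplexity (a sketch on a toy corpus; lower perplexity is roughly better):

```python
# Fit LDA for each candidate topic count and record perplexity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['office paper toner', 'lab reagents chemicals',
          'software license renewal', 'office chairs desks']
X = CountVectorizer().fit_transform(corpus)

scores = {}
for k in range(5, 16):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)
```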
I've just put those files in the results folder. There is one for each of the cluster counts I tried, 5 through 15. Not seeing any software topic at first glance.
Can you put these in the google drive results folder rather than committing them? Just delete any outdated results or put it in a date-specific folder.
Oops. Just put them on Drive.
Hmm - so it looks like a word can live in multiple categories then, no? I see "office" in the first two groups...
Yes, that's true. Just looked at the LDA page on Wikipedia and it says that a "word may occur in several topics with a different probability."
I'm going to try stm and see how it goes.
OK, it could be that you can get a listing of the probability values themselves, then just assign each word to the topic with the highest probability.
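That assignment step is just an argmax over a topics-by-words probability matrix; a sketch with made-up values:

```python
# Assign each word to the topic where it has the highest probability.
import numpy as np

# rows: topics, columns: words
topic_word = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
words = ['office', 'software', 'lab']

best_topic = topic_word.argmax(axis=0)  # topic index per word
assignment = dict(zip(words, best_topic))
```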
This project will use classification methods to categorize / cluster POs based on what's in the description field, supplier name, etc.
Project lead is @juanshishido with assistance from @kaiweitan