JuliaText / TextAnalysis.jl

Julia package for text analysis
Other
373 stars 95 forks source link

Added Word Co Occurrence Matrix Function #164

Closed djokester closed 4 years ago

djokester commented 5 years ago

Additional Dependency Added: Pandas.jl Functionality: Given a string and window size calculate word co-occurrence matrix for the string. Additional Functionality: User can put in a list of strings to calculate word-co-occurrences with variable window size. Returns a DataFrame df from which df[Word1][Word2] returns the co-occurrence value. Tests and Documentation to be completed.

djokester commented 5 years ago

@aviks have a look once you are free.

zgornel commented 5 years ago

This pull contains quite a few anti-patterns. Unfortunately I have no time to review it properly and looks quite unreadable. FWIW, an implementation of the co-occurence matrix similar to the DTM can be found here: https://github.com/zgornel/StringAnalysis.jl/blob/master/src/coom.jl

Ayushk4 commented 5 years ago

This implementation seems to be running 10 - 15X slower than the one in StringAnalysis. Porting the StringAnalysis one might be a better option. That one also supports the existing Documents/Corpora types and matches the prevailing style within the code-base.

djokester commented 5 years ago

This implementation seems to be running 10 - 15X slower than the one in StringAnalysis. Porting the StringAnalysis one might be a better option. That one also supports the existing Documents/Corpora types and matches the prevailing style within the code-base.

The String Analysis version doesn't allow you to have a window size equal to the sentence length which I need. Also, I used NamedArrays.jl for making it more user-friendly. I will put up a time log of the various parts to check where my time is going. Cosmetic changes or any other changes which would improve speed would be appreciated.

Ayushk4 commented 5 years ago

Cool, then. I didn't notice that minor window ~length~ size difference. My bad.

djokester commented 5 years ago

@Ayushk4 if you find time suggest improvements. I think the time difference is because I tried to accommodate different input types as well as tried to manipulate the matrix through NamedArrays.jl Ekbar check kar lena

zgornel commented 5 years ago

The String Analysis version doesn't allow you to have a window size equal to the sentence length which I need. Also, I used NamedArrays.jl for making it more user-friendly.

Actually, you can process co-occurences in full sentences however each sentence must be treated as a document. The snippet below provides a simple example.

txt = "a first sentence. second sentence. third sentence which is longer"
sentences = String.(split(txt, "."))
crps = Corpus(StringDocument.(sentences))
cm = CooMatrix(crps, window=100)
Matrix(coom(cm))

I fail to actually see the utility of NamedArrays.jl, especially since one can easily wrap at a later stage the resulting matrix if needed:

vocab = collect(keys(cm.column_indices))
NamedArray(coom(cm), (vocab, vocab))
Ayushk4 commented 5 years ago

Providing a minimalistic, fast API seems better to me. Especially, if the API is providing the flexibility to the users for modifying it at a later stage as per the user's needs. I think I agree with Corneliu on this.

aviks commented 4 years ago

Hi @djokester thanks for doing this. This PR was the springboard to #165, so it's certainly been useful. We really appreciate your efforts.