Closed djokester closed 4 years ago
@aviks have a look once you are free.
This pull contains quite a few anti-patterns. Unfortunately I have no time to review it properly and looks quite unreadable. FWIW, an implementation of the co-occurence matrix similar to the DTM can be found here: https://github.com/zgornel/StringAnalysis.jl/blob/master/src/coom.jl
This implementation seems to be running 10 - 15X slower than the one in StringAnalysis. Porting the StringAnalysis one might be a better option. That one also supports the existing Documents/Corpora types and matches the prevailing style within the code-base.
This implementation seems to be running 10 - 15X slower than the one in StringAnalysis. Porting the StringAnalysis one might be a better option. That one also supports the existing Documents/Corpora types and matches the prevailing style within the code-base.
The String Analysis version doesn't allow you to have a window size equal to the sentence length which I need. Also, I used NamedArrays.jl for making it more user-friendly. I will put up a time log of the various parts to check where my time is going. Cosmetic changes or any other changes which would improve speed would be appreciated.
Cool, then. I didn't notice that minor window ~length~ size difference. My bad.
@Ayushk4 if you find time suggest improvements. I think the time difference is because I tried to accommodate different input types as well as tried to manipulate the matrix through NamedArrays.jl Ekbar check kar lena
The String Analysis version doesn't allow you to have a window size equal to the sentence length which I need. Also, I used NamedArrays.jl for making it more user-friendly.
Actually, you can process co-occurences in full sentences however each sentence must be treated as a document. The snippet below provides a simple example.
txt = "a first sentence. second sentence. third sentence which is longer"
sentences = String.(split(txt, "."))
crps = Corpus(StringDocument.(sentences))
cm = CooMatrix(crps, window=100)
Matrix(coom(cm))
I fail to actually see the utility of NamedArrays.jl
, especially since one can easily wrap at a later stage the resulting matrix if needed:
vocab = collect(keys(cm.column_indices))
NamedArray(coom(cm), (vocab, vocab))
Providing a minimalistic, fast API seems better to me. Especially, if the API is providing the flexibility to the users for modifying it at a later stage as per the user's needs. I think I agree with Corneliu on this.
Hi @djokester thanks for doing this. This PR was the springboard to #165, so it's certainly been useful. We really appreciate your efforts.
Additional Dependency Added: Pandas.jl Functionality: Given a string and window size calculate word co-occurrence matrix for the string. Additional Functionality: User can put in a list of strings to calculate word-co-occurrences with variable window size. Returns a DataFrame df from which df[Word1][Word2] returns the co-occurrence value. Tests and Documentation to be completed.