Closed atantos closed 5 months ago
Thanks. Would be good to get a test for this as well.
Well, since I am not experienced in writing tests tbh, do you think you could redirect me to a short example so that I prepare one? Regards!
Hi @atantos , last month I did some code refactoring and changed coom.jl. As a result, your code cannot be merged automatically. At the same time, I have some comments about your changes:
you changed an order of arguments in many methods. This will break compatibility of the API for current users. Instead of direction::Bool
as an argument, I think it would be better to use the keyword argument mode
and declare it by this way: CooMatrix(crps::Corpus, terms::Vector{String}; window::Int=5, normalize::Bool=true; mode=:default)
with possible values of the mode
as [:default, :directional]
regarding the implementation of the method function coo_matrix(..)
. You have a copy of another method. But at the same time, I see the only difference in the range of the inner cycle. for j in max(1, i-window):min(m, i+window)
for the default case. And for j in i:min(m, i + window)
for the directional case. If this is the only difference, the code may be implemented in the same method in the form:
for (i, token) in enumerate(doc)
inner_range = if mode == :directional
i:min(m, i + window)
else
max(1, i-window):min(m, i+window)
end
# looking forward
@inbounds for j in inner_range
...
end
end
Unit-tests for the default case are implemented in https://github.com/JuliaText/TextAnalysis.jl/blob/master/test/coom.jl
Hi @rssdev10 !
Thank you for your comments and am sorry for the delay!
- you changed an order of arguments in many methods. This will break compatibility of the API for current users. Instead of
direction::Bool
as an argument, I think it would be better to use the keyword argumentmode
and declare it by this way:CooMatrix(crps::Corpus, terms::Vector{String}; window::Int=5, normalize::Bool=true; mode=:default)
with possible values of themode
as[:default, :directional]
Am not sure which ones you mean. I placed the direction
argument in the fourth position everywhere. However, I find your alternative with the mode
argument elegant and I would have no objection going this way in the type declation.
- regarding the implementation of the method
function coo_matrix(..)
. You have a copy of another method. But at the same time, I see the only difference in the range of the inner cycle.for j in max(1, i-window):min(m, i+window)
for the default case. Andfor j in i:min(m, i + window)
for the directional case. If this is the only difference, the code may be implemented in the same method in the form:for (i, token) in enumerate(doc) inner_range = if mode == :directional i:min(m, i + window) else max(1, i-window):min(m, i+window) end # looking forward @inbounds for j in inner_range ... end end
Yes, actually I would not need to write a second method. One is enough with the if-else solution. So, no objection here either.
- Unit-tests for the default case are implemented in https://github.com/JuliaText/TextAnalysis.jl/blob/master/test/coom.jl
Thanks for the link! How should we proceed from now on? Should you make the changes and add them to the commit? Am really not that familiar with how it works in cases like this. And, lastly. There is still a conflict and am not sure whether it has to do with the changes you made, although it does not seem to be the case.
Your feedback and ideas are greatly appreciated!
Alex
Hi Alex!
I placed the direction argument in the fourth position everywhere.
There is a set of arguments in the current API that people are already using as described in the documentation. But if you add something to the fourth position when a method has 5 arguments, this means that you are making this method incompatible with previous code where something else was expected at the 4th position. So, abstractly, the new arguments should not be in the same 4th position, but in the last position. And if it is with some default value, this means that even the changed methods will be compatible with the previous code. Anyway, a keyword argument looks better in your case.
Thanks for the link! How should we proceed from now on? Should you make the changes and add them to the commit? Am really not that familiar with how it works in cases like this. And, lastly. There is still a conflict and am not sure whether it has to do with the changes you made, although it does not seem to be the case.
The changes I made are already in the master branch. So all you need to do is git pull origin master
and/or git merge master
in your local branch. Then resolve conflicts, add tests, do commit and push it back into the same branch with or without git rebase
.
Your feedback and ideas are greatly appreciated!
Let me know if you have any questions. Or, use Julia Slack for private questions.
I had to update this comment, since I did not have to do what I suggested earlier. So, I kept your advice with the inner_range
. I made all the changes, I added tests for my case and also a few others for a case that was missing and, am looking forward for the pull request to be "approved".
Thanks again for the help, @rssdev10 !
Alex
Thanks, merged in the master branch.
Added a directional
coo_matrix()
version tocoom.jl
so that the directional or asymmetric coocurrence matrix can be built withCooMatrix()
.The current version of coo_matrix() is bidirectional or symmetric and it looks ahead and back in the context of a window size n from the focus word. The directional coocurrence matrix takes into consideration only words that follow and in that sense it is asymmetric. The main result is that the coocurrence matrix is no longer symmetrical in the linear algebra sense and the cells do not include values that represent double frequencies of two words (of the crossing row and column). This means that no frequency value represents the frequency of word A in the vicinity of a window n around B as well as the opposite. Let's assume that word A is on row i and word B on column j. The cell (i,j) has the frequency of word B found n words after the word A on the text. The cell (j,i), on the other hand, has the has the frequency of word A found n words after the word B on the text. These two values, apparently are not identical. That is why the directional cooccurence matrix is non-symmetric.
On coom.jl the new method
coo_matrix()
(on line 59) is added that redirects to the originalcoo_matrix()
if the direction argument is set tofalse
.The
direction::Bool
argument is also added to all CooMatrix() constructor methods that usecoo_matrix()
further down the line (on the same file).Having a directional coocurrence matrix will make things easier for calculating word association measures.
Please ignore the earlier September commits that are only indirectly related to today's commit.