JuliaText / TextAnalysis.jl

Julia package for text analysis
Other
373 stars 95 forks source link

Directional coom #264

Closed atantos closed 5 months ago

atantos commented 1 year ago

Added a directional coo_matrix() version to coom.jl so that the directional or asymmetric coocurrence matrix can be built with CooMatrix().

The current version of coo_matrix() is bidirectional or symmetric and it looks ahead and back in the context of a window size n from the focus word. The directional coocurrence matrix takes into consideration only words that follow and in that sense it is asymmetric. The main result is that the coocurrence matrix is no longer symmetrical in the linear algebra sense and the cells do not include values that represent double frequencies of two words (of the crossing row and column). This means that no frequency value represents the frequency of word A in the vicinity of a window n around B as well as the opposite. Let's assume that word A is on row i and word B on column j. The cell (i,j) has the frequency of word B found n words after the word A on the text. The cell (j,i), on the other hand, has the has the frequency of word A found n words after the word B on the text. These two values, apparently are not identical. That is why the directional cooccurence matrix is non-symmetric.

On coom.jl the new method coo_matrix() (on line 59) is added that redirects to the original coo_matrix() if the direction argument is set to false.

The direction::Bool argument is also added to all CooMatrix() constructor methods that use coo_matrix() further down the line (on the same file).

Having a directional coocurrence matrix will make things easier for calculating word association measures.

Please ignore the earlier September commits that are only indirectly related to today's commit.

atantos commented 11 months ago

Thanks. Would be good to get a test for this as well.

Well, since I am not experienced in writing tests tbh, do you think you could redirect me to a short example so that I prepare one? Regards!

rssdev10 commented 7 months ago

Hi @atantos , last month I did some code refactoring and changed coom.jl. As a result, your code cannot be merged automatically. At the same time, I have some comments about your changes:

atantos commented 6 months ago

Hi @rssdev10 !

Thank you for your comments and am sorry for the delay!

  • you changed an order of arguments in many methods. This will break compatibility of the API for current users. Instead of direction::Bool as an argument, I think it would be better to use the keyword argument mode and declare it by this way: CooMatrix(crps::Corpus, terms::Vector{String}; window::Int=5, normalize::Bool=true; mode=:default) with possible values of the mode as [:default, :directional]

Am not sure which ones you mean. I placed the direction argument in the fourth position everywhere. However, I find your alternative with the mode argument elegant and I would have no objection going this way in the type declation.

  • regarding the implementation of the method function coo_matrix(..). You have a copy of another method. But at the same time, I see the only difference in the range of the inner cycle. for j in max(1, i-window):min(m, i+window) for the default case. And for j in i:min(m, i + window) for the directional case. If this is the only difference, the code may be implemented in the same method in the form:
for (i, token) in enumerate(doc)
    inner_range = if mode == :directional
        i:min(m, i + window)
    else
        max(1, i-window):min(m, i+window)
    end

     # looking forward
    @inbounds for j in inner_range
    ...
    end
end

Yes, actually I would not need to write a second method. One is enough with the if-else solution. So, no objection here either.

Thanks for the link! How should we proceed from now on? Should you make the changes and add them to the commit? Am really not that familiar with how it works in cases like this. And, lastly. There is still a conflict and am not sure whether it has to do with the changes you made, although it does not seem to be the case.

Your feedback and ideas are greatly appreciated!

Alex

rssdev10 commented 6 months ago

Hi Alex!

I placed the direction argument in the fourth position everywhere.

There is a set of arguments in the current API that people are already using as described in the documentation. But if you add something to the fourth position when a method has 5 arguments, this means that you are making this method incompatible with previous code where something else was expected at the 4th position. So, abstractly, the new arguments should not be in the same 4th position, but in the last position. And if it is with some default value, this means that even the changed methods will be compatible with the previous code. Anyway, a keyword argument looks better in your case.

Thanks for the link! How should we proceed from now on? Should you make the changes and add them to the commit? Am really not that familiar with how it works in cases like this. And, lastly. There is still a conflict and am not sure whether it has to do with the changes you made, although it does not seem to be the case.

The changes I made are already in the master branch. So all you need to do is git pull origin master and/or git merge master in your local branch. Then resolve conflicts, add tests, do commit and push it back into the same branch with or without git rebase.

Your feedback and ideas are greatly appreciated!

Let me know if you have any questions. Or, use Julia Slack for private questions.

atantos commented 6 months ago

I had to update this comment, since I did not have to do what I suggested earlier. So, I kept your advice with the inner_range. I made all the changes, I added tests for my case and also a few others for a case that was missing and, am looking forward for the pull request to be "approved".

Thanks again for the help, @rssdev10 !

Alex

rssdev10 commented 5 months ago

Thanks, merged in the master branch.