DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0
4.3k stars 559 forks source link

[WIP] Add WordCorrelationPlot visualizer #1243

Closed pdeziel closed 2 years ago

pdeziel commented 2 years ago

This PR will fix #148 which requested a WordCorrelationPlot visualizer to allow the user to visualize the correlations between words in the corpus. I came across this issue in the backlog and I can see myself using it so I'd be happy to work on it.

Sample Code and Plot

TODO

TODOs and questions

Still to do:

CHECKLIST

codecov[bot] commented 2 years ago

Codecov Report

Merging #1243 (a2b80a5) into develop (7e238e1) will increase coverage by 0.13%. The diff coverage is 98.83%.

@@             Coverage Diff             @@
##           develop    #1243      +/-   ##
===========================================
+ Coverage    90.62%   90.75%   +0.13%     
===========================================
  Files           92       93       +1     
  Lines         5214     5300      +86     
===========================================
+ Hits          4725     4810      +85     
- Misses         489      490       +1     
Impacted Files Coverage Δ
yellowbrick/text/correlation.py 98.82% <98.82%> (ø)
yellowbrick/text/__init__.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 7e238e1...a2b80a5. Read the comment docs.

lwgray commented 2 years ago

Thanks for opening a PR. Please move forward with the development of this feature. All the best

lwgray commented 2 years ago

@pdeziel bump Is there anything we can do to help with your PR?

pdeziel commented 2 years ago

@pdeziel bump Is there anything we can do to help with your PR?

Hi @lwgray, I think I have most of the functionality completed, although I still need to clean up the code comments, add documentation, etc.

There is one visual bug I was working on tracking down which I might need help with; the color mapping on the visualization doesn't seem correct (e.g., 0.29 in the below chart should be "bluer" than 0.16). This might have to do with pcolormesh expecting a value in the range [0,1] but the correlation value range is from [-1,1]. I would also ideally like to have the red and blue values swapped in the default color map (so something like BuYlRd) but that color map doesn't currently exist in YB.

correlation

lwgray commented 2 years ago

@pdeziel Awesome work... Maybe you can internal convert it to be between 0 and 1 while still display -1 to 1. Maybe @bbengfort or @rebeccabilbro can chime in on this. I guess in the meantime you can work on the docs. Also, make sure you update your branch.

bbengfort commented 2 years ago

@pdeziel @lwgray I'm wondering if the colors bug is a mismatch of the labels and the colors; e.g. a matrix got inverted somewhere incorrectly. If you look at (Tatsumi Kimishima, Nintendo) = .29, c=yellow; and (woman, play) = 0.04 c=blue, these cells are inverse of each other on the matrix but seem to have the other cell's color.

I'd look at the wc_display flip on L174 as a start, then check the top left and top right positions in X and Y.

pdeziel commented 2 years ago

@lwgray @bbengfort I've fixed the color mapping issue and added the documentation, so I think is ready for review again. There seems to be one CI error, but it's in a different module so I haven't figured out how it's related to my changes yet.

One thing I think would be useful to add in the future is a parameter where the user would only have to supply one word, and the visualizer finds the top correlated words in the corpus.