CouncilDataProject / cdp-data

Data Utilities and Processing Generalized for All CDP Instances
https://councildataproject.org/cdp-data
MIT License

feature/n-gram-trends #3

Closed evamaxfield closed 2 years ago

evamaxfield commented 2 years ago

And so begins the analysis process for a CDP paper. First up a quick and easy ngram history retrieval function and plotting notebook.
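The core idea of an n-gram "usage history" can be sketched in a few lines. The function name and toy transcript data below are hypothetical stand-ins (this is not the actual cdp-data API), just to illustrate computing a per-day usage fraction for a keyword:

```python
from collections import Counter
from datetime import date

# Toy stand-in for per-meeting transcripts; the real code pulls these
# from a CDP instance's infrastructure.
transcripts = {
    date(2021, 1, 4): "the police budget and the housing levy",
    date(2021, 1, 11): "housing housing transportation police",
}

def ngram_usage_history(transcripts, keyword):
    """Fraction of all words per day matching the keyword (unigram case)."""
    history = {}
    for day, text in transcripts.items():
        words = text.lower().split()
        # Usage fraction: occurrences of the keyword / total words that day.
        history[day] = Counter(words)[keyword] / len(words)
    return history

history = ngram_usage_history(transcripts, "housing")
```

The real implementation would tokenize and stem transcripts rather than naively splitting on whitespace, but the per-day fraction is the essential shape of the data being plotted.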

Notebook render: https://github.com/CouncilDataProject/cdp-data/blob/feature/n-gram-trends/notebooks/plot_keyword_history.ipynb

The image / final plot render isn't showing up for me on the branch view, so here is the actual produced image: [visualization: keyword history plot]

You can give it a go if you would like: simply install with `pip install -e .[plot]` or `pip install -e .[dev]`, then open up the notebook.

Tagging @kristopher-smith because I think you may have interesting comments on this, but no pressure if not; I'm just going to keep chugging away on random stuff.

codecov[bot] commented 2 years ago

Codecov Report

Merging #3 (21a29e6) into main (fce8cc7) will increase coverage by 9.71%. The diff coverage is 86.25%.

@@            Coverage Diff             @@
##             main       #3      +/-   ##
==========================================
+ Coverage   76.47%   86.18%   +9.71%     
==========================================
  Files           5       10       +5     
  Lines          17      275     +258     
==========================================
+ Hits           13      237     +224     
- Misses          4       38      +34     
| Impacted Files | Coverage Δ |
| --- | --- |
| cdp_data/keywords.py | 79.85% <79.85%> (ø) |
| cdp_data/datasets.py | 88.15% <88.15%> (ø) |
| cdp_data/constants.py | 100.00% <100.00%> (ø) |
| cdp_data/tests/test_keywords.py | 100.00% <100.00%> (ø) |
| cdp_data/utils/__init__.py | 100.00% <100.00%> (ø) |
| cdp_data/utils/db_utils.py | 100.00% <100.00%> (ø) |
| cdp_data/utils/fs_utils.py | 100.00% <100.00%> (ø) |


evamaxfield commented 2 years ago

A tiny bit of work left to do on documentation, but I think I am "done" on the core ngram trends code!! :tada:

Here is the "relevancy history", i.e. the tfidf values over time -- this is useful for seeing when a keyword is maximally relevant (scores better than normal against itself): [visualization: relevancy history plot]

But here is the actual ngram "usage history", i.e. the percent of that keyword out of all keywords used that day -- this is useful for comparing keywords against each other: [visualization: usage history plot]

The takeaway here: I'm glad Kayla and Stephen helped me realize that we can't (and shouldn't) use the tfidf values, because we can't compare different keywords' plots against each other using those values.

For example, in the first visualization (tfidf values on the y-axis), you might naively think that "police" as a keyword is used more than "transportation" or "housing". Really, it's just that the most relevant meetings for a search of "police" happen to be from a while ago (the higher-scoring, i.e. most relevant, meetings are from last year), while there isn't much difference in relevance for "housing" by date (housing meetings sit at a generally constant level of relevance, and usage is somewhat constant).

If we want to see actual usage, the second plot shows it: use of the word "police" / "policing" is trending slightly down over time (which makes sense and tracks with the tfidf plot), but compared against the "housing" / "house" / "houses" plot, housing is generally discussed more than policing. And transportation is rarely discussed at all (probably because Pederson rarely holds a transportation committee meeting).
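To make that takeaway concrete, here is a small self-contained sketch (toy data and a hand-rolled tfidf, not the cdp-data code) showing why tfidf values for different keywords live on different scales, while usage fractions share a single scale:

```python
import math
from collections import Counter

# Toy meeting transcripts, purely illustrative.
docs = [
    "police police budget hearing",
    "housing levy and housing development",
    "housing committee update",
]
docs_words = [d.split() for d in docs]

def tfidf(term, doc_words, all_docs):
    """Classic tf-idf: term frequency times log inverse document frequency."""
    tf = Counter(doc_words)[term] / len(doc_words)
    df = sum(1 for d in all_docs if term in d)  # assumes df > 0
    return tf * math.log(len(all_docs) / df)

# "police" appears in 1 of 3 docs (high idf); "housing" in 2 of 3 (low idf).
# Their tfidf values reflect document frequency, not raw usage, so comparing
# them across keywords is misleading.
police_tfidf = tfidf("police", docs_words[0], docs_words)   # ~0.549
housing_tfidf = tfidf("housing", docs_words[1], docs_words)  # ~0.162

# Usage fractions share one scale: keyword occurrences / total words.
total = sum(len(d) for d in docs_words)
police_usage = sum(d.count("police") for d in docs_words) / total    # 2/12
housing_usage = sum(d.count("housing") for d in docs_words) / total  # 3/12
```

Here tfidf ranks "police" far above "housing", yet by actual usage "housing" is the more discussed term, which is exactly the trap the first plot invites.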