Closed evamaxfield closed 2 years ago
Merging #3 (21a29e6) into main (fce8cc7) will increase coverage by 9.71%. The diff coverage is 86.25%.
@@ Coverage Diff @@
## main #3 +/- ##
==========================================
+ Coverage 76.47% 86.18% +9.71%
==========================================
Files 5 10 +5
Lines 17 275 +258
==========================================
+ Hits 13 237 +224
- Misses 4 38 +34
Impacted Files | Coverage Δ |
---|---|
cdp_data/keywords.py | 79.85% <79.85%> (ø) |
cdp_data/datasets.py | 88.15% <88.15%> (ø) |
cdp_data/constants.py | 100.00% <100.00%> (ø) |
cdp_data/tests/test_keywords.py | 100.00% <100.00%> (ø) |
cdp_data/utils/__init__.py | 100.00% <100.00%> (ø) |
cdp_data/utils/db_utils.py | 100.00% <100.00%> (ø) |
cdp_data/utils/fs_utils.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
A tiny bit of work to do on documentation but I think I am "done" on the core ngram trends code!! :tada:
Here is the "relevancy history" i.e. the tfidf values over time -- this is useful to see when a keyword is maximally relevant (scores better than normal against itself)
But here is the actual ngram "usage history" i.e. that keyword's share of all keywords used that day, as a percent -- this is useful to compare keywords against each other
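A minimal sketch of how that usage percentage could be computed, assuming the input is a flat list of (day, ngram) pairs. The function name `daily_usage_percent` and the sample records are hypothetical, made up for illustration, and are not the cdp-data API.

```python
from collections import Counter
from datetime import date


def daily_usage_percent(records, keyword):
    """Return {day: percent of that day's ngrams equal to `keyword`}.

    `records` is an iterable of (day, ngram) pairs (a hypothetical input shape).
    """
    totals = Counter()  # all ngrams seen per day
    hits = Counter()    # occurrences of `keyword` per day
    for day, ngram in records:
        totals[day] += 1
        if ngram == keyword:
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in sorted(totals)}


# Made-up sample data: four ngrams on the first day, two on the second.
records = [
    (date(2021, 1, 1), "police"),
    (date(2021, 1, 1), "housing"),
    (date(2021, 1, 1), "housing"),
    (date(2021, 1, 1), "budget"),
    (date(2021, 1, 2), "housing"),
    (date(2021, 1, 2), "police"),
]

housing_usage = daily_usage_percent(records, "housing")
police_usage = daily_usage_percent(records, "police")
```

Because every keyword is divided by the same daily total, the resulting series share one scale and can be plotted against each other.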
The takeaway here is: glad Kayla and Stephen helped me realize that we can't and shouldn't use the tfidf values because we can't compare different plots against each other using those values.
For example, in the first visualization (tfidf values on the y-axis) you might naively think that "police" as a keyword is used more than "transportation" or "housing", but really it's just that the meetings most relevant to a search for "police" are from a while ago (the higher scoring meetings, i.e. the most relevant meetings, are from last year) -- while there isn't really a difference in relevance for "housing" by date (housing meetings are generally at the same level of relevance / discussion, and usage is somewhat constant).
If we want to see the actual usage, the second plot shows us that the use of the word "police" / "policing" is going slightly down over time (which makes sense and tracks with the tfidf plot), but in comparison to the "housing" / "house" / "houses" plot, housing is generally discussed more than policing. And transportation is rarely discussed at all (probably because Pederson rarely holds a transportation committee meeting).
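One way to see why tfidf values don't compare across keywords: the idf factor is computed per term, so every keyword's curve sits on its own scale. The sketch below uses the textbook unsmoothed tf-idf formula and a made-up toy corpus. The weighting actually used in the CDP index may differ, and `tf_idf` is a hypothetical helper, not part of cdp-data.

```python
import math


def tf_idf(term, doc, corpus):
    """Textbook tf-idf: (term's share of doc) * log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # number of docs containing term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Toy corpus (assumption): each list stands in for one meeting's keywords.
corpus = [
    ["housing", "housing", "police", "budget"],
    ["housing", "budget", "zoning", "housing"],
    ["housing", "parks", "budget", "housing"],
    ["housing", "levy", "housing", "budget"],
]

first_meeting = corpus[0]
housing_score = tf_idf("housing", first_meeting, corpus)
police_score = tf_idf("police", first_meeting, corpus)
print(housing_score, police_score)
```

Here "housing" is used twice as often as "police" in the first meeting yet scores lower, because it appears in every meeting and its idf collapses to zero. That is the sense in which a tfidf curve only says when a keyword scores well against itself.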
And so begins the analysis process for a CDP paper. First up a quick and easy ngram history retrieval function and plotting notebook.
Notebook render: https://github.com/CouncilDataProject/cdp-data/blob/feature/n-gram-trends/notebooks/plot_keyword_history.ipynb
The image / final plot render isn't showing up for me on the branch view so here is the actual produced image:
You can give it a go if you would like. Simply install with
pip install -e .[plot]
or
pip install -e .[dev]
then open up the notebook.
Tagging @kristopher-smith because I think you may have interesting comments on this, but no pressure if not. I'm just going to keep chugging away on random stuff.