Closed prsteele closed 1 year ago
> Let's move the `maple` folder into the `analysis` folder
I'm fine putting the code anywhere, although having a top-level entrypoint for whatever Python we end up using is appealing to me.
> and give it a more specific name.
I intended this to be the root of any Python written for this project, so in that sense `maple` is about as specific as we could get. Do you have an idea for alternatives?
(Having done precisely zero empirical digging in this project...)
Do we have opinions about commit messages structure, history structure, and so on? (E.g. are we cool with lots of small commits, or do we like squashing?)
Lots of small commits is good. We only squash if the messages are mostly related to formatting/noisy.
A single Python source root makes sense. In that case we should move the existing Python code in `analysis` into `maple`. Can the notebooks live in there, or should `maple` just be Python modules?
Notebooks can live alongside it, or inside it. Notebooks aren't generally importable, but they'll be able to import the library code.
Visit the preview URL for this PR (updated for commit prsteele/maple@6b0ec5b):
https://digital-testimony-dev--pr871-psteele-analysis-irt5bzq0.web.app
(expires Fri, 30 Dec 2022 21:33:23 GMT)
🔥 via Firebase Hosting GitHub Action 🌎
Sign: bc0858669d4997df2a9165c2144bd1e2dbba0242
We want to leave this branch here in case anyone wants to fix it up, but we're closing it for now to avoid clutter.
This PR creates a basic framework for classifying action types; this is a precursor to classifying bill statuses based on a sequence of actions.
As a quick demonstration, we can run

to generate a SQLite3 database file (`bills.db`) containing bills and their associated actions. We can then run

to classify each action in each bill; the results will be populated in `predictions.csv`. The first few lines of this file might look like

The first column is an internal database identifier. The second column is the plain-text description of an action. The third column is the predicted action type. The fourth column (in our example, all nulls) is the user-provided label for the action type.
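The shape of the demo data can be sketched as follows. The schema, table names, and the `hearing-scheduled` type below are assumptions for illustration, not necessarily what this PR actually uses.

```python
import sqlite3

# Hypothetical schema for the demo database; the real bills.db schema
# in this PR may differ.
conn = sqlite3.connect(":memory:")  # use "bills.db" for a real file
conn.executescript(
    """
    CREATE TABLE bills (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE actions (
        id INTEGER PRIMARY KEY,
        bill_id INTEGER REFERENCES bills (id),
        description TEXT
    );
    """
)
conn.execute("INSERT INTO bills VALUES (1, 'An example bill')")
conn.execute(
    "INSERT INTO actions VALUES (1, 1, 'Hearing scheduled for 01/05/2023')"
)
conn.commit()

# The four predictions.csv columns described above: database id,
# action description, predicted type, user-provided label (null at first).
for action_id, desc in conn.execute("SELECT id, description FROM actions"):
    print(action_id, desc, "hearing-scheduled", None)
```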
When we ran this command, we got some summary information about how accurate our algorithm is in classifying action types based on the user labels:
Since we haven't yet provided any labels, we don't get much feedback. Let's edit the `predictions.csv` file (with, say, LibreOffice Calc) to provide some labels:

We can now save these labels with
and when we run `predict-regex` again, we get richer results:

We can now see that we've failed to properly label a "hearing scheduled" action as such. (This is due to the simple regex patterns we use not understanding the word "rescheduled", as opposed to "scheduled".) This suggests that we should improve our patterns to correctly produce these labels.
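The "rescheduled" failure mode can be reproduced with a minimal regex classifier. The patterns and type names below are illustrative, not the ones this PR ships:

```python
import re

# Illustrative patterns mapping action types to regexes; the actual
# patterns used in this PR may differ.
PATTERNS = {
    "hearing-scheduled": re.compile(r"\bhearing scheduled\b", re.IGNORECASE),
    "referred-to-committee": re.compile(
        r"\breferred to\b.*\bcommittee\b", re.IGNORECASE
    ),
}


def classify(description):
    """Return the first matching action type, or None if nothing matches."""
    for action_type, pattern in PATTERNS.items():
        if pattern.search(description):
            return action_type
    return None


print(classify("Hearing scheduled for 01/05/2023"))   # hearing-scheduled
print(classify("Hearing rescheduled to 02/01/2023"))  # None: no "scheduled" token
```

The second call fails because `\bhearing scheduled\b` does not match "Hearing rescheduled"; handling that word would require a broader pattern.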
In summary, this provides a feedback loop: improvements to the matching algorithm (or any new, more interesting algorithm) can have their accuracy measured against user labels, and our understanding of the problem can be improved by providing new labels.
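The accuracy measurement in that loop could be sketched like this, assuming hypothetical `(predicted, label)` pairs rather than the PR's actual data layout:

```python
def accuracy(rows):
    """Accuracy of predictions over the rows that have a user label.

    Each row is a (predicted, label) pair; label is None when the user
    has not labeled that action yet. Row shape is hypothetical here.
    """
    labeled = [(pred, label) for pred, label in rows if label is not None]
    if not labeled:
        return None  # no feedback until the user provides labels
    correct = sum(1 for pred, label in labeled if pred == label)
    return correct / len(labeled)


rows = [
    ("hearing-scheduled", "hearing-scheduled"),  # correct prediction
    ("hearing-scheduled", None),                 # unlabeled: ignored
    (None, "hearing-scheduled"),                 # missed "rescheduled" case
]
print(accuracy(rows))  # 0.5
```

Unlabeled rows are excluded from the denominator, so accuracy only reflects the actions a user has actually checked.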