codeforboston / maple

MAPLE makes it easy for anyone to view and submit testimony to the Massachusetts Legislature about the bills that will shape our future.
https://mapletestimony.org
MIT License

WIP: Create a basic classification framework #871

Closed prsteele closed 1 year ago

prsteele commented 1 year ago

This PR creates a basic framework for classifying action types; this is a precursor to classifying bill statuses based on a sequence of actions.
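For a rough sense of the shape of the framework, a classifier can be pictured as a function from an action's plain-text description to a predicted action type. The names below are illustrative only, not the actual maple API:

```python
from typing import Protocol

class ActionClassifier(Protocol):
    """Illustrative interface; not the actual maple API."""

    def classify(self, action: str) -> str:
        """Map an action description to a predicted action type,
        e.g. "Senate concurred" -> "concurred".
        """
        ...
```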

As a quick demonstration, we can run

python -m maple load-bills --db-path bills.db --bills-file analysis/data/all-history-actions.csv

to generate a SQLite3 database file (bills.db) containing bills and their associated actions. We can then run

python -m maple predict-regex --db-path bills.db --predictions-file predictions.csv

to classify each action in each bill; the results are written to predictions.csv. The first few lines of this file might look like:

action_id,action,prediction,label
1,Referred to the committee on House Ways and Means,referred,
2,"Reported, in part, by H4000",uncategorized,
3,Referred to the committee on Public Service,referred,
4,Senate concurred,concurred,
5,Hearing scheduled for 07/28/2021 from 01:00 PM-04:00 PM in Virtual Hearing,hearing_scheduled,
6,Hearing rescheduled to 07/28/2021 from 09:30 AM-12:00 PM in Virtual Hearing,uncategorized,
7,Bill reported favorably by committee and referred to the committee on House Ways and Means,referred,
8,Referred to the committee on The Judiciary,referred,
9,Senate concurred,concurred,

The first column is an internal database identifier. The second column is the plain-text description of an action. The third column is the predicted action type. The fourth column (all empty in our example) is the user-provided label for the action type.
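If you want to poke at the intermediate database directly, the standard-library sqlite3 module is enough. This snippet makes no assumptions about maple's schema; it just lists whatever tables load-bills created:

```python
import sqlite3

# List the tables created by load-bills; querying sqlite_master means
# we don't need to know maple's schema in advance.
with sqlite3.connect("bills.db") as conn:
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ):
        print(name)
```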

When we ran predict-regex, we also got summary information about how accurately our algorithm classifies action types, measured against the user labels:

0 correct predictions, 0 incorrect predictions

Prediction counts:
prediction
referred              10418
concurred              9644
uncategorized          9437
hearing_scheduled      7279
date_change            2612
study_order            1899
reference              1674
rules_note             1405
engrossment             775
signed_by_governor      246
cancellation             83
dtype: int64
                 label  labeled  correct  incorrect  precision  recall  f1
0        uncategorized        0        0          0        0.0       1   0
1   signed_by_governor        0        0          0        0.0       1   0
2          engrossment        0        0          0        0.0       1   0
3    hearing_scheduled        0        0          0        0.0       1   0
4          study_order        0        0          0        0.0       1   0
5             referred        0        0          0        0.0       1   0
6            concurred        0        0          0        0.0       1   0
7            reference        0        0          0        0.0       1   0
8          date_change        0        0          0        0.0       1   0
9              reading        0        0          0        0.0       1   0
10        cancellation        0        0          0        0.0       1   0
11          rules_note        0        0          0        0.0       1   0

Since we haven't yet provided any labels, we don't get much feedback. Let's edit the predictions.csv file (with, say, LibreOffice Calc) to provide some labels:

action_id,action,prediction,label
1,Referred to the committee on House Ways and Means,referred,referred
2,"Reported, in part, by H4000",uncategorized,
3,Referred to the committee on Public Service,referred,referred
4,Senate concurred,concurred,concurred
5,Hearing scheduled for 07/28/2021 from 01:00 PM-04:00 PM in Virtual Hearing,hearing_scheduled,hearing_scheduled
6,Hearing rescheduled to 07/28/2021 from 09:30 AM-12:00 PM in Virtual Hearing,uncategorized,hearing_scheduled
7,Bill reported favorably by committee and referred to the committee on House Ways and Means,referred,referred
8,Referred to the committee on The Judiciary,referred,referred
9,Senate concurred,concurred,
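Editing by hand works fine, but bulk labeling can also be done programmatically. A sketch using pandas, assuming only the four columns shown above, which accepts the prediction as the label for every "Senate concurred" action:

```python
import pandas as pd

predictions = pd.read_csv("predictions.csv")

# Accept the predicted type as the label for every "Senate concurred" action.
mask = predictions["action"] == "Senate concurred"
predictions.loc[mask, "label"] = "concurred"

predictions.to_csv("predictions.csv", index=False)
```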

We can now save these labels with

python -m maple label --db-path bills.db --labels-file predictions.csv

and when we run predict-regex again, we get richer results:

6 correct predictions, 1 incorrect predictions

Prediction counts:
prediction
referred              10418
concurred              9644
uncategorized          9437
hearing_scheduled      7279
date_change            2612
study_order            1899
reference              1674
rules_note             1405
engrossment             775
signed_by_governor      246
cancellation             83
dtype: int64
                 label  labeled  correct  incorrect  precision  recall        f1
0        uncategorized        0        0          1        0.0     1.0  0.000000
1   signed_by_governor        0        0          0        0.0     1.0  0.000000
2          engrossment        0        0          0        0.0     1.0  0.000000
3    hearing_scheduled        2        1          0        1.0     0.5  0.333333
4          study_order        0        0          0        0.0     1.0  0.000000
5             referred        4        4          0        1.0     1.0  0.500000
6            concurred        1        1          0        1.0     1.0  0.500000
7            reference        0        0          0        0.0     1.0  0.000000
8          date_change        0        0          0        0.0     1.0  0.000000
9              reading        0        0          0        0.0     1.0  0.000000
10        cancellation        0        0          0        0.0     1.0  0.000000
11          rules_note        0        0          0        0.0     1.0  0.000000

We can now see that we've failed to classify a rescheduled hearing as a hearing_scheduled action. (Our simple regex patterns recognize the word "scheduled" but not "rescheduled".) This suggests we should improve our patterns to handle this case.
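A minimal fix, sketched here under the assumption that the patterns are ordinary re patterns (the actual patterns live in the maple package), is to accept an optional "re" prefix:

```python
import re

# Match both "Hearing scheduled ..." and "Hearing rescheduled ...".
HEARING_SCHEDULED = re.compile(r"Hearing (?:re)?scheduled")

assert HEARING_SCHEDULED.match("Hearing scheduled for 07/28/2021 from 01:00 PM-04:00 PM")
assert HEARING_SCHEDULED.match("Hearing rescheduled to 07/28/2021 from 09:30 AM-12:00 PM")
```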

In summary, this PR provides a feedback loop: improvements to the matching algorithm (or any new, more sophisticated algorithm) can have their accuracy measured against user labels, and providing new labels improves our understanding of the problem.
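For concreteness, the per-label metrics above can be computed from the labeled rows alone. A sketch using the textbook definitions (not necessarily the exact code predict-regex runs):

```python
import pandas as pd

def label_metrics(df: pd.DataFrame, label: str) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one action type, using labeled rows only."""
    labeled = df[df["label"].notna()]
    predicted = labeled["prediction"] == label
    actual = labeled["label"] == label
    tp = (predicted & actual).sum()
    fp = (predicted & ~actual).sum()
    fn = (~predicted & actual).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```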

prsteele commented 1 year ago

> Let's move the maple folder into the analysis folder

I'm fine putting the code anywhere, although having a top-level entrypoint for whatever Python we end up using is appealing to me.

> and give it a more specific name.

I intended this to be the root of any Python written for this project, so in that sense maple is about as specific as we could get. Do you have an idea for alternatives?

prsteele commented 1 year ago

(Having done precisely zero empirical digging in this project...)

Do we have opinions about commit messages structure, history structure, and so on? (E.g. are we cool with lots of small commits, or do we like squashing?)

alexjball commented 1 year ago

Lots of small commits are good. We only squash if the messages are mostly related to formatting or otherwise noisy.

A single Python source root makes sense. In that case we should move the existing Python code in analysis into maple. Can the notebooks live in there, or should maple contain just Python modules?

prsteele commented 1 year ago

> Notebooks

Notebooks can live alongside it, or inside it. Notebooks aren't generally importable, but they'll be able to import the library code.
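For example, once the package is installed (or on sys.path), a notebook's first cell can simply do:

```python
# From any notebook in the repo, the library code imports normally,
# assuming the maple package is installed or on sys.path.
import maple
```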

github-actions[bot] commented 1 year ago

Visit the preview URL for this PR (updated for commit prsteele/maple@6b0ec5b):

https://digital-testimony-dev--pr871-psteele-analysis-irt5bzq0.web.app

(expires Fri, 30 Dec 2022 21:33:23 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: bc0858669d4997df2a9165c2144bd1e2dbba0242

tommagnusson commented 1 year ago

We want to leave this branch here in case anyone wants to fix it up, but we're closing for now to avoid clutter.