baileyb0t opened 3 months ago
We want to ignore field headers that are likely to congest the common string detection, so my thinking was to combine `allegations` (the cleaned string that follows "SUMMARY OF ALLEGATION(S)") with the text from `findings_of_fact`.
Both are technically written by the DPA, but the `allegations` reiterate the complainant's narrative, broken out into individual allegations of misconduct. The `findings_of_fact` discuss the evidence collected and the interviews conducted by the DPA for a given allegation.

I think the `findings_of_fact` are more likely to be formulaic and contain common phrases (enough that we already have a few indicators set up to detect them, e.g. `jlp` for "justified, lawful, and proper"), but they may be important to consider when the summary of allegations is extremely brief.
Open to suggestions! Maybe it's more reasonable to use just the `allegations`, since that text is more closely connected to the language of the original complaint? Zac was looking for ways to improve the language of submitted complaints based on what has historically received more traction in terms of sustained allegations.
May also be worth noting here that when allegations are added by the DPA (or OCC, since we also have those older records), these tended to be sustained more often due to the nature of how they come to exist.
It might be best to exclude these if our interest is in features of the original complaint/allegation language, though I'm not sure, and I'll process them all together for now.
My initial thought is that we should treat allegations and findings of fact separately rather than combining them. Although both are written by the DPA, in terms of strategizing how to word the complaints we submit in the future, we'll want to focus on the language of the allegations. That said, it may be that language in the findings points to specific ways of presenting important evidence, which we could adopt when we submit complaints. I just think it's worth trying with the allegations alone for this purpose (in addition to the combined version, which we can also set up; the code should not be much different).
Ideally, we structure this as a classification problem, where we classify whether the complaint is sustained or not. And we use extracted keyphrases as features, along with: type of alleged misconduct, OCC vs. DPA, and whether or not the complaint was original or added by the agency. Those are the three major features that I imagine we'd want to be able to "control" for in some way (maybe there are others?). From there we can examine variable importance for specific keyphrases, and given your observation here, we should stratify those by whether the complaint was added or not (and perhaps focus on DPA vs OCC). In addition to auto-extracted keyphrases, make sure we include those that Zac has already come up with as features.
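The feature setup described above could be sketched like this. The record schema, field names, and keyphrase list here are all hypothetical stand-ins (not the real data), and the "importance" measure is just a per-feature sustained rate, a crude first pass before fitting an actual classifier:

```python
from collections import defaultdict

# Hypothetical records: the field names and values are assumptions, not the real schema.
records = [
    {"allegation_text": "officer used unnecessary force during stop",
     "category_of_conduct": "UF", "report_type": "DPA", "agency_added": False, "sustained": True},
    {"allegation_text": "officer behaved inappropriately and made comment",
     "category_of_conduct": "CRD", "report_type": "OCC", "agency_added": True, "sustained": True},
    {"allegation_text": "officer failed to take required action",
     "category_of_conduct": "ND", "report_type": "DPA", "agency_added": False, "sustained": False},
]

# Stand-ins for the auto-extracted keyphrases plus Zac's hand-picked phrases.
KEYPHRASES = ["unnecessary force", "failed to take"]

def featurize(rec):
    """Keyphrase presence plus the three categorical controls as one feature dict."""
    feats = {f"kp:{kp}": kp in rec["allegation_text"] for kp in KEYPHRASES}
    feats[f"cat:{rec['category_of_conduct']}"] = True
    feats[f"agency:{rec['report_type']}"] = True
    feats["added"] = rec["agency_added"]
    return feats

def sustained_rates(records):
    """Sustained rate among records where each feature is 'on' -- a crude
    importance proxy; stratify the input by `agency_added` to get the split
    discussed above."""
    hits = defaultdict(lambda: [0, 0])  # feature -> [sustained count, total]
    for rec in records:
        for name, on in featurize(rec).items():
            if on:
                hits[name][0] += rec["sustained"]
                hits[name][1] += 1
    return {name: s / n for name, (s, n) in hits.items() if n}
```

A real model would replace `sustained_rates` with a fitted classifier and its variable importances, but the feature assembly would look much the same.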
That makes sense, I'll work with just the `allegations` text for now.
I did remember the goal of making it a classification problem, and I agree that `category_of_conduct`, `report_type`, and `dpa_added|occ_added` plus the extracted keyphrases should be the starting set of features.
I'm fiddling with the model parameters to make sure the extracted keyphrases are useful (the top two results are consistently "officer" and "complainant"), and I'll go back to confirm which of the existing indicator phrases were suggested by Zac so we can pull those in.
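Since "officer" and "complainant" appear in nearly every allegation, one option besides tuning model parameters is a domain stoplist applied after extraction. A minimal sketch, assuming extraction returns (phrase, score) pairs in the shape `pke`'s `get_n_best()` produces; the stoplist contents are illustrative:

```python
# Terms that appear in nearly every document carry no signal for
# sustained vs. not sustained; this hand-picked stoplist is an assumption.
DOMAIN_STOPWORDS = {"officer", "complainant", "dpa", "occ"}

def filter_keyphrases(scored_phrases):
    """Drop candidate phrases made up entirely of domain stopwords.

    `scored_phrases` is a list of (phrase, score) pairs.
    """
    return [(p, s) for p, s in scored_phrases
            if not set(p.lower().split()) <= DOMAIN_STOPWORDS]
```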
Thanks for the feedback!
Zac asked last week if we could check whether there are common phrases present in the allegations that contribute to whether an allegation is sustained.
There's an open source Python library, `pke`, that could be useful for identifying common phrases. I took a pass at it with the `TopicRank` extractor but found that we may be incorrectly separating allegations that trail onto the next page, so I'll need to fix that issue before continuing to process phrases from the allegations.
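Until the page-break issue is fixed, a rough stand-in for `pke` (plain n-gram counting, not `pke`'s actual API) can sanity-check which phrases recur across allegations; this sketch assumes the allegation texts are already available as a list of strings:

```python
import re
from collections import Counter

def common_ngrams(texts, n=2, min_count=2):
    """Count word n-grams across documents -- a crude proxy for keyphrase
    extraction, useful for eyeballing common phrases before pke is wired up."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(" ".join(gram), c) for gram, c in counts.most_common() if c >= min_count]
```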