Better handling of attempted charges

jeancochrane commented 4 years ago

There are two primary ways in which attempted charges are represented in our messy data:

1) An (att) suffix is appended to the charge, like 430-6/5-2a1 (att)
2) A separate charge code, 720-5/8-4, is prepended to the charge, like 720-5/8-4 720-570/402

Pattern 1) is relatively simple to detect, and we have a dedicated label for it in the ILCS parser. Pattern 2) is harder to detect since it comes out looking like two entirely separate charges. Having two different patterns also makes it difficult to match against the canonical set, since that set can only hold one representation of an attempted charge.

I can think of two ways to approach this:

1) Data preprocessing: Check for the code 720-5/8-4 in the messy data and replace it with the (att) suffix. This will work well for cases of pattern 2) where there are two separate charges, but unfortunately sometimes the attempted code is the only recorded charge, and I'm not sure how those cases will behave. It'll also be tricky because the instances of 720-5/8-4 are not all formatted the same way.
2) Update the parser to catch the charge prefix: Try to get the parser to parse 720-5/8-4 as a separate label indicating an attempted charge. This is potentially the most semantically correct way to handle things but also seems difficult from a parsing perspective.
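
For reference, here is a rough sketch of what the data preprocessing option might look like. The regex, the function name `move_attempt_to_suffix`, and the paren cleanup are all assumptions for illustration, not code from the parser. It leaves the string alone when the attempt code is the only recorded charge, since it's unclear how those cases should behave.

```python
import re

# Assumed, deliberately loose spelling of the attempt code 720-5/8-4,
# allowing spaces, hyphens, or slashes between the parts and an optional
# trailing subsection such as "a" or "(a)".
ATTEMPT_CODE = re.compile(r"^\s*720[\s\-/]*5[\s\-/]+8-4(?!\d)(?:\(a\)|-?a)?")


def move_attempt_to_suffix(charge: str) -> str:
    """Rewrite a leading attempt code as an "(att)" suffix (hypothetical helper)."""
    match = ATTEMPT_CODE.match(charge)
    if match is None:
        return charge

    rest = charge[match.end():]
    # Drop an existing "(att)" suffix so it is not doubled later.
    rest = re.sub(r"\s*\(att\)\s*$", "", rest)

    # If nothing that looks like a second charge follows, the attempt code
    # is the only recorded charge: leave the original string untouched.
    if not re.search(r"\d", rest):
        return charge

    rest = rest.lstrip(" /")
    if rest.startswith("("):
        # The second charge was wrapped in parentheses; unwrap it and drop
        # the closing paren that belonged to the wrapper, if there is one.
        rest = rest[1:]
        if rest.endswith(")") and rest.count(")") > rest.count("("):
            rest = rest[:-1]

    return f"{rest.strip()} (att)"


print(move_attempt_to_suffix("720-5/8-4 720-570/402"))
print(move_attempt_to_suffix("720-5/8-4(720-5/17-3(a)(2))"))
print(move_attempt_to_suffix("720 5/8-4(a)"))
```

Note that inputs like 720 5/8-4 17-3 come out as 17-3 (att), with the chapter and act dropped from the second charge, so this alone would not be enough to normalize the messy data.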

What do you think @fgregg?

fgregg commented 4 years ago

I'd recommend the second approach. You should be able to do it with something like a lookahead or lookbehind in the regex: https://www.regular-expressions.info/lookaround.html
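
A minimal sketch of the lookahead idea, assuming the attempt code is written exactly as 720-5/8-4 (the pattern and the name `has_attempt_prefix` are made up for illustration): the code only counts as a prefix when a separator and the start of another charge follow it.

```python
import re

# Hypothetical pattern: match the attempt code at the start of the string,
# but only when it is followed by a separator and then a digit, i.e. when
# another charge comes after it. The lookahead does not consume that charge.
ATTEMPT_PREFIX = re.compile(r"^720-5/8-4(?=[\s(/]+\d)")


def has_attempt_prefix(charge: str) -> bool:
    return ATTEMPT_PREFIX.search(charge) is not None


for charge in [
    "720-5/8-4 720-570/402",        # prefix followed by a second charge
    "720-5/8-4(720-5/17-3(a)(2))",  # second charge wrapped in parentheses
    "720-5/8-4",                    # attempt code on its own
    "720 5/8-4 17-3",               # differently formatted attempt code
]:
    print(charge, has_attempt_prefix(charge))
```

The last input is not matched because the literal 720-5/8-4 in the pattern is strict about separators, so a real pattern would need to tolerate more formatting variation.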

jeancochrane commented 4 years ago

Gotcha. A regex approach to looking for the charge prefix is going to be tough because the data are so messy. Here's a quick sample using "attempted" matches from our training data:

messy	clean
720-5/8-4a (att)	720-5/8-4 720-570/402
720-5/16-16-a (att)	720-5/8-4 720-5/16-1
720ilcs570/401(a)(2)(a) (att)	720-5/8-4 720-570/401
720-5/8-4(720-5/17-3(a)(2))	720-5/8-4 720-5/17-3
720/5 8-4a	720-5/8-4 720-5/16-25
720-5/8-4(720-5/9-1a	720-5/8-4 720-5/9-1
720-8-4(720-5/18-1(a)	720-5/8-4 720-5/18-1(a)
720-5/8-4/17-3	720-5/8-4 720-5/17-3
720 5/8-4 17-3	720-5/8-4 720-5/17-3
20-5/8-4-a//720-5/18-1	720-5/8-4 720-5/18-1(a)
720 5/8-4 18-2a4	720-5/8-4 720-5/18-2
720 - 5/8-4(19-3(a)) (att)	720-5/8-4 720-5/19-3
720 8-4(a) 5 16a-3(a)	720-5/8-4 720-5/16-25
720-5/8-4(720 5/19-3a)	720-5/8-4 720-5/19-3
720 5/8-4(a)	720-5/8-4 720-5/31-4

I'm going to give it a try and see if I can write a lookahead that will match all these cases.

jeancochrane commented 4 years ago

Not much luck with the regex so far. Since the attempted prefix charges are so messy, another idea I had was to add new labels for them, e.g. AttemptedChapter and AttemptedActPrefix. That way we wouldn't have to write explicit rules for the attempted prefix and could instead try to get the model to learn the pattern.

However, I did a quick spike to add these ~15 patterns as training data (to about 180 existing pairs) with new Attempted labels and it didn't change much. It may just not be a good balance of training data, but I'm wondering if maybe it's because the features aren't rich enough to capture the pattern. In particular, the most important feature in my mind is the number of tokens, since a string with more than 6 tokens is dramatically more likely to have its first tokens be an attempted prefix charge. Is there a way to capture that kind of string-level (as opposed to token-level) feature in the parserator idiom @fgregg?

fgregg commented 4 years ago

Let me ponder this.

jeancochrane commented 4 years ago

I decided to experiment with adding a feature to each token representing the length of the full token sequence in https://github.com/datamade/ilcs-parser/pull/3/commits/7a0e5caa8cfec16518d07078f14700708fbf4629. It seemed to work properly on the first couple of patterns I threw at it, as demonstrated in the new test in https://github.com/datamade/ilcs-parser/pull/3/commits/eb223bcf776fa028e3f529e5042272b013ac83f3. I'm going to try plugging this into the deduplication pipeline and see how it affects performance.
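
To illustrate the idea (this is not the code from the linked commits), a sequence-length feature can be copied onto every token's feature dict so that a token-level model sees a string-level signal. The function names, the per-token features, the whitespace tokenization, and the feature keys below are all assumptions; only the "more than 6 tokens" threshold comes from the discussion above.

```python
def token_features(token: str) -> dict:
    """Hypothetical per-token features; the real parser's set may differ."""
    return {
        "length": len(token),
        "digits_only": token.isdigit(),
        "has_slash": "/" in token,
    }


def tokens_to_features(tokens: list) -> list:
    """Build one feature dict per token, each carrying sequence-level info."""
    sequence_length = len(tokens)
    features = []
    for token in tokens:
        token_feats = token_features(token)
        # String-level features, repeated on every token in the sequence.
        token_feats["sequence.length"] = sequence_length
        token_feats["sequence.long"] = sequence_length > 6
        features.append(token_feats)
    return features


# Whitespace tokenization is an assumption made for this example only.
tokens = "720 5/8-4 17-3".split()
for feats in tokens_to_features(tokens):
    print(feats)
```

Repeating the same value on every token is redundant, but it keeps the feature inside the per-token feature dicts that a sequence labeler consumes.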

dedupeio / dedupe-variable-ilcs

Better handling of attempted charges #3