dedupeio / dedupe-variable-ilcs

Dedupe variable for Illinois Compiled Statute (ILCS) codes
MIT License

Better handling of attempted charges #3

Open jeancochrane opened 4 years ago

jeancochrane commented 4 years ago

There are two primary ways in which attempted charges are represented in our messy data:

  1. An (att) suffix is appended to the charge, like 430-6/5-2a1 (att)
  2. A separate charge code, 720-5/8-4, is prepended to the charge, like 720-5/8-4 720-570/402

Pattern 1 is relatively simple to detect, and we have a dedicated label for it in the ILCS parser. Pattern 2 is harder to detect, since it comes out looking like two entirely separate charges. In addition, having two different patterns makes it difficult to match against the canonical set, since the canonical set can only hold one representation of an attempted charge.
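To make the two representations concrete, here is a minimal sketch that tells them apart on cleanly formatted strings. The function name `classify_attempt` and the constants are hypothetical and are not part of the ILCS parser; messy spellings of either pattern would not be caught by this.

```python
import re

# The attempt statute named in this issue.
ATTEMPT_CODE = "720-5/8-4"

# Pattern 1: a trailing "(att)" marker, matched case-insensitively.
ATT_SUFFIX = re.compile(r"\s*\(att\)\s*$", re.IGNORECASE)


def classify_attempt(charge: str) -> str:
    """Return 'suffix', 'prefix', or 'none' for a charge string."""
    charge = charge.strip()
    if ATT_SUFFIX.search(charge):
        return "suffix"
    # Pattern 2: the attempt code, a space, then a second charge.
    head, _, rest = charge.partition(" ")
    if head == ATTEMPT_CODE and rest.strip():
        return "prefix"
    return "none"


print(classify_attempt("430-6/5-2a1 (att)"))
print(classify_attempt("720-5/8-4 720-570/402"))
```

Because the check for pattern 2 relies on an exact match of the first whitespace-delimited token, it only illustrates the distinction and is not a detector for real data.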

There are a few ways I can think to approach this:

What do you think @fgregg?

fgregg commented 4 years ago

I'd recommend the second approach. You should be able to do it with something like a lookahead or lookbehind in the regex. See https://www.regular-expressions.info/lookaround.html
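As a rough illustration of the lookahead idea, the sketch below matches the attempt code only when something that looks like a second charge follows it. The pattern and the helper `strip_attempt_prefix` are assumptions made for illustration, not code from this repository, and they only cover spellings where the prefix is written exactly as `720-5/8-4`.

```python
import re

# Hypothetical pattern: the attempt code, an optional "a" subsection,
# and a lookahead requiring separator(s) followed by a digit.
ATTEMPT_PREFIX = re.compile(
    r"^\s*720-5/8-4(?:\(?a\)?)?"  # the attempt code itself
    r"(?=[\s(/]+\d)"              # lookahead: another charge must follow
)


def strip_attempt_prefix(charge):
    """Return (is_attempt, remainder) for a charge string."""
    match = ATTEMPT_PREFIX.match(charge)
    if match is None:
        return False, charge
    # Drop the separators between the prefix and the underlying charge.
    remainder = charge[match.end():].lstrip(" (/")
    return True, remainder


print(strip_attempt_prefix("720-5/8-4 720-570/402"))
print(strip_attempt_prefix("720-5/8-4a (att)"))
```

The lookahead keeps a lone `720-5/8-4` (or one followed only by `(att)`) from being treated as a prefix, since no digit-led charge follows it. Spellings with different separators or missing characters would need additional alternation.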

jeancochrane commented 4 years ago

Gotcha. A regex approach to looking for the charge prefix is going to be tough because the data are so messy. Here's a quick sample using "attempted" matches from our training data:

| messy | clean |
| --- | --- |
| 720-5/8-4a (att) | 720-5/8-4 720-570/402 |
| 720-5/16-16-a (att) | 720-5/8-4 720-5/16-1 |
| 720ilcs570/401(a)(2)(a) (att) | 720-5/8-4 720-570/401 |
| 720-5/8-4(720-5/17-3(a)(2)) | 720-5/8-4 720-5/17-3 |
| 720/5 8-4a | 720-5/8-4 720-5/16-25 |
| 720-5/8-4(720-5/9-1a | 720-5/8-4 720-5/9-1 |
| 720-8-4(720-5/18-1(a) | 720-5/8-4 720-5/18-1(a) |
| 720-5/8-4/17-3 | 720-5/8-4 720-5/17-3 |
| 720 5/8-4 17-3 | 720-5/8-4 720-5/17-3 |
| 20-5/8-4-a//720-5/18-1 | 720-5/8-4 720-5/18-1(a) |
| 720 5/8-4 18-2a4 | 720-5/8-4 720-5/18-2 |
| 720 - 5/8-4(19-3(a)) (att) | 720-5/8-4 720-5/19-3 |
| 720 8-4(a) 5 16a-3(a) | 720-5/8-4 720-5/16-25 |
| 720-5/8-4(720 5/19-3a) | 720-5/8-4 720-5/19-3 |
| 720 5/8-4(a) | 720-5/8-4 720-5/31-4 |

I'm going to give it a try and see if I can write a lookahead that will match all these cases.

jeancochrane commented 4 years ago

Not much luck with the regex so far. Since the attempted prefix charges are so messy, another idea I had was to add new labels for them, e.g. AttemptedChapter and AttemptedActPrefix. That way we wouldn't have to write explicit rules for the attempted prefix and could instead try to get the model to learn the pattern.

However, I did a quick spike to add these ~15 patterns as training data (to about 180 existing pairs) with new Attempted labels and it didn't change much. It may just not be a good balance of training data, but I'm wondering if maybe it's because the features aren't rich enough to capture the pattern. In particular, the most important feature in my mind is the number of tokens, since a string with more than 6 tokens is dramatically more likely to have its first tokens be an attempted prefix charge. Is there a way to capture that kind of string-level (as opposed to token-level) feature in the parserator idiom @fgregg?

fgregg commented 4 years ago

let me ponder on this.

jeancochrane commented 4 years ago

I decided to experiment with adding a feature to each token representing the length of the full token sequence in https://github.com/datamade/ilcs-parser/pull/3/commits/7a0e5caa8cfec16518d07078f14700708fbf4629. It seemed to work properly on the first couple of patterns I threw at it, as demonstrated in the new test in https://github.com/datamade/ilcs-parser/pull/3/commits/eb223bcf776fa028e3f529e5042272b013ac83f3. I'm going to try plugging this into the deduplication pipeline and see how it affects performance.
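For readers unfamiliar with the idea, the following is a minimal sketch of attaching a string-level feature (the token count) to every token's feature dictionary. It is not the code from the linked commits: the tokenizer, the function names, and the feature keys are all hypothetical, and the threshold of 6 simply mirrors the observation made earlier in this thread.

```python
import re


def tokenize(raw):
    # Hypothetical tokenizer: split on whitespace and common separators.
    return [t for t in re.split(r"[\s/()\-]+", raw) if t]


def token_features(token):
    # Token-level features only; deliberately tiny for illustration.
    return {"digits": token.isdigit(), "length": len(token)}


def tokens_to_features(tokens):
    """Build one feature dict per token, each carrying sequence-level info."""
    n_tokens = len(tokens)
    features = []
    for position, token in enumerate(tokens):
        feats = token_features(token)
        # String-level features, repeated on every token so a
        # token-level model can condition on them.
        feats["sequence_length"] = n_tokens
        feats["long_sequence"] = n_tokens > 6
        feats["position"] = position
        features.append(feats)
    return features


features = tokens_to_features(tokenize("720-5/8-4 720-570/402"))
print(features[0])
```

Repeating the sequence length on each token is one simple way to expose a whole-string property to a model that only sees per-token features; whether a raw count or a thresholded flag works better is something to check against the training data.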