jeancochrane opened this issue 4 years ago
I'd recommend the second approach. Should be able to do it with something like a look ahead or look behind in the regex. https://www.regular-expressions.info/lookaround.html
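For concreteness, here is roughly what the lookaround idea could look like; the pattern and helper below are illustrative assumptions, not code from ilcs-parser, and they only handle the cleanest spellings:

```python
import re

# Minimal sketch of the lookahead idea (pattern is an assumption, not a
# rule that ships in ilcs-parser): consume a leading 720-5/8-4 attempt
# code only when the lookahead sees another charge-like token after it.
ATTEMPT_PREFIX = re.compile(r'^720-5/8-4\S*\s+(?=\d{3})')

print(ATTEMPT_PREFIX.sub('', '720-5/8-4 720-570/402'))  # -> '720-570/402'
print(ATTEMPT_PREFIX.sub('', '720-5/8-4'))              # unchanged: nothing follows
```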
Gotcha. A regex approach to looking for the charge prefix is going to be tough because the data are so messy. Here's a quick sample using "attempted" matches from our training data:
| messy | clean |
|---|---|
| 720-5/8-4a (att) | 720-5/8-4 720-570/402 |
| 720-5/16-16-a (att) | 720-5/8-4 720-5/16-1 |
| 720ilcs570/401(a)(2)(a) (att) | 720-5/8-4 720-570/401 |
| 720-5/8-4(720-5/17-3(a)(2)) | 720-5/8-4 720-5/17-3 |
| 720/5 8-4a | 720-5/8-4 720-5/16-25 |
| 720-5/8-4(720-5/9-1a | 720-5/8-4 720-5/9-1 |
| 720-8-4(720-5/18-1(a) | 720-5/8-4 720-5/18-1(a) |
| 720-5/8-4/17-3 | 720-5/8-4 720-5/17-3 |
| 720 5/8-4 17-3 | 720-5/8-4 720-5/17-3 |
| 20-5/8-4-a//720-5/18-1 | 720-5/8-4 720-5/18-1(a) |
| 720 5/8-4 18-2a4 | 720-5/8-4 720-5/18-2 |
| 720 - 5/8-4(19-3(a)) (att) | 720-5/8-4 720-5/19-3 |
| 720 8-4(a) 5 16a-3(a) | 720-5/8-4 720-5/16-25 |
| 720-5/8-4(720 5/19-3a) | 720-5/8-4 720-5/19-3 |
| 720 5/8-4(a) | 720-5/8-4 720-5/31-4 |
I'm going to give it a try and see if I can write a lookahead that will match all these cases.
Not much luck with the regex so far. Since the attempted prefix charges are so messy, another idea I had was to add new labels for them, e.g. `AttemptedChapter` and `AttemptedActPrefix`. That way we wouldn't have to write explicit rules for the attempted prefix and could instead try to get the model to learn the pattern.
However, I did a quick spike to add these ~15 patterns as training data (to about 180 existing pairs) with new `Attempted` labels and it didn't change much. It may just not be a good balance of training data, but I'm wondering if maybe it's because the features aren't rich enough to capture the pattern. In particular, the most important feature in my mind is the number of tokens, since a string with more than 6 tokens is dramatically more likely to have its first tokens be an attempted prefix charge. Is there a way to capture that kind of string-level (as opposed to token-level) feature in the parserator idiom, @fgregg?
let me ponder on this.
I decided to experiment with adding a feature to each token representing the length of the full token sequence in https://github.com/datamade/ilcs-parser/pull/3/commits/7a0e5caa8cfec16518d07078f14700708fbf4629. It seemed to work properly on the first couple of patterns I threw at it, as demonstrated in the new test in https://github.com/datamade/ilcs-parser/pull/3/commits/eb223bcf776fa028e3f529e5042272b013ac83f3. I'm going to try plugging this in to the deduplication pipeline and see how it affects performance.
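For reference, the parserator idiom builds one feature dict per token in `tokens2features`, so a string-level signal like total token count has to be copied into every token's dict. A minimal sketch of that idea follows; the feature names and the stub `tokenFeatures` are assumptions, not the exact code in the linked commits:

```python
def tokenFeatures(token):
    # stand-in for the package's real per-token features
    return {
        'length': len(token),
        'has.digits': any(char.isdigit() for char in token),
    }

def tokens2features(tokens):
    # the CRF only ever sees per-token feature dicts, so a string-level
    # signal has to be repeated in each token's dict
    feature_sequence = [tokenFeatures(token) for token in tokens]

    n_tokens = len(tokens)
    for features in feature_sequence:
        features['seq.length'] = n_tokens
        # long strings are much more likely to begin with an attempted
        # prefix charge (the >6-token heuristic from the comment above)
        features['seq.more.than.six.tokens'] = n_tokens > 6

    return feature_sequence

print(tokens2features('720-5/8-4 720-570/402'.split()))
```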
There are two primary ways in which attempted charges are represented in our messy data:

1. An `(att)` suffix is appended to the charge, like `430-6/5-2a1 (att)`
2. The attempt charge, `720-5/8-4`, is prepended to the charge, like `720-5/8-4 720-570/402`
Pattern 1) is relatively simple to detect, and we have a dedicated label for it in the ILCS parser. Pattern 2) is harder to detect since it comes out looking like two entirely separate charges. In addition, the existence of the two different types of patterns makes it difficult to match against the canonical set, since we can only have one representation of an attempted charge in our canonical set.
There are a few ways I can think to approach this:

1. Find instances of `720-5/8-4` in the messy data and replace it with the `(att)` suffix (see the sketch below). This will work well for cases of pattern 2) where there are two separate charges, but unfortunately sometimes the attempted code is the only recorded charge, and I'm not sure how those cases will behave. It'll also be tricky because the instances of `720-5/8-4` are not all formatted the same way.
2. Parse `720-5/8-4` as a separate label indicating an attempted charge. This is potentially the most semantically correct way to handle things but also seems difficult from a parsing perspective.

What do you think @fgregg?
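For approach 1), a rough sketch of what the preprocessing could look like, assuming everything gets normalized to the `(att)` suffix form; the pattern and function name are illustrative, it only covers the cleanest spellings of the attempt code, and the lone-attempt-code branch is exactly the open question above:

```python
import re

# Illustrative sketch only, not anything implemented in ilcs-parser:
# rewrite pattern 2) -- a prepended 720-5/8-4 -- into pattern 1), an
# "(att)" suffix on the underlying charge.
ATTEMPT_CODE = re.compile(r'720[- ]?5\s*/\s*8-4a?', re.IGNORECASE)

def normalize_attempt(raw_charge):
    stripped = raw_charge.strip()
    match = ATTEMPT_CODE.match(stripped)
    if not match:
        return raw_charge                          # no prepended attempt code
    remainder = stripped[match.end():].strip(' ()-/')
    if not re.match(r'\d{3}', remainder):
        # the attempt code is the only recorded charge -- the open question
        # above about how those cases should behave
        return raw_charge
    return f'{remainder} (att)'

print(normalize_attempt('720-5/8-4 720-570/402'))   # -> '720-570/402 (att)'
print(normalize_attempt('720-5/8-4(720-5/19-3a)'))  # -> '720-5/19-3a (att)'
print(normalize_attempt('720-5/8-4'))               # unchanged: lone attempt code
```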