allenai / mmda

multimodal document analysis
Apache License 2.0

Kylel/svm word predictor disjoint tokens #262

Closed kyleclo closed 1 year ago

kyleclo commented 1 year ago

I had an implementation bug in `SVMWordPredictor`, introduced while I was trying to get words to respect token boundaries. Nothing too interesting: I mis-implemented a for-loop so that a word could end up with disjoint tokens, as in:

token 3 is not punctuation -> word 3
token 4 is punctuation -> word 4
token 5 is not punctuation -> word 3

The fix is to pull this for-loop out into its own function, `_group_adjacent_with_exceptions()`, and add specific tests for it.
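
For illustration, here is a minimal sketch of the grouping idea. It is not the code from this PR: the function name `group_adjacent_sketch`, its signature, and the `labels` / `exceptions` inputs are all hypothetical. It assumes each token arrives with a proposed word label and that exception tokens (e.g. punctuation) must stand alone, so a word can never resume after an exception.

```python
from typing import List, Set


def group_adjacent_sketch(labels: List[int], exceptions: Set[int]) -> List[List[int]]:
    """Group token indices into words made of contiguous tokens only.

    Hypothetical helper, not the mmda implementation.
    labels[i] is the proposed word label for token i.
    exceptions holds token indices that must form their own group.
    """
    groups: List[List[int]] = []
    for i, label in enumerate(labels):
        prev = i - 1
        # Extend the current group only if the previous token exists,
        # neither token is an exception, and both share the same label.
        can_extend = (
            bool(groups)
            and i not in exceptions
            and prev not in exceptions
            and labels[prev] == label
        )
        if can_extend:
            groups[-1].append(i)
        else:
            # Otherwise start a new group; earlier groups are never reopened,
            # so no word can contain disjoint tokens.
            groups.append([i])
    return groups


# Tokens 2 and 4 share a label, but token 3 is an exception between them.
example = group_adjacent_sketch(labels=[0, 0, 1, 1, 1], exceptions={3})
```

Because a new group is only ever appended at the end, the token on the far side of an exception starts a fresh word instead of rejoining the earlier one, which is the situation described above.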