dennlinger closed 2 years ago
Unfortunately, this PR also includes some preliminary experiments on MLSUM, which technically do not belong here and should go into a separate PR. Importantly, though, the results exactly match those obtained by Philip May, whose post is linked above.
I also realized that a single function analyzing samples might be counterproductive, since it is not clear whether (or how many) samples have several issues at once (e.g., empty samples will also turn up as having a "longer/equal summary than reference text length"). Instead, the checks remain separate functions for now and might be tied together later.
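To illustrate the overlap problem described above, here is a minimal sketch of two independent checks. The function names, the `(reference, summary)` tuple layout, and the character-based length comparison are assumptions made for illustration only; they are not the functions added in this PR.

```python
from typing import List, Tuple

# Hypothetical sample layout: (reference_text, summary_text)
Sample = Tuple[str, str]


def find_empty_samples(samples: List[Sample]) -> List[int]:
    """Return indices where the reference or the summary is empty after stripping."""
    return [
        idx for idx, (reference, summary) in enumerate(samples)
        if not reference.strip() or not summary.strip()
    ]


def find_overlong_summaries(samples: List[Sample]) -> List[int]:
    """Return indices where the summary is at least as long (in characters) as the reference."""
    return [
        idx for idx, (reference, summary) in enumerate(samples)
        if len(summary) >= len(reference)
    ]


samples = [
    ("A long reference text about something.", "Short summary."),
    ("", "Summary without a reference."),
    ("Tiny.", "A summary that is longer than its reference."),
]

empty = find_empty_samples(samples)
overlong = find_overlong_summaries(samples)

# Samples flagged by both checks; a single combined counter would double-count these.
flagged_by_both = sorted(set(empty) & set(overlong))
print(empty, overlong, flagged_by_both)
```

Keeping the checks separate makes the overlap explicit: the sample with an empty reference is reported by both functions, so any later combined report would need to decide how to attribute it.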
I also realized that there are some inconsistencies with respect to lemmatization (see issue #33), which is not yet fully propagated to the "lower-level" functions.
Current draft proposal of tools for the analysis of sequences. There are several ideas that are incorporated here:
Also includes a minor bug fix for the existing aligners.