DISSINET / InkVisitor

An open-source, browser-based front-end application for the collection of complex structured data from textual resources in history and the social sciences into a RethinkDB database for further analysis.
BSD 3-Clause "New" or "Revised" License

Random sampler of verbs under Segmenter #2114

Open davidzbiral opened 4 months ago

davidzbiral commented 4 months ago

As a development feature of the statement segmentor, introduce a special mode: instead of fully covering the selected span (or rather the complete full text, in this case?), sample the verbs in the span in the same way as we did for the gold-standard person-matching dataframes, and create statements in the current T (typically a whole register) only around these randomly sampled verbs. Let the user choose to receive either a given % of verbs, and thus of statements (counted after the infinitive + definite verb correction, typically for modals), or a given absolute number. Default = %. In T statement lists generated by this procedure, randomize the order of statements, but also add their anchors in the usual way, as the Statement segmentor does (incl. the user's confirmation). The order of the anchors in the text will later allow us to recover the order of appearance if needed.
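A rough sketch of the sampling step described above, only to make the request concrete. Everything here is an assumption for illustration: `VerbToken`, `SampleSize`, `sampleVerbs` and `inTextOrder` are hypothetical names, not existing InkVisitor code and not Gideon's existing sampling. It assumes the verb list has already been through the infinitive + modal correction, so each entry stands for one future statement.

```typescript
// Hypothetical shape of a verb detected in the span (not InkVisitor's real type).
interface VerbToken {
  lemma: string;
  offset: number; // character offset in the text, i.e. where the anchor will sit
}

// The user asks either for a percentage of verbs (the default) or an absolute number.
type SampleSize =
  | { kind: "percent"; value: number } // 0..100
  | { kind: "absolute"; value: number };

// Small seedable PRNG (mulberry32), so that a given sample can be reproduced.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy; the input array is left untouched.
function shuffle<T>(items: readonly T[], rand: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Returns the sampled verbs in random order, i.e. the order of the statement list in T.
function sampleVerbs(
  verbs: readonly VerbToken[],
  size: SampleSize = { kind: "percent", value: 10 }, // the 10 % default is an assumption
  rand: () => number = Math.random
): VerbToken[] {
  const requested =
    size.kind === "percent"
      ? Math.round((verbs.length * size.value) / 100)
      : size.value;
  // Clamp to what is actually available in the span.
  const n = Math.max(0, Math.min(verbs.length, Math.floor(requested)));
  return shuffle(verbs, rand).slice(0, n);
}

// The anchors keep the text position, so the order of appearance can be recovered later.
function inTextOrder(sample: readonly VerbToken[]): VerbToken[] {
  return [...sample].sort((a, b) => a.offset - b.offset);
}
```

For example, `sampleVerbs(verbs, { kind: "percent", value: 30 }, mulberry32(42))` on 10 verbs gives 3 verbs in shuffled order, and `inTextOrder` on the result puts them back in order of appearance. Shuffling the whole list and taking the first n is a simple way to sample without replacement; the fixed seed is only there so that a sample can be regenerated.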

This sample, after manual annotation, will then be able to serve as randomly sampled gold standard data and training data for that Resource (and beyond), e.g.:

Make the function a little "hidden": not really hidden, but placed so that it is hard to click on by mistake, since it has only special uses. Use Gideon's already-existing sampling for gold-standard dataframes.

davidzbiral commented 3 months ago

Moving this to an earlier milestone; we will need it earlier.