IPS-LMU / EMU-webApp

The EMU-webApp is an online and offline web application for labeling, visualizing and correcting speech and derived speech data.
http://ips-lmu.github.io/EMU-webApp/
MIT License

Sorting vs. randomizing bundle lists #206

Open MJochim opened 7 years ago

MJochim commented 7 years ago

The EMU-webApp displays the bundle list in the left sidebar, grouped by session. The bundles within the sessions are displayed in the order they appear in the bundle list. They are neither randomized nor sorted.

The order matters because it is the most probable order in which editors will work on the bundles. I fear that any sorting/grouping in a bundle list will give rise to systematic errors (as opposed to random errors) in manual segmentation labour. We should consider removing the grouping.

However, the feature remains very useful when browsing rather than editing the database. We should probably just add a checkbox or similar control to enable/disable grouping.

The issue

In the emuDB Manager, when creating a new bundle list, the default is now to randomize the order (see https://github.com/IPS-LMU/emuDB-manager/issues/3). This is not grouped by session, so creating a bundle list with the items [a_ses/1_bndl, a_ses/2_bndl, b_ses/1_bndl, b_ses/2_bndl] might result in this order:

When I load this bundle list in the EMU-webApp, it effectively re-applies the "group by session" effect. The order in which I would probably work on these files is this:

Part of the randomization generated by the emuDB Manager is lost in the EMU-webApp.
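To illustrate the effect, here is a minimal Python sketch. It is not the EMU-webApp's actual code. The `group_by_session` helper and the shuffled order are hypothetical, and I am assuming that sessions are shown in order of first appearance while bundles keep their relative order within each session.

```python
def group_by_session(bundles):
    """Group bundles by session (hypothetical helper, not EMU-webApp code).

    Assumption: sessions appear in order of first occurrence, and bundles
    keep their relative order from the bundle list within each session.
    """
    groups = {}
    for session, name in bundles:
        groups.setdefault(session, []).append((session, name))
    # dicts preserve insertion order, so sessions stay in first-seen order
    return [bundle for group in groups.values() for bundle in group]


# One possible randomized order (made up for illustration).
shuffled = [
    ("b_ses", "2_bndl"),
    ("a_ses", "1_bndl"),
    ("b_ses", "1_bndl"),
    ("a_ses", "2_bndl"),
]

displayed = group_by_session(shuffled)
for session, name in displayed:
    print(f"{session}/{name}")
```

In this sketch the interleaving of sessions is lost: both `b_ses` bundles come first, followed by both `a_ses` bundles. Only the within-session order survives from the shuffle.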

Why does this matter?

There are two rationales behind the randomization in emuDB Manager. One is concerned with distribution of bundles among multiple editors. It doesn't matter much here. The important one for this issue is concerned with the consistency of a single editor's results. It ultimately affects the validity and reliability of the scientific analysis.

Manual segmentation or correction of MAUS results is a time-consuming, tedious task, and it is challenging to the editor's concentration. Often there is some room for interpretation, and a segment length can be either over- or underestimated. For the reliability of the analysis it would be best if it were always overestimated or always underestimated. A mix of the two would be acceptable if it happened completely at random.

Now I suspect that over the course of hours of manual segmentation labour, an editor's preference might reverse from time to time. If this is true and the bundle list is in any way sorted, this may heavily degrade the validity of the analysis: overestimation may (accidentally) occur in one experimental condition, and underestimation in another. Any effect found (or not found) may then be due to an inconsistency in the segmentations.
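A toy calculation makes the concern concrete. This is only a sketch under invented assumptions: the editor overestimates by 5 ms during the first half of the list and underestimates by 5 ms during the second half, and there are two conditions, `A` and `B`, with 50 bundles each. None of these numbers come from real data.

```python
import random


def mean_bias_by_condition(order, bias_ms=5.0):
    """Mean segmentation bias per condition for a given working order.

    Assumption (made up for illustration): the editor overestimates by
    bias_ms in the first half of the list and underestimates by bias_ms
    in the second half.
    """
    half = len(order) // 2
    totals, counts = {}, {}
    for position, condition in enumerate(order):
        bias = bias_ms if position < half else -bias_ms
        totals[condition] = totals.get(condition, 0.0) + bias
        counts[condition] = counts.get(condition, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}


# Sorted list: all bundles of condition A first, then all of condition B.
sorted_order = ["A"] * 50 + ["B"] * 50

# Randomized list: same bundles, shuffled with a fixed seed for repeatability.
shuffled_order = sorted_order[:]
random.Random(0).shuffle(shuffled_order)

sorted_bias = mean_bias_by_condition(sorted_order)
shuffled_bias = mean_bias_by_condition(shuffled_order)

sorted_gap = sorted_bias["A"] - sorted_bias["B"]
shuffled_gap = shuffled_bias["A"] - shuffled_bias["B"]

print("gap between conditions, sorted list:  ", sorted_gap)
print("gap between conditions, shuffled list:", shuffled_gap)
```

With the sorted list, the editor's drift lines up exactly with the conditions, so the whole 10 ms swing appears as a spurious difference between `A` and `B`. With the shuffled list, both conditions receive a mix of over- and underestimation, and the gap shrinks accordingly.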

I think this risk can be reduced if the bundle lists are sensibly randomized. We should therefore consider removing (or offering the option to remove) the "group by session" effect which is implemented in the EMU-webApp.