IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html
23 stars 15 forks source link

create_itemsInLevel does not order annotation items chronologically in JSON files #273

Open rpuggaardrode opened 10 months ago

rpuggaardrode commented 10 months ago

When using create_itemsInLevel() to programmatically annotate, annotation items end up being ordered in the JSON files in the order they are added and not the chronological order relative to the sound file. This creates a bug in the webApp, such that can be difficult or impossible to select annotation items.

In my case I have a level of type EVENT. When I use the tab to select the next item, the webApp will select the item that was created next, and not the next item in the sound file. Also the highlighted item is not necessarily the one closest to the cursor, which makes it impossible to change or delete items. I suppose the JSON file should be reordered so that the samplePoints are chronological when using create_itemsInLevel().

MJochim commented 10 months ago

Strictly speaking, this is a bug in the EMU-webApp and not in emuR, since the _annot.json format does not specify that the items need to be in chronological order [1]. But it also wouldn’t hurt for emuR to store them chronologically.

[1] https://ips-lmu.github.io/The-EMU-SDMS-Manual/app-chap-fileFormats.html#subsec:app-chapFileFormatsAnnotJSON

MJochim commented 10 months ago

I have dug a bit and found out that emuR does in fact go out of its way to make sure they are stored chronologically. But there’s a bug. And very importantly, this bug does not only affect the webApp, it also affects query(). query() returns the items in the wrong order, because the field start_item_seq_idx of the segment list does not correspond to the chronological order, which is pretty bad.

Apparently the data frame passed to create_itemsInLevel() needs to include a db_uuid column. Unfortunately, that is not documented. I also think it is a mistake. This column should not be required.

@rpuggaardrode Can you try adding a db_uuid column to your data frame and see if this fixes your problem? The column can have any value other than NA. You can pass in "lol" if you feel like it. The value is ignored. You can also pass in my_db_handle$UUID to make it look cleaner.

For the record, here’s how to reproduce this bug:

create_emuRdemoData("~/")
db_ae = load_emuDB("~/emuR_demoData/ae_emuDB/")

itemsToCreate = data.frame(session = "0000",
                           bundle = "msajc003",
                           level = "Tone",
                           attribute = "Tone",
                           labels = "new",
                           start = 1000,
                           # db_uuid = db_ae$UUID ### <<- look here
                           )

create_itemsInLevel(db_ae,
                    itemsToCreate = itemsToCreate)

query(db_ae, "Tone =~ .*", bundlePattern = "003") %>% 
  dplyr::select(labels, start, end, start_item_seq_idx)

If you leave out db_uuid, the query() at the end returns a wrong order. The order is correct if db_uuid has any value other than NA.

And this is the offending line:

https://github.com/IPS-LMU/emuR/blob/543d5fcd66c665e2c0162ecfd83738498b1cf9b4/R/emuR-annotations_crud.R#L195

I don’t know why db_uuid.y and is.na() were used here in the first place. I will look for a better way once you confirm that we are talking about the same bug.

This probably only happens in EVENT type levels, because for SEGMENT and ITEM, the strategies for mixing existing and new items are very different (SEGMENT: refuse to do anything if there are already existing items on the level; ITEM: user has to pass in unique sequence indices).

rpuggaardrode commented 10 months ago

Yes, adding a db_uuid column fixed it, thanks!

Actually, I think this affects SEGMENT tiers as well. A few weeks ago I was trying to use create_itemsInLevel() to add some segment tiers. I set the start time using a start_item_seq_idx column, and didn't bother setting end times because I just wanted the interval to end when the next interval in the same level started. But instead of doing that, it inserted an interval that lasted until the end of the sound file and fully overlapped with the next interval in the same level (see pics below).

image image

It's very possible that I messed something up, but as far as I can tell this behavior shouldn't be possible at all in the webApp. I wonder if this had the same cause.

MJochim commented 6 months ago

@rpuggaardrode, you don’t happen to have an example of this problem with a SEGMENT level still available? I know it’s been more than three months, so no worries if you don’t. I’m asking because, as I understand, you had a SEGMENT level that already contained some segments and you used create_itemsInLevel() to add more. emuR should have flatly refused to do that. There is code in create_itemsInLevel() that is supposed to stop in such cases:

https://github.com/IPS-LMU/emuR/blob/fcf86945d24b35b67ba71a9cf8076f10309fbf17/R/emuR-annotations_crud.R#L280-L289

But apparently, that approach with querying using sessionPattern and bundlePattern is flawed.

MJochim commented 6 months ago

On another note, may or may not be related:

I am in the process of changing create_itemsInLevel() to some degree, with regard to this:

and didn't bother setting end times because I just wanted the interval to end when the next interval in the same level started.

Currently, even if you did set end times explicitly, they would be ignored and instead calculated based on the next interval’s start time; or in the case of the last segment on a level, based on the duration of the annotated media file. In the current design of create_itemsInLevel(), this is the only way how segment end times are generated.

Another user was bitten by this, because she wanted to copy the result of some query() call to a newly-created, empty level. That was fine for all segments but the last one. The last segment on the newly-created level now invariably ended at the end of the annotated file instead of the end of the original segment on the original level.

rpuggaardrode commented 6 months ago

Sorry, I've cleaned the problematic examples from both the emuDB and my code, and I'm not even completely sure how to recreate it. I wonder what's wrong with that check in emuR-annotations_crud, superficially it looks fine.

Currently, even if you did set end times explicitly, they would be ignored and instead calculated based on the next interval’s start time; or in the case of the last segment on a level, based on the duration of the annotated media file. In the current design of create_itemsInLevel(), this is the only way how segment end times are generated.

Right, this tracks with my experience at this point, and is maybe a little unfortunate because of workflows like the one you mentioned. I think intuitively people would like to treat these like intervals, but the current approach is more akin to treating them like points. I see now that the documentation is clear about needing only a start time, but calling it a start time is maybe a little misleading.

MJochim commented 6 months ago

Alright, thanks for your reply.

I think intuitively people would like to treat these like intervals,

And they absolutely should do that, since it’s a SEGMENT level and not an EVENT level.

but the current approach is more akin to treating them like points.

Yes and that can be a problem depending on what the user is expecting. I think both these perspectives are relevant use cases:

a) I already have start and end time in my data (e.g. because I am basing my new data on a query() result), my segments are therefore fully specified. b) I have a list of boundary times (i.e. points) and I want to turn them into segments. I don’t want to bother with the maths of turning boundaries into fully-specified pairs of start and end time. (Annoying details would be converting milliseconds into samples and possibly determining the exact duration of the annotated file).

I therefore intend to add a new parameter calculateEndTimeForSegments so the user can choose, and I hope to provide guidance through this choice via appropriate warnings and documentation. I’ll show you when I’m done, maybe you have some feedback.

rpuggaardrode commented 6 months ago

This sounds like totally the right approach imo, and having something like an optional column of end times in the itemsToCreate object should still be backwards compatible. Looking forward! this sounds like it'll ease some of my workflow -- if I've programmatically determined boundaries through some landmarks in the sound file, I usually end up with a format like a) and it takes some fiddling to get the format right.

MJochim commented 6 months ago

@rpuggaardrode: I have changed create_itemsInLevel() now. If you have time, you could install the current version via devtools::install_github("IPS-LMU/emuR") and have a look at the documentation. Any feedback is welcome, particularly on whether the documentation is now understandable, the defaults make sense and the naming of the new parameters.