IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

If segment lists had an explicit sl_rowIdx column, data processing would become much nicer. #244

Closed: FredrikKarlssonSpeech closed this issue 3 years ago

FredrikKarlssonSpeech commented 3 years ago

I may be missing some obvious advantages of the current implementation, but the "sl_rowIdx" column that is set up in the track data result tibbles would be very nice to have in the original query result tibble as well.

It is quite common to want to compute some summary statistics per segment, and also quite common to need to combine output from two tracks belonging to the same segments.

Currently, the nicest way I can think of to do that is this:

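# Assumes `ae` is an already loaded emuDB handle (e.g. the emuR demo database)
library(emuR)
library(dplyr)

# I need two simple tracks for the example
add_ssffTrackDefinition(ae, "f0", "F0", onTheFlyFunctionName = "ksvF0")
list_ssffTrackDefinitions(ae)
a <- query(ae, "Phonetic == V")
afm <- get_trackdata(ae, a, "fm")
af0 <- get_trackdata(ae, a, "f0")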

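# Per-segment mean and median of the f0 track
af0 %>%
  group_by(sl_rowIdx) %>%
  summarise(f0 = mean(T1), f0med = median(T1)) -> af0m

# Per-segment mean and median of the first column (T1) of the fm track
afm %>%
  group_by(sl_rowIdx) %>%
  summarise(f1 = mean(T1), f1med = median(T1)) -> afmm

# Join both summaries back onto the segment list via sl_rowIdx
a %>%
  tibble::rownames_to_column(var = "sl_rowIdx") %>% ## These rows I think should not be needed
  mutate(sl_rowIdx = as.integer(sl_rowIdx)) %>% ## These rows I think should not be needed
  left_join(af0m) %>%
  left_join(afmm)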

Would it not be very nice to be able to do this instead?

a %>%
  left_join(af0m) %>%
  left_join(afmm)

You would still get the three data sources (one segment list and two different tracks) combined neatly and safely, with little risk of misaligned rows.
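To try the pattern without a loaded emuDB, here is a minimal sketch on mock data. The tibbles `sl`, `td_f0` and `td_fm` are hypothetical stand-ins for the results of query() and get_trackdata() (the real results carry more columns), and the values are made up. It also uses mutate(sl_rowIdx = row_number()) as one possible way to add the index in a single step; that is a suggestion, not something emuR does for you.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for a query() result: three segments
sl <- tibble(
  labels        = c("V", "V", "V"),
  bundle        = c("b1", "b1", "b2"),
  start_item_id = c(10L, 14L, 7L),
  end_item_id   = c(10L, 14L, 7L)
)

# Hypothetical stand-ins for get_trackdata() results: two samples per segment
td_f0 <- tibble(
  sl_rowIdx = c(1L, 1L, 2L, 2L, 3L, 3L),
  T1        = c(100, 110, 120, 140, 90, 100)
)
td_fm <- tibble(
  sl_rowIdx = c(1L, 1L, 2L, 2L, 3L, 3L),
  T1        = c(500, 520, 300, 320, 700, 740)
)

# Per-segment summaries, keyed on sl_rowIdx
f0_sum <- td_f0 %>%
  group_by(sl_rowIdx) %>%
  summarise(f0 = mean(T1), f0med = median(T1))
fm_sum <- td_fm %>%
  group_by(sl_rowIdx) %>%
  summarise(f1 = mean(T1), f1med = median(T1))

# Add the row index to the segment list in one step, then join
res <- sl %>%
  mutate(sl_rowIdx = row_number()) %>%
  left_join(f0_sum, by = "sl_rowIdx") %>%
  left_join(fm_sum, by = "sl_rowIdx")

print(res)
```

Passing `by = "sl_rowIdx"` explicitly avoids the message dplyr prints when it has to guess the join columns.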

raphywink commented 3 years ago

While I get that it would be easier if the column was there so the left_join() "just works", I don't really like the idea of duplicating what rownames(sl) already gives you in an additional column. Further, you don't actually need sl_rowIdx, as it is meant more as a quick visual aid for the user. The unique key is made up of the columns "db_uuid", "session", "bundle", "start_item_id" and "end_item_id", which is why you can do the following:

af0 %>%
  group_by(db_uuid, session, bundle, start_item_id, end_item_id) %>%
  summarise(f0=mean(T1),f0med=median(T1)) -> af0m

afm %>%
  group_by(db_uuid, session, bundle, start_item_id, end_item_id) %>%
  summarise(f1=mean(T1),f1med=median(T1)) -> afmm

a %>%
  left_join(af0m, by = c("db_uuid", "session", "bundle", "start_item_id", "end_item_id")) %>%
  left_join(afmm, by = c("db_uuid", "session", "bundle", "start_item_id", "end_item_id"))

So I'm not too sure I'd want to add this to the default result of query(). Adding those values after the fact is also just a one-liner: a$sl_rowIdx = as.integer(rownames(a))
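A small sketch on mock data may help show what the composite key buys you. Everything below is hypothetical: `sl` and `td` only imitate the key columns of a segment list and of a track data tibble, and the values are made up. The scenario assumed here is that the segment list is filtered after the track data was extracted. Because tibbles do not keep row names, an index recomputed from row position then no longer matches sl_rowIdx, while the key columns still do.

```r
library(dplyr)
library(tibble)

key <- c("db_uuid", "session", "bundle", "start_item_id", "end_item_id")

# Hypothetical segment list with the key columns
sl <- tibble(
  labels        = c("V", "E", "V"),
  db_uuid       = "mock-uuid",
  session       = "0000",
  bundle        = c("b1", "b1", "b2"),
  start_item_id = c(10L, 14L, 7L),
  end_item_id   = c(10L, 14L, 7L)
)

# Hypothetical track data extracted for all three segments
td <- tibble(
  sl_rowIdx     = c(1L, 1L, 2L, 2L, 3L, 3L),
  db_uuid       = "mock-uuid",
  session       = "0000",
  bundle        = c("b1", "b1", "b1", "b1", "b2", "b2"),
  start_item_id = c(10L, 10L, 14L, 14L, 7L, 7L),
  end_item_id   = c(10L, 10L, 14L, 14L, 7L, 7L),
  T1            = c(100, 110, 120, 140, 90, 100)
)

f0_by_key <- td %>%
  group_by(across(all_of(key))) %>%
  summarise(f0 = mean(T1), .groups = "drop")
f0_by_idx <- td %>%
  group_by(sl_rowIdx) %>%
  summarise(f0 = mean(T1))

# Filter the segment list AFTER extraction: original rows 1 and 3 remain
sl_v <- sl %>% filter(labels == "V")

# Joining on the composite key stays aligned
by_key <- sl_v %>% left_join(f0_by_key, by = key)

# Recomputing the index from row position picks up segment 2's value
by_pos <- sl_v %>%
  mutate(sl_rowIdx = row_number()) %>%
  left_join(f0_by_idx, by = "sl_rowIdx")

by_key$f0  # 105  95
by_pos$f0  # 105 130 (second value belongs to the dropped "E" segment)
```

If the index is added to the segment list before any filtering, both approaches give the same result; the difference only shows up when rows are dropped or reordered in between.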

FredrikKarlssonSpeech commented 3 years ago

I agree that my suggestion borders on code golf. :-)

Still, analysis code would become much tidier. Of course, I don't have to maintain the code later, so I can also see that it may not be worth the effort to change how query() works in general.