IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Levels in emuRsegs #214

Closed FredrikKarlssonSpeech closed 5 years ago

FredrikKarlssonSpeech commented 5 years ago

Hi,

I would like to suggest two small revisions to emuRsegs that I think would make it easier to work with the class going forward.

First, I think the 'level' column of the data structure is misnamed. It should really be called 'attribute', since what we query is an attribute, not a level. I demonstrate this using the ae database:

> list_levelDefinitions(ae_test)
          name    type nrOfAttrDefs        attrDefNames
1    Utterance    ITEM            1          Utterance;
2 Intonational    ITEM            1       Intonational;
3 Intermediate    ITEM            1       Intermediate;
4         Word    ITEM            3 Word; Accent; Text;
5     Syllable    ITEM            1           Syllable;
6      Phoneme    ITEM            1            Phoneme;
7     Phonetic SEGMENT            1           Phonetic;
8         Tone   EVENT            1               Tone;
9         Foot    ITEM            1               Foot;
> head(query(ae_test,"Accent=~ .*"))
segment  list from database:  ae 
query was:  Accent=~ .* 
  labels    start      end session   bundle  level type
1      S  187.425  674.175    0000 msajc003 Accent ITEM
2      W  674.175  739.925    0000 msajc003 Accent ITEM
3      S  739.925 1289.425    0000 msajc003 Accent ITEM
4      W 1289.425 1463.175    0000 msajc003 Accent ITEM
5      W 1463.175 1634.425    0000 msajc003 Accent ITEM
6      W 1634.425 2150.175    0000 msajc003 Accent ITEM
> str(.Last.value)
Classes ‘emuRsegs’, ‘emusegs’ and 'data.frame': 6 obs. of  16 variables:
 $ labels            : chr  "S" "W" "S" "W" ...
 $ start             : num  187 674 740 1289 1463 ...
 $ end               : num  674 740 1289 1463 1634 ...
 $ utts              : chr  "0000:msajc003" "0000:msajc003" "0000:msajc003" "0000:msajc003" ...
 $ db_uuid           : chr  "0fc618dc-8980-414d-8c7a-144a649ce199" "0fc618dc-8980-414d-8c7a-144a649ce199" "0fc618dc-8980-414d-8c7a-144a649ce199" "0fc618dc-8980-414d-8c7a-144a649ce199" ...
 $ session           : chr  "0000" "0000" "0000" "0000" ...
 $ bundle            : chr  "msajc003" "msajc003" "msajc003" "msajc003" ...
 $ start_item_id     : int  2 24 30 43 52 61
 $ end_item_id       : int  2 24 30 43 52 61
 $ level             : chr  "Accent" "Accent" "Accent" "Accent" ...
 $ start_item_seq_idx: int  1 2 3 4 5 6
 $ end_item_seq_idx  : int  1 2 3 4 5 6
 $ type              : chr  "ITEM" "ITEM" "ITEM" "ITEM" ...
 $ sample_start      : int  3749 13484 14799 25789 29264 32689
 $ sample_end        : int  13483 14798 25788 29263 32688 43003
 $ sample_rate       : int  20000 20000 20000 20000 20000 20000
 - attr(*, "query")= chr "Accent=~ .*"
 - attr(*, "type")= chr "segment"
 - attr(*, "database")= chr "ae"

This is a breaking change, but I think it is necessary: the current column name is simply wrong, and correcting it would make the system easier for users and for developers who want to help out.

The second suggestion is smaller, but I think helpful. Since the level that has a given attribute is known at query time, I would like that level to be stored as an attribute of the emuRsegs object when it is created, just as "query", "type" and "database" are now. This would allow restructuring https://github.com/IPS-LMU/emuR/blob/151c03b801cadf7b19fbde120788090ac60c6500/R/emuR-create_seglists.R#L2 into an 'as_tibble.emuRsegs' method, which I think would be helpful to have going forward.
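
To make the second suggestion concrete, here is a minimal base R sketch on a toy data frame rather than a real query result. The "level" object attribute and the helper `add_level_column()` are hypothetical and not part of emuR; the values follow the example above, where Accent is an attribute of the Word level.

```r
# Toy stand-in for a query result; real emuRsegs objects carry more columns.
sl <- data.frame(
  labels = c("S", "W", "S"),
  start  = c(187.425, 674.175, 739.925),
  end    = c(674.175, 739.925, 1289.425),
  level  = "Accent",  # column as currently named: it really holds the attribute
  stringsAsFactors = FALSE
)
class(sl) <- c("emuRsegs", "emusegs", "data.frame")
attr(sl, "query") <- "Accent=~ .*"
attr(sl, "type") <- "segment"
attr(sl, "database") <- "ae"

# Proposed addition (hypothetical): remember which level owns the attribute.
attr(sl, "level") <- "Word"

# Hypothetical helper: rename the misnamed column and add the true level.
add_level_column <- function(x) {
  out <- x
  class(out) <- "data.frame"                          # drop the emuRsegs classes
  names(out)[names(out) == "level"] <- "attribute"    # old 'level' is the attribute
  out$level <- attr(x, "level", exact = TRUE)         # recycled to every row
  out
}

res <- add_level_column(sl)
names(res)
```

A real 'as_tibble.emuRsegs' method could do the same renaming before handing the data to tibble; the sketch stays in base R so that it runs without extra packages.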

FredrikKarlssonSpeech commented 5 years ago

The title did not say much.

raphywink commented 5 years ago

This is fixed in version 2.0.2 that I released yesterday! The default resultType is now tibble with the following columns:

> names(sl)
 [1] "labels"             "start"              "end"                "db_uuid"            "session"           
 [6] "bundle"             "start_item_id"      "end_item_id"        "level"              "attribute"         
[11] "start_item_seq_idx" "end_item_seq_idx"   "type"               "sample_start"       "sample_end"        
[16] "sample_rate"    

"level" and "attribute" are separate columns now, and you are correct that "level" used to be misleading!
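
For illustration only, a base R sketch of how the two separate columns can be used together. The rows below are made up to match the column names listed above and are not real query output; the sketch assumes "level" holds the name of the level that owns the attribute.

```r
# Toy rows shaped like the new result; values are illustrative only.
sl <- data.frame(
  labels    = c("S", "W", "amongst", "her"),
  level     = c("Word", "Word", "Word", "Word"),
  attribute = c("Accent", "Accent", "Text", "Text"),
  stringsAsFactors = FALSE
)

# Select the rows for one attribute of one level.
accents <- sl[sl$level == "Word" & sl$attribute == "Accent", ]

# Count rows per level/attribute pair.
counts <- table(sl$level, sl$attribute)
```

With both columns present, results for different attributes of the same level can be kept in one table and still be told apart.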

FredrikKarlssonSpeech commented 5 years ago

Fantastic! This makes it very clear what data structure should be focused on going forward.