IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html
Key value store for metadata #130

Open MJochim opened 7 years ago

MJochim commented 7 years ago

@raphywink: I think this is what we came up with when we discussed the issue several weeks back.

We are planning to enable metadata storage via JSON files. The plan is to allow – but not require – one file called metadata.json at the database, session and bundle level.

Database example

A database might be structured like this, e.g.:

The JSON files are scoped hierarchically. That is, in the above example:

Option 1

Bundle-specific metadata.json files take precedence over session-specific files, which take precedence over database-wide files. This means that if a key is defined on multiple levels with different values, the value at the lowest level overrides the others.

Option 2

No values are overridden. Using the same key on multiple levels means that the bundles in question have multiple values for the given key.
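The two options differ only in how values for the same key are combined across levels. A minimal Python sketch of both, assuming each level's metadata has already been loaded into a plain dict; the function names and sample values are hypothetical and not part of emuR:

```python
def resolve_override(database, session, bundle):
    """Option 1: the lowest level that defines a key wins."""
    resolved = {}
    # Apply levels from most general to most specific, so that
    # later (lower) levels overwrite earlier (higher) ones.
    for level in (database, session, bundle):
        resolved.update(level)
    return resolved


def resolve_accumulate(database, session, bundle):
    """Option 2: keep every value defined for a key, ordered top-down."""
    resolved = {}
    for level in (database, session, bundle):
        for key, value in level.items():
            resolved.setdefault(key, []).append(value)
    return resolved


# Illustrative data only: one key defined on two levels, one on a single level.
database = {"someExampleKey": "database value"}
session = {"someExampleKey": "session value"}
bundle = {"aVeryUsefulKey": "bundle value"}

option1 = resolve_override(database, session, bundle)
option2 = resolve_accumulate(database, session, bundle)
```

Under Option 1 every key ends up with exactly one value, which maps directly onto one column per key. Under Option 2 each key maps to a list, so any consumer has to decide how to handle several values per bundle.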

metadata.json and DBconfig.json

While the metadata.json files allow for arbitrary key value pairs, all keys must be defined in the DBconfig.json file. The definitions might be something like this, at the top level of the DBconfig.json:

{
  "metadataDefinitions": [
    {
      "name": "someExampleKey",
      "type": "STRING"
    },
    {
      "name": "aVeryUsefulKey",
      "type": "STRING"
    }
  ]
}

metadata.json example

An actual metadata.json file might then look like this:

[
  [
    {
      "key": "someExampleKey",
      "value": "a really happy value"
    },
    {
      "key": "aVeryUsefulKey",
      "value": "another incredibly happy string value"
    }
  ],
  [
    {
      "key": "someExampleKey",
      "value": "this is some new value defined at some other time"
    },
    {
      "key": "aVeryUsefulKey",
      "value": "woohoo the newer value, the better"
    }
  ]
]

The keys must be the same as those defined in the DBconfig.json. We allow multiple sets of the same keys per file. One example where this might prove useful is when the metadata is about who worked on the annotations:

[
  [
    {
      "key": "annotatorName",
      "value": "Markus"
    },
    {
      "key": "annotationTime",
      "value": "November 2015"
    }
  ],
  [
    {
      "key": "annotatorName",
      "value": "Raphael"
    },
    {
      "key": "annotationTime",
      "value": "January 2016"
    }
  ]
]
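The rule that every key must be declared in the DBconfig.json could be checked along these lines. This is a sketch only, assuming the file layouts shown above; `undefined_keys` is a hypothetical helper, it checks key names but not the declared `type`, and the misspelled key in the second set is there purely to trigger the check:

```python
import json

# Inline stand-ins for a DBconfig.json and a metadata.json file.
DBCONFIG = json.loads("""
{
  "metadataDefinitions": [
    {"name": "annotatorName", "type": "STRING"},
    {"name": "annotationTime", "type": "STRING"}
  ]
}
""")

METADATA = json.loads("""
[
  [
    {"key": "annotatorName", "value": "Markus"},
    {"key": "annotationTime", "value": "November 2015"}
  ],
  [
    {"key": "annotatorName", "value": "Raphael"},
    {"key": "annotatorname", "value": "misspelled key"}
  ]
]
""")


def undefined_keys(dbconfig, metadata):
    """Return (set_index, key) pairs for keys missing from metadataDefinitions."""
    defined = {d["name"] for d in dbconfig.get("metadataDefinitions", [])}
    problems = []
    for index, key_value_set in enumerate(metadata):
        for entry in key_value_set:
            if entry["key"] not in defined:
                problems.append((index, entry["key"]))
    return problems


problems = undefined_keys(DBCONFIG, METADATA)
```

Reporting the index of the offending set makes it easier to locate the entry when a file holds several sets of the same keys.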

Queries

.... How will these metadata be queried? ...

trblslp commented 4 years ago

Is this feature still on the horizon? I think a unified approach to storing and querying metadata within the emuR environment is sorely needed. Referring to metadata such as speaker gender, date of birth, L1, or any number of other variables is a normal part of phonetic analysis, but at present that information must be either made explicit in bundle names or housed in some separate file which links bundle names to metadata. I think the most logical place to keep metadata is with the files they describe.

Is there any reason why metadata could not just be a part of the existing .json files, rather than having a separate one for the metadata only? From what I understand, the functions which currently read the JSON files just skip over everything other than what they're looking for. Moreover, to my mind it would make sense to the user to follow the same concept as that used for levels and legal labels: you define at the top of the hierarchy what the metadata keys and their possible values are, and then each _annot.json specifies that bundle's values for each of the keys. It should then be possible to have an option in the query() call, like include_metadata = T, which simply passes these variables as columns in the resulting seglist (i.e. tibble), and reads the value for each row from the metadata in the bundle's _annot.json.

A further useful implementation I can think of is being able to use different formant tracking settings according to speaker gender as defined in the json metadata, or being able to restrict a query to bundles which meet certain metadata criteria.

FredrikKarlssonSpeech commented 4 years ago

Ok, I should say that I have an, I think, working solution for this. I need to complete the tests though, and ran out of time. I will try to submit the pull request before the end of this month.

I am not sure that I agree with you @trblslp that including metadata in the transcriptions is a good idea. You can already do that in Emu. I guess you could just insert an ITEM in the transcriptions of each recording, on a tier not used for anything else, and make sure that the tier is, for instance, the parent of the top level tier of your transcriptions. The reason I think this is not usually what you want is that you would likely want the data in a long format later on, with one or many grouping factors for plotting and testing purposes, e.g. F1 and F2 values in columns, and then the age, gender and regional variety of the speaker encoded in separate columns. You can't get that from a query command.

My new code, however, adds the ability to get metadata from JSON files at the session or bundle level (bundle metadata will take precedence over session metadata), and the ability to manually edit data and add new information via an Excel file template export. The idea is then that this metadata is merged with the contents of a segmentlist or trackdata tibble, and you then have the tibble you will need later on when analysing or plotting the data.

Like I said, I will try to complete the tests soon.

trblslp commented 4 years ago

I guess you could just insert an ITEM in the transcriptions of each recording and on a tier not used for anything else, and make sure that the tier is for instance the parent of the top level tier of your transcriptions.

I don't think that really cuts it though. The idea of using a kind of dummy tier for storing metadata strikes me as essentially a workaround for the absence of a dedicated metadata storage feature. For new users (like me), the appeal of emu is that the information stored in levels and their annotations is linguistically meaningful, rather than meaningful only for the purposes of smoothing the data querying process --- as one finds in a lot of Praat annotations that contain multiple tiers of redundant information like 'syll1', 'syll2' and so on. What I had in mind was not including metadata as an annotation, but merely in the *_annot.json file. You shouldn't have to requery every time you want to access the recording's metadata --- it should be available whenever a query is returned, because, as you say, it is important to just about any analysis of those results.

The idea is then that this metadata is merged with the contents of a segmentlist or trackdata tibble and you then have the tibble you will need later on when analysing or plotting the data.

Great, that's exactly what I was trying to describe; let's hope it works!

raphywink commented 4 years ago

Just as a pointer: https://ips-lmu.github.io/The-EMU-SDMS-Manual/chap-annot-struct-mod.html#metadata-strategy-using-single-bundle-root-nodes but I am aware that that is far from optimal. @FredrikKarlssonSpeech looking 4ward to the pull request ;-).

FredrikKarlssonSpeech commented 4 years ago

Just one note. I had not seen this feature discussion (sorry) and one thing that strikes me is that I have not considered the possibility of adding metadata at the database level. I have implemented session and bundle level metadata (where bundle data overwrites defaults set for the session).

I am maybe not imaginative enough, but I don't see the use case for database metadata that you still need to use in your data processing... Please help me out here. @raphywink @MJochim

raphywink commented 4 years ago

@database metadata: not on a processing level (at least I can't think of anything except for maybe multi-emuDB processing cases) but more from a documentation standpoint: where was it recorded, what was recorded, etc.

trblslp commented 4 years ago

@FredrikKarlssonSpeech I think it might be useful to have a kind of declaration at the database level so that all metadata in the bundles must conform to the same format (just as levels referred to in bundles must conform to the level structure of the database). Is that what you currently have at the session level? In terms of actual metadata pertaining to a database maybe the db creator and their contact details, a web address for the project it is part of, information about permissions?

FredrikKarlssonSpeech commented 4 years ago

@raphywink Yes, perhaps it could be useful for setting default values. In a database of Swedish dialects collected 95% in Sweden and 5% in Finland, it would be helpful if you only had to specify the country for sessions that deviate from the default (Sweden).

@trblslp In a way, but no. :-) A strict enforcement of metadata I feel is very likely to just be a hindrance. But I think my code will have the same effect anyway, from how you edit the data manually.

"In terms of actual metadata pertaining to a database maybe the db creator and their contact details, a web address for the project it is part of, information about permissions?"

Yes, but this is maybe not information I need to get for each segment when I call "get_metadata" on a segmentlist? In this way, database level metadata of that kind is different from session or bundle metadata. If, for instance, I have all recordings of a certain individual placed in a session, and each individual recording in a separate bundle, then a complete description of a segment or trackdata that I extract (which may actually be of use in the analysis of the data) combines the information about the recording session (bundle) with the information describing the person (session, such as gender and regional variety, if that is not changing). Data about the web address of the database should perhaps not be included then, and should perhaps come into the structure in a different way. I have to think about this.

MJochim commented 4 years ago

I’m glad this discussion has been revived and am also looking forward to your pull request @FredrikKarlssonSpeech. Although there already seems to be consensus that metadata on the database level make sense, I still want to throw in my two cents:

@ database metadata: not on a processing level (at least I can't think of anything except for maybe multi-emuDB processing cases)

This. Having worked in a close collaboration with two other research institutes, my colleagues and I decided to create one emuDB per variety we recorded, all of them with the same annotation structure. It was the easiest to handle. We therefore needed a way of multi-emuDB processing. For many analyses, we extracted one data frame from every database and merged these data frames. I’m really not confident I’d recommend this scheme again, but anyway it’s what we did. Multi-emuDB processing is a thing ;-).

MJochim commented 4 years ago

I think it might be useful to have a kind of declaration at the database level so that all metadata in the bundles must conform to the same format (just as levels referred to in bundles must conform to the level structure of the database)

A strict enforcement of metadata I feel is very likely to just be a hindrance.

Please take care here. One thing that sets analysing emu-annotated data apart from analysing praat-annotated data is the very fact that the annotation structure is enforced. In Emu-SDMS, you have to try very hard to produce tier/level names of "Syll", "SYll", and "Syllable" in the same database, while it happens all the time in Praat (I'm obviously biased so do take this with a grain of salt. However it’s true ;-)).

Now especially when we plan to use the metadata as variables in our segment lists/data frames (which we do), it is vital that they be named consistently and that we know which columns we can expect our data frames to have. I would really want to see a metadata definition scheme similar to the existing level definitions – like @trblslp points out and also I myself did in the original posting.

FredrikKarlssonSpeech commented 4 years ago

A strict enforcement of metadata I feel is very likely to just be a hindrance.

Please take care here. One thing that sets analysing emu-annotated data apart from analysing praat-annotated data is the very fact that the annotation structure is enforced. In Emu-SDMS, you have to try very hard to produce tier/level names of "Syll", "SYll", and "Syllable" in the same database.

Yes, I agree. However, keep an open mind please when it comes to how one may make sure that you don't get synonym metadata fields.

while it happens all the time in Praat (I'm obviously biased so do take this with a grain of salt. However it’s true ;-)).

Well, yes, in vanilla Praat. In any serious work with lots of files you set up a script that constructs the textgrid object for you when you start transcribing, and then there is no issue. But this is of course not a defence of Praat and also beside the point :-)

Now especially when we plan to use the metadata as variables in our segment lists/data frames (which we do),

This confuses me. Are you planning to have metadata queries? I like new features, but I cannot see that it's worth the time to implement it to be honest. Once I have a way of adding metadata to bundles and sessions (and possibly the database) then I will be fine in any analysis that I have done so far. The workflow will then be

1) query to get segment lists
2) append metadata columns to the tibble or segmentlist
3) filter out segments I don't need based on the metadata.
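The three steps can be sketched in Python with plain dicts standing in for the tibble. Everything here is an assumption made for illustration: the rows only loosely mimic a query result, `append_metadata` is a hypothetical helper rather than an emuR function, and the metadata values are invented:

```python
def append_metadata(segments, session_meta, bundle_meta):
    """Return copies of the segment rows with metadata columns added.

    Bundle metadata overrides session metadata for the same key.
    """
    augmented = []
    for row in segments:
        merged = dict(session_meta.get(row["session"], {}))
        merged.update(bundle_meta.get((row["session"], row["bundle"]), {}))
        # Metadata keys become extra columns next to the segment columns.
        augmented.append({**row, **merged})
    return augmented


# Step 1: stand-in for the segment list returned by a query.
segments = [
    {"session": "0000", "bundle": "rec1", "labels": "a", "start": 0.10, "end": 0.25},
    {"session": "0000", "bundle": "rec2", "labels": "a", "start": 0.40, "end": 0.52},
    {"session": "0001", "bundle": "rec1", "labels": "a", "start": 0.05, "end": 0.19},
]

# Metadata keyed by session, and by (session, bundle) for bundle-level values.
session_meta = {
    "0000": {"gender": "female", "variety": "north"},
    "0001": {"gender": "male", "variety": "south"},
}
bundle_meta = {("0000", "rec2"): {"variety": "central"}}

# Step 2: append metadata columns.
augmented = append_metadata(segments, session_meta, bundle_meta)

# Step 3: filter on a metadata column.
female_only = [row for row in augmented if row.get("gender") == "female"]
```

Because the metadata arrives as ordinary columns, step 3 needs no dedicated metadata query syntax; any filtering tool that works on the segment list works here too.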

What would be useful though is the ability to use a column in the now augmented segment list to steer the trackdata processing from then on. Like @trblslp mentioned above, let the pitch or formant tracking algorithms have different parameter sets depending on whether the speaker is male or female or not, for instance. This would be very useful.
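One way to picture this: look up a parameter set per row from the metadata column and fall back to defaults when the value is missing or unknown. The sketch below is hypothetical throughout; the parameter names and numbers are placeholders, not settings of any real tracking function:

```python
# Placeholder parameter sets; names and values are illustrative only.
DEFAULT_PARAMS = {"nominal_f1_hz": 500, "max_f0_hz": 400}
PARAMS_BY_GENDER = {
    "female": {"nominal_f1_hz": 560, "max_f0_hz": 500},
    "male": {"nominal_f1_hz": 500, "max_f0_hz": 300},
}


def params_for(row):
    """Pick the parameter set for one augmented segment row."""
    chosen = PARAMS_BY_GENDER.get(row.get("gender"), DEFAULT_PARAMS)
    # Return a copy so callers cannot modify the shared tables.
    return dict(chosen)


rows = [
    {"bundle": "rec1", "gender": "female"},
    {"bundle": "rec2", "gender": "male"},
    {"bundle": "rec3"},  # no gender recorded: defaults apply
]
chosen = [params_for(row) for row in rows]
```

Keeping the mapping in a single table makes the fallback behaviour explicit for bundles whose metadata is incomplete.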

it is vital that they be named consistently and that we know which columns we can expect our data frames to have. I would really want to see a metadata definition scheme similar to the existing level definitions – like @trblslp points out and also I myself did in the original posting.

Yes, but unlike transcription levels, metadata needs to be just a bit more flexible, as the information that needs to be added to a bundle is not always known when you start your project, and may sometimes involve just a single file. You may have a patient who receives an operation that you need to keep in mind from a specific recording session (bundle then), or the medication may differ in some way, the speaker may have a cold in that particular session, or, yes, many things like that. Ok, you may then append the newly required kind of metadata to the definitions that you set up for metadata in your database, but anything you do in your database definition also increases the risk of messing up your database. Adding metadata then goes from simple to too difficult, and I suspect that many will instead just have a "misc" metadata field that makes it all pretty useless.

I know that I am speaking in the abstract when I say that I think I have a flexible but rigid workflow for metadata in place in my code. Just keep an open mind so that we may have a good discussion later.

MJochim commented 4 years ago

But this is of course not a defence of Praat and also beside the point :-)

Yes let's keep that away from here :-).

Now especially when we plan to use the metadata as variables in our segment lists/data frames (which we do),

This confuses me. Are you planning to have metadata queries?

Sorry for the confusion, I am only referring to what is being discussed in this thread, particularly this:

1. query to get segment lists

2. append metadata columns to the tibble or segmentlist

3. filter out segments I don't need based on the metadata.

In the end we will have metadata columns. No matter where exactly they came from (i.e. how your code generates them), we want them to be consistent and not have “language“ for some observations and “lang” for others.

What would be useful though is the ability to use a column in the now augmented segment list to steer the trackdata processing from then on. Like @trblslp mentioned above, let the pitch or formant tracking algorithms have different parameter sets depending on whether the speaker is male or female or not, for instance. This would be very useful.

I absolutely agree.

it is vital that they be named consistently and that we know which columns we can expect our data frames to have. I would really want to see a metadata definition scheme similar to the existing level definitions – like @trblslp points out and also I myself did in the original posting.

Yes, but unlike transcription levels, metadata needs to be just a bit more flexible as information that needs to be added to a bundle is not always known when you start your project, and may just involve a single file sometimes. You may have a patient that receives an operation that you need to keep in mind from a specific recording session [...]

I also agree here.

I know that I am speaking in the abstract when I say that I think I have a flexible but rigid workflow for metadata in place in my code. Just keep an open mind so that we may have a good discussion later.

Will do. I think we are indeed very much on the same page in that we want metadata structured. I guess I mistook your "A strict enforcement of metadata I feel is very likely to just be a hindrance."

We still might have a different take on how to allow for flexibility, or how much flexibility to allow. But this is speculation since yes:

I know that I am speaking in the abstract when I say that I think I have a flexible but rigid workflow for metadata in place in my code. Just keep an open mind so that we may have a good discussion later.

I want the metadata to be structured in some way, but beyond that I don’t want to be dogmatic. I will be very glad to discuss your approach and hopefully see it included in emuR.

raphywink commented 4 years ago

Just FYI me and @MJochim originally thought that metadata would only be used as a query filter (similar to bundlePattern) i.e. a search space reduction for the query. Having a new function like the get_metadata() you mentioned is a new concept... will have to think about it a bit more.

raphywink commented 4 years ago

I have to admit I just laughed out loud at

In any serious work with lots of files you set up a script that constructs the textgrid

In my 10 years or so of doing this, I don't think I have ever come across a single largish textgrid collection that was "clean". So yes, in theory that is correct and that is the way I would do it... but in my experience no one seems to be doing it.

FredrikKarlssonSpeech commented 4 years ago

:-) Ok, well - labels are a different matter. But consistent textgrid tier names and properties, and also naming of files, are certainly doable. But, again, this is not the discussion here. I was just making the point that I, personally, believe that there are differences in how you would want to work with labels / tiers and metadata, and employing the same strategies for both may not be the way forward.
