IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Proposal: Database DESCRIPTION file, and how it could be used #222

Closed FredrikKarlssonSpeech closed 3 years ago

FredrikKarlssonSpeech commented 4 years ago

Following the discussion of how to handle metadata (#130), I think metadata set at the database level needs to be separated into two categories:

1) Metadata of the same kind as metadata set at the bundle or session level. You could use this to set default values for the bundles (and sessions) in a database, for instance, or as general information to use later in your analysis of segmentlists or trackdata.

2) Information that describes the database itself. This is not really metadata, as it tells you nothing useful about an extracted segment. The URL of a project web site, for instance, is hardly a productive addition to the description of a production of an [s]. The same goes for information on how the database data may be used.

So once #130 is solved, you will have a good solution for 1) above, but not for 2).

I instead propose that

MJochim commented 4 years ago

I agree with your point about (1) being in #130 and (2) not. I do not yet have a clear vision of the specifics, though. In the past, our idea has been that researchers can create database documentation in any form and format they wish and stuff it into the database directory. emuR would just ignore any file that is not named _DBconfig.json, _emuDBcache.sqlite, or *_ses/.
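The ignore rule described above can be sketched roughly as follows. This is a Python illustration and not emuR's actual implementation: the glob patterns are an assumed reading of the three names mentioned in the comment, the example entry names are hypothetical, and the check looks at names only (it does not verify that a `*_ses` entry is really a directory).

```python
from fnmatch import fnmatch

# Assumed patterns, read off the comment above; not taken from emuR's source.
RECOGNIZED_PATTERNS = ("*_DBconfig.json", "*_emuDBcache.sqlite", "*_ses")


def is_recognized(name):
    """Return True if a top-level entry name matches one of the assumed patterns."""
    return any(fnmatch(name, pattern) for pattern in RECOGNIZED_PATTERNS)


def ignored_entries(names):
    """Return the entries that would be left alone, in their original order."""
    return [name for name in names if not is_recognized(name)]


# Hypothetical contents of a database directory.
entries = [
    "ae_DBconfig.json",
    "ae_emuDBcache.sqlite",
    "0000_ses",
    "README.md",
    "DESCRIPTION",
]
print(ignored_entries(entries))
```

Under this reading, a `DESCRIPTION` or `README` file placed in the database directory would simply be skipped unless support for it is added explicitly.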

Could we define exactly what kind of information goes in DESCRIPTION and _DBconfig.json, respectively? Should the two really be distinct files, or do we just need schema extensions for the db config? Do we also want to recommend that users create a README file, like GitHub does?

  • that important information extracted from the DESCRIPTION file, such as how the data may be used, is added to the information included in segmentlist and trackdata objects, as well as to the attributes of tibbles that capture the same data. That way, the information on how the data may be used is available to the user as part of the extracted data.

This blurs the line between (1) and (2). However, I imagine there might be some merit to it. Can you come up with specific examples? Are you thinking of licensing issues? I am not convinced licensing should be a part of each and every observation.

  • that print methods for the segmentlist and trackdata classes should be expanded so that they also show how the data may be used, and information of a similar nature, whenever you look at some extracted data. Perhaps some truncation has to be put in place here in case someone includes a full license text in the DESCRIPTION file? Or perhaps a reminder to read the DESCRIPTION file in that case?

I think it fits better in a summary(emuDBhandle) call (it just occurred to me that print.emuDBhandle should output the same as summary @raphywink).

I think we need more examples: examples of which fields would be mandatory, and of what kind of optional fields people might introduce. I am afraid it is very hard to reach consensus here.
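To make the question of fields more concrete, here is a minimal Python sketch of what a DESCRIPTION file and a truncated summary of it might look like. Everything in it is an assumption for illustration: the "Key: value" layout with indented continuation lines, the field names (`Title`, `URL`, `License`, `UsageTerms`), and the helper functions are hypothetical and are not part of emuR or of any agreed format.

```python
import textwrap

# Hypothetical DESCRIPTION content; field names are illustrative only.
SAMPLE_DESCRIPTION = """\
Title: Example speech database
URL: https://example.org/project
License: CC BY-NC 4.0
UsageTerms: Non-commercial research use only. Cite the project
    web site in any publication that uses these recordings.
"""


def parse_description(text):
    """Parse 'Key: value' lines; indented lines continue the previous field."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace() and key is not None:
            fields[key] += " " + line.strip()
        elif ":" in line:
            # Split on the first colon only, so URLs in values stay intact.
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def summarize(fields, width=60):
    """Return one display line per field, shortening values longer than width."""
    placeholder = " [...] see DESCRIPTION"
    return [
        f"{key}: {textwrap.shorten(value, width=width, placeholder=placeholder)}"
        for key, value in fields.items()
    ]


for line in summarize(parse_description(SAMPLE_DESCRIPTION)):
    print(line)
```

Short fields are printed unchanged, while a long usage statement or a full license text is cut off with a pointer back to the file, which is one way to handle the truncation concern raised above.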

raphywink commented 3 years ago

As this is more of an add-on to #130 + #231, I'll close it, as it is cross-referenced anyway. I hope I'll get around to looking at the metadata issue soonish... but can't promise anything.