FredrikKarlssonSpeech closed this issue 6 years ago
Shouldn't convert_BPFCollection() be convert_TextGridCollection()? convert_BPFCollection() is for BPF collections (see https://www.phonetik.uni-muenchen.de/Bas/BasFormatsdeu.html#Partitur).
Yes. Sorry, no excuse except lack of attention to what should have been obvious.
I am very embarrassed now, so I want to change the subject to a related one regarding convert_TextGridCollection. I was actually looking for a function to add TextGrid + wave files to an existing database. But it seems you can't, right? At least all my attempts give me errors like this (and I am of course considering that I may be doing something silly again):
Error in convert_TextGridCollection("../speak_out/Data/", "ProfessionalsDB", :
The directory /Users/frkkan96/Documents/forskning/Parkinson/ddkrythm/TestDB/ProfessionalsDB_emuDB already exists. Can not generate new emuDB if directory called ProfessionalsDB already exists!
And a second point about the same function (which, considering the point above, is perhaps something to think about):
> convert_TextGridCollection("/Users/frkkan96/Box Sync/professional_DDK", "Professionals","TestDB")
Error in create_filePairList(dir, dir, audioExt, tgExt) :
Not all TextGrid files found for wav files found in /Users/frkkan96/Box Sync/professional_DDK and /Users/frkkan96/Box Sync/professional_DDK
The point is that, right now, you either have to be completely done marking up all files in Praat before you can import the files into emu, or do some manual sorting of files so that only transcribed files are in the import folder and then convert them (I think "importing" is a more natural way to think about this, as it is not done "in place", but I guess it depends on how you view the process), and then use import_mediaFiles() to get the remaining files into the database for markup?
Would it not be great if we could use convert_TextGridCollection to 1) add TextGrid + wav file pairs to an existing database, 2) allow specifying a session for the imported files, and 3) have the function gracefully add only the files for which it could find a matching TextGrid?
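In the meantime, the "only complete pairs" sorting can be scripted. Below is a minimal plain-R sketch (the helper name and extensions are my own, not part of emuR) that lists the wav files in a directory that have no matching TextGrid, so you can move them aside before running convert_TextGridCollection():

```r
# Hypothetical helper (plain R, not an emuR function): report wav files in a
# directory that lack a matching .TextGrid, so only complete pairs remain
# before running convert_TextGridCollection().
wavs_without_textgrids <- function(dir, audioExt = "wav", tgExt = "TextGrid") {
  wavs <- list.files(dir, pattern = paste0("\\.", audioExt, "$"))
  tgs  <- list.files(dir, pattern = paste0("\\.", tgExt, "$"))
  # compare base names without extensions
  missing <- setdiff(tools::file_path_sans_ext(wavs),
                     tools::file_path_sans_ext(tgs))
  paste0(missing, ".", audioExt)
}
```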
Again, sorry for the hurried "end of workday" post above.
Merging two emuDBs is really trivial: you simply move the session/bundle folders from one to the other. You just have to be really sure that they are configured in the same way (simply check the database configuration section of the summary(emuDBhandle) output). Hence, the way to add new files that you created in Praat is to convert the new TextGrids to a new temporary emuDB, configure it the same way as your existing emuDB, and then move the folders over (this is usually 4-5 R commands incl. copying).
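A sketch of those "4-5 commands" might look as follows; the paths and database names here are placeholders I made up, and the configuration step in the middle depends entirely on your existing emuDB's level/link definitions:

```r
# Sketch of the merge workflow described above (paths/names are assumptions).
library(emuR)

# 1) convert the new Praat files into a temporary emuDB
convert_TextGridCollection("/path/to/new/praat_files",
                           dbName = "tmpImport",
                           targetDir = tempdir())

# 2) configure tmpImport so its level and link definitions match the target
#    emuDB (e.g. via add_levelDefinition() / add_linkDefinition())

# 3) copy the session folders (they end in "_ses") into the existing emuDB
file.copy(list.files(file.path(tempdir(), "tmpImport_emuDB"),
                     pattern = "_ses$", full.names = TRUE),
          "/path/to/existing/TestDB_emuDB",
          recursive = TRUE)
```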
On the other hand, automatically guessing how a hierarchical structure like this (see B):
can be generated from this:
is extremely difficult (feel free to contribute code, though, if you have a simple solution to the problem ;-) )! The annotation structure modelling capabilities of EMU are very powerful, and inferring them from tier names by guessing "oh, this level has longer segments, so it is probably above this other level with shorter segments" and so on is really difficult to do. Even if you know the structure, the build sequence order is not at all trivial to guess.
Just FYI: if you would stick to EMU it would make things easier (import the new media files and then annotate them in the EMU-webApp).
Of course, and I think the point is that you should perhaps not be able to set a database structure that is different than the one you import into. The user should of course not be allowed to import TextGrid files that have tiers in them that are not in the database. I see if I can make the convert function do what I want to and make a contribution.
"Just FYI: if you would stick to EMU it would make things easier (import the new media files and then annotate them in the EMU-webApp)."
I can't just now, actually. Some of the stuff that was in emu before, and that you can now do in Praat but not in the new version of emu (like scripted transcriptions), I really need. And for me, the design where links must always be explicit, with no option to have them derived from time by default, is actually a hindrance, not a feature. The case where I would need two levels not linked by time has simply not come up yet in my current work. And in Praat you get the ability to do rapid transcription on two levels simultaneously using just a single keypress, and things like that. So it is just much quicker to work in Praat at the moment. Nothing wrong with the emu editor, it is wonderful in other ways, but when you have lots of data, speed of markup matters, as does the ability to do automated draft markup.
@praat features vs. emu: that is fine (btw. scripted annotations are pretty high on our todo list), but then you'll just have to deal with the pain of switching back and forth (also fine). @explicit vs. implicit: I think we have to agree to disagree on that one. We at the IPS are very strong proponents of explicitly defining relationships (in the case of timeless annotations this is a must anyway).
Yes, I am aware, and I realize that it is a valid view if you have movement data, for instance. But the majority of people who need a database system to handle their recordings would really benefit from a safe, verified DSL that gives correct results and gets you a pitch track for exactly the segments you found, without having to do what is sometimes the most difficult thing to get right in programming, namely going back and forth in an object by index, as a (perhaps) novice programmer. For these people, a simpler entry point to the system would really be of benefit. It now takes quite a few commands just to be able to query for a vowel that is simply in the middle of a word, using a query that starts with the word and then looks for vowels in it. Why should you have to link them, really?
Don't get me wrong. I don't want you to change the defaults, because I think making it easier to work with movement data is a worthwhile effort. But please just don't make it too hard to do the simplest investigations. Perhaps an option that you could set in the database config to always link by time for all levels? Or something like that..
Why should you have to link them really?
because explicit is better than implicit ;-). And I mean the difference we are talking about here is:
query(db, "[SomeLevel == X ^ otherLevel == Y]")
vs.
autobuild_linkFromTimes(db, superlevelName = "SomeLevel", sublevelName = "otherLevel")
query(db, "[SomeLevel == X ^ otherLevel == Y]")
so two commands vs. one (you could even remove those links post-query with remove_linkDefinition())
I guess it depends on what your data is. Creating links from time is not cheap if you do all files at once in a database.
Just a quick test below (9 files)
> load_emuDB("TestDB/Professionals_emuDB/") -> ddk
> list_levelDefinitions(ddk)
name type nrOfAttrDefs attrDefNames
1 Task SEGMENT 1 Task;
2 Syllable SEGMENT 1 Syllable;
3 CV SEGMENT 1 CV;
> add_linkDefinition(ddk,"Task","Syllable",type = "ONE_TO_MANY")
> add_linkDefinition(ddk,"Syllable","CV",type = "ONE_TO_MANY")
> query(ddk,"Syllable =~ .*") -> sylls
> summary(sylls)
segment list from database: Professionals
query was: Syllable =~ .*
with 5163 segments
> query(ddk,"CV =~ .*") -> segments
> summary(segments)
segment list from database: Professionals
query was: CV =~ .*
with 10225 segments
Ok, then on to making links for this small database:
> system.time(autobuild_linkFromTimes(ddk,"Task","Syllable"))
INFO: Rewriting 9 _annot.json files to file system...
|========================================================================================================================| 100%
   user  system elapsed
 22.304   0.256  23.150
Warning message:
call dbDisconnect() when finished working with a connection
> system.time(autobuild_linkFromTimes(ddk,"Syllable","CV"))
INFO: Rewriting 9 _annot.json files to file system...
|========================================================================================================================| 100%
   user  system elapsed
 22.572   0.235  23.065
So if 9 files take 23 seconds to compute, then I need two lunch breaks back to back just to compute the links for two level pairs. Sure, it seems possible, but it also seems that it would take an equal amount of time to compute links from times on the entire database again if you add a file, right? Do I also have to recompute all links if I add a segment?
I would agree that my databases may not be typical, but I still think it would be a great idea to have the ability to mark a link definition for "auto compute links by time", or in some other way make sure that once you insert a new segment that is completely dominated in time by another one, the two are linked, that is, usable in hierarchical queries.
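For what it's worth, the core "dominated in time" computation is conceptually simple. The following is a plain-R sketch of that idea (my own illustration, not emuR's actual implementation, which also has to maintain the _annot.json files and the SQLite cache):

```r
# Plain-R sketch (not the emuR API): link each child segment to the single
# parent segment whose [start, end] interval fully contains it.
link_by_time <- function(parents, children) {
  # parents/children: data.frames with columns id, start, end (times in ms)
  do.call(rbind, lapply(seq_len(nrow(children)), function(i) {
    hit <- which(parents$start <= children$start[i] &
                 parents$end   >= children$end[i])
    # only link when exactly one parent dominates the child
    if (length(hit) == 1)
      data.frame(parent_id = parents$id[hit], child_id = children$id[i])
  }))
}
```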
Yes, exactly, linking is very, very expensive! If I understand you correctly, you propose doing the same thing as autobuild_linkFromTimes() (finding which segment is part of which other segment on another level) at query time, sometimes over multiple levels, so the equivalent of multiple autobuild_linkFromTimes() calls. So every query would be as slow as autobuild_linkFromTimes() followed by a query() (minus the rewrite of the _annot.jsons). Also, you have to consider that querying is done way, way more frequently than building the annotation structure of a database (which is quite often done only once, before analysis).
Sure, it seems possible, but it also seems that it would take an equal amount of time to compute links from times on the entire database again if you add a file, right? Do I also have to recompute all links if I add a segment?
Would have to double-check on this, but I think currently you would have to delete and then rebuild the links. Maybe in the distant future we could consider an "add only missing links" feature (priority fairly low, as there is a very doable work-around).
I would agree that my databases may not be typical, but I still think it would be a great idea to have the ability to mark a link definition for "auto compute links by time", or in some other way make sure that once you insert a new segment that is completely dominated in time by another one, the two are linked, that is, usable in hierarchical queries.
That is something we could consider as a feature in the EMU-webApp: automatically adding links when inserting a new segment that is dominated by a segment on another level (once again fairly low priority, because doing an autobuild once you are done annotating the database does exactly this; and yes, you might have to wait a minute or two).
No, of course not at query time. But attached to the linkDef between two levels: when the transcription for a specific bundle is edited/updated, the links are rebuilt from time on that bundle (only!). This saves lots of time, as only the bundle that we know has been updated gets recomputed.
Well, I would not mind waiting for a procedure that I run just once at the end of transcription and am then done with. But waiting 25 minutes to recalculate links every time I find something that should be fixed is not efficient. The database system could be the tool that helps me find silly errors (e.g. double events so close together where there should be only one that you can't see it in the interface; obvious errors in the transcription labels due to key sequencing errors; things like that).
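That kind of error-hunting can already be scripted against a query result. As a hedged sketch (the 10 ms threshold is an arbitrary assumption of mine, and this assumes the segment list's start/end columns in milliseconds as returned by emuR's query()):

```r
# Sketch: flag suspiciously short segments in a segment list, which can
# indicate accidental double boundaries that are invisible in the interface.
# Assumes `ddk` is a loaded emuDB handle as in the session above.
sylls <- query(ddk, "Syllable =~ .*")

# start/end are in ms; 10 ms is an arbitrary threshold for "suspicious"
short <- sylls[sylls$end - sylls$start < 10, ]
short[, c("bundle", "labels", "start", "end")]
```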
Not sure what the doable workaround is now; I of course have only a partial idea of how you think about emu usage and what is on the roadmap for the system. Adding an "Insert and link" keystroke to the web app would be wonderful. It might even be quite frequently used.. ;-)
I just noticed an issue when importing a collection of textgrid and wave files:
but there are files in the directory:
Maybe this is related to the " " in the file path somehow, or to the regex? I did not see this issue in my earlier attempts.