IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Problem with convert_TextGridCollection() #187

Closed colan101 closed 6 years ago

colan101 commented 6 years ago

I have a problem converting TextGrids from Praat into the emuDB format. Every time I run the following code on any of my TextGrids, I get an error message saying that the first two lines of the TextGrid file do not match and that only long form TextGrids are currently supported, even though my TextGrids are (as far as I know) long form TextGrids. Even more confusing, I don't get this error with the demo TextGrids that are included when installing emuR, even though they look exactly like mine (I included a screenshot of one of the demo TextGrids (on the left) and one of mine (on the right) for comparison). Is it possible that a different mistake results in the same error message?

[screenshot]

Code:

library("emuR")
convert_TextGridCollection("~/Desktop/Database", dbName = "NewDatabase", targetDir = "~/Desktop")

Error message:

Error in TextGridToBundleAnnotDFs(tgPath, sampleRate = 2000, name = "tmpBundleName", : First two lines of TextGrid file do not match: File type; and: Object class. Only long form TextGrids are currently supported. Problem file is: ~\Database/FrogBook1_1.TextGrid

raphywink commented 6 years ago

That is strange... the TextGrids look fine. Do you know the encoding of the TextGrid files and which line breaks they use? If you could provide us with an example TextGrid file, I could take a closer look at what is causing the issue... (you can also send it to my work email if you don't want to post it on GitHub). It is always a bit difficult to debug a screenshot ;-)

colan101 commented 6 years ago

Thank you for your answer! I sent you an email to your work email.

raphywink commented 6 years ago

Your TextGrid file is encoded in Big-endian UTF-16 Unicode, with CRLF line terminators. Unfortunately that is one of the defaults that Praat has and we currently only support UTF-8 long form TextGrid files (I'll update the error message to include this information). It is a fairly easy fix:

[screenshot, 2018-03-26]

Then you'll have to open and resave your TextGrids (if you have loads of them, it is probably worth writing a tiny Praat script to do this for you).

Hope this helps...
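For anyone with many files, the re-encoding step can also be done outside Praat. The following is a minimal base-R sketch, not part of emuR: `reencode_textgrid()` is a hypothetical helper name, and it assumes the input really is UTF-16 with a byte order mark, as reported above. It reads the raw bytes, converts them to UTF-8 with `iconv()`, and writes the result without letting R recode the text to the native encoding.

```r
# Hypothetical helper (not an emuR function): convert one UTF-16 TextGrid to UTF-8.
reencode_textgrid <- function(inPath, outPath, fromEncoding = "UTF-16") {
  # Read the file as raw bytes so that no implicit recoding takes place.
  rawBytes <- readBin(inPath, what = "raw", n = file.size(inPath))
  # iconv() accepts a list of raw vectors and returns a character vector.
  txt <- iconv(list(rawBytes), from = fromEncoding, to = "UTF-8")
  if (is.na(txt)) stop("Could not convert ", inPath, " from ", fromEncoding)
  # Drop a leftover byte order mark and normalise CRLF line endings to LF.
  txt <- sub("^\ufeff", "", txt)
  txt <- gsub("\r\n", "\n", txt, fixed = TRUE)
  # Write the UTF-8 bytes unchanged.
  writeBin(charToRaw(txt), outPath)
  invisible(outPath)
}
```

It could be applied to a whole folder by looping over `list.files(dir, pattern = "\\.TextGrid$", full.names = TRUE)`; writing to a separate output folder keeps the originals until the converted files have been checked.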

colan101 commented 6 years ago

Thanks a lot! It is working now!

raphywink commented 6 years ago

Glad it worked! 👍

colan101 commented 6 years ago

I ran into another encoding problem while converting the TextGrids. Some special characters in my TextGrids are not converted correctly. For example, every "ú" in my TextGrid was replaced with "Ãº" in the JSON file. According to this page, http://www.i18nqa.com/debug/utf8-debug.html#dbg, this "is being caused by UTF-8 bytes being interpreted as Windows-1252 (or ISO 8859-1) bytes". Could there be a problem in the TextGridToBundleAnnotDFs function?

colan101 commented 6 years ago

Update: The same happens with IPA symbols. I inserted an "ʃ" and it was replaced with "Êƒ".
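The effect described in the quoted page can be reproduced in isolation. This is a small base-R sketch, independent of emuR, that takes the UTF-8 bytes of the two characters mentioned here and decodes them as Windows-1252; `misread_as_cp1252()` is a hypothetical helper name, and the encoding name "CP1252" is assumed to be known to the platform's iconv.

```r
# The two characters reported in this thread, written as Unicode escapes.
original <- c("\u00fa", "\u0283")  # "ú" and "ʃ"

# Hypothetical helper: decode the UTF-8 bytes of x as if they were Windows-1252.
misread_as_cp1252 <- function(x) {
  vapply(enc2utf8(x), function(s) {
    # charToRaw() exposes the UTF-8 bytes; iconv() then reinterprets them.
    iconv(list(charToRaw(s)), from = "CP1252", to = "UTF-8")
  }, character(1), USE.NAMES = FALSE)
}

garbled <- misread_as_cp1252(original)
print(garbled)
```

Each character turns into two because its UTF-8 form is two bytes long and Windows-1252 maps every byte to one character.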

raphywink commented 6 years ago

OK, that is strange. Are they correct in your UTF-8 TextGrid files?

colan101 commented 6 years ago

They are correct in the UTF-8 TextGrid files. They are only wrong in the JSON files (and therefore in the EMU web-app as well). I can change them in the EMU web-app, but as soon as I save the changes they are wrong again.

raphywink commented 6 years ago

Oh dear... firing up my VirtualBox instance of Windows 10 and will try to get this sorted. http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/ explains what the issue might be (fighting the urge to say something bad about Windows and R right now). Contemplating switching all read and write operations to readr.

colan101 commented 6 years ago

Thanks a lot for your work! I already guessed that this is a Windows problem.

raphywink commented 6 years ago

I just tried the following and it seems to work under Windows 10 and Windows 7:

# Parse a JSON string containing an IPA symbol, serialise it again, and write it without re-encoding
parsedJson = jsonlite::fromJSON('{"testIPA": "ʃ"}', simplifyVector = TRUE)
jsonStr = jsonlite::toJSON(parsedJson, auto_unbox = TRUE, force = TRUE, pretty = TRUE)
writeLines(jsonStr, "/Users/raphael/Desktop/bla.json", useBytes = TRUE)

This is how the serve() function writes the JSON file to disk (see here: https://github.com/IPS-LMU/emuR/blob/master/R/emuR-server.R#L426). Does this work on your system?

This doesn't have anything to do with the convert_TextGridCollection() function, by the way. The two functions handle things differently, so that could be a separate issue. I am just worried that the EMU-webApp save isn't working under Windows...

colan101 commented 6 years ago

The code works on my system, but it does not really solve my problem. If I understand it correctly, both the input and the output in that example are JSON. My main problem is that the input file I have is a TextGrid (where everything is displayed correctly), and the JSON file that I get after running "convert_TextGridCollection" encodes the special characters incorrectly. Additionally, if I replace the wrong special characters in the JSON file with the correct ones and open the EMU web-app, everything is displayed correctly until I use the save button. Then the newly saved JSON file encodes them incorrectly again.

raphywink commented 6 years ago

Good to hear that it works. I was just panicking because you wrote:

"I can change them in the Emu web-app, but as soon as I save the changes they are wrong again."

and I was worried the save operation was also somehow affected. These are two separate issues. By the way, I was just able to reproduce the problem with the TextGrid you sent me.

colan101 commented 6 years ago

Okay, I was just not sure if I explained the issue well enough.

raphywink commented 6 years ago

As of developer version 0.2.3.9020, emuR (install with devtools::install_github("IPS-LMU/emuR")) uses only the readr::read_* set of functions to read text files. This prevents R from recoding the text on read under Windows (7) in certain instances (everything stays UTF-8). Using this version, I was able to convert the TextGrid you sent me via email, and all the encoding seems to be intact!
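To illustrate the idea of reading with an explicit UTF-8 encoding, here is a minimal sketch. It is not the emuR implementation: `read_utf8_lines()` is a hypothetical helper, and the base R fallback is only there so that the example also runs where readr is not installed.

```r
# Hypothetical helper (not the emuR implementation): read a text file as UTF-8.
read_utf8_lines <- function(path) {
  if (requireNamespace("readr", quietly = TRUE)) {
    # readr decodes according to the locale's encoding (UTF-8 by default).
    readr::read_lines(path, locale = readr::locale(encoding = "UTF-8"))
  } else {
    # Base R fallback: mark the strings as UTF-8 instead of recoding them.
    readLines(path, encoding = "UTF-8", warn = FALSE)
  }
}
```

Either branch returns a character vector whose non-ASCII elements are marked as UTF-8, regardless of the native encoding of the session.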

Could you check this for me on your machine? Does it work with my fix?

colan101 commented 6 years ago

After some problems installing a package from GitHub (Windows and R...) that resulted in reinstalling R and RStudio completely, I finally managed to check whether it works on my machine.

First problem solved! Converting the TextGrid to JSON works now and all the encoding is correct!

However, the problem with saving in the EMU web-app remains. Everything is displayed correctly, but as soon as I save a file there, the JSON file has the wrong encoding again.

raphywink commented 6 years ago

Glad we got the convert_TextGridCollection() problem sorted! As the EMU-webApp problem is a different issue, could you open a new issue (feel free to reference this one, though)? Could you also include the output of devtools::session_info(), which browser you are using, and so on? Thanks