IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Language independent web services with transcription in IPA or SAMPA #162

Closed nikopartanen closed 6 years ago

nikopartanen commented 6 years ago

According to the description of BAS Webservices, it seems that some tools can be run without specifying the language. If I try to run:

runBASwebservice_g2pForTokenization(handle = dbHandle,
  transcriptionAttributeDefinitionName = 'orth', language = 'sampa',
  orthoAttributeDefinitionName = 'ORT', resume = FALSE, verbose = TRUE)

I get an error, which is of course very clear:

Error in bas_download(res, g2pfile, session, bundle) : 
  Unsuccessful webservice call in bundle kpv_izva19570000-290_3bz-05, session 0000: 
<WebServiceResponseLink><success>false</success><downloadLink></downloadLink>
<output>ERROR (class de.bas.exception.InvalidClosedVocabularyOptionException): 
Could not execute the format conversion wrapper, error message: 
The value sampa of the type lng (Language) is not in the closed vocabulary 
[cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, sqi-SQ, aus-AU, eus-ES, eus-FR, cat-ES, nld-NL, eng-US, eng-AU, eng-GB, eng-NZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, ita-IT, mlt-MT, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, und]! 
For valid options please check https://clarin.phonetik.uni-muenchen.de/BASWebServices/BAS_Webservices.cmdi.xml Aborting!</output><warnings></warnings></WebServiceResponseLink>

Then I read that with und one can provide a mapping, but I don't know if this is possible with emuR. Basically I have transcriptions of Komi-Zyrian which are aligned at the utterance level, and so far I've been passing them to the web service with language rus-RU, which works surprisingly well but has some issues. Since I can also convert the text directly to SAMPA or IPA, I was wondering whether there is a way to do the initial tokenization step through a language-independent model. At the very least I would like to test whether this improves on what I get now.

So my question is just whether it is possible with emuR to get from utterance-aligned transcriptions to a phoneme-aligned level without passing the data through any of the already defined languages.

raphywink commented 6 years ago

@NPoe could you maybe have a look at this?

NPoe commented 6 years ago

Yes, with language="und" you have to define your own orthography->phonology mapping file. It should correspond to these specifications: http://www.bas.uni-muenchen.de/Bas/readme_g2p_mappingTable.txt

Once you've done that, you upload it in the params list as follows:

library(RCurl) # you might need to install RCurl first
runBASwebservice_g2pForTokenization(handle, "transcription", language="und",
  orthoAttributeDefinitionName="orthography",
  params=list(imap=RCurl::fileUpload("../test.map")))
runBASwebservice_g2pForPronunciation(handle, "orthography", language="und",
  canoAttributeDefinitionName="cano",
  params=list(imap=RCurl::fileUpload("../test.map")))

I've just tested it with a dummy mapping, but I don't think this method has been tested with a real mapping file before. So definitely let us know how it goes.

Alternatively, you can also do the tokenization step with a known language such as Russian, and then do the pronunciation step with your mapping file:

runBASwebservice_g2pForTokenization(handle, "transcription", language="rus-RU",
  orthoAttributeDefinitionName="orthography")
runBASwebservice_g2pForPronunciation(handle, "orthography", language="und",
  canoAttributeDefinitionName="cano",
  params=list(imap=RCurl::fileUpload("../test.map")))

nikopartanen commented 6 years ago

Thanks for the help! This seems to work very well. I ran it now with the following code:

runBASwebservice_g2pForTokenization(handle = dbHandle,
  transcriptionAttributeDefinitionName = 'orth', language = 'und',
  orthoAttributeDefinitionName = 'ORT', resume = FALSE,
  verbose = TRUE, params=list(imap=RCurl::fileUpload("kpv-sampa.txt")))

# This also works, don't know if there is a difference
# runBASwebservice_g2pForTokenization(handle = dbHandle,
#                  transcriptionAttributeDefinitionName = 'orth', language = 'rus-RU',
#                 orthoAttributeDefinitionName = 'ORT', resume = FALSE,
#                 verbose = TRUE)

runBASwebservice_g2pForPronunciation(handle = dbHandle,
                  orthoAttributeDefinitionName = 'ORT',
                  language = 'und', 
                  canoAttributeDefinitionName = 'KAN', 
                  params = list(embed = 'maus', imap=RCurl::fileUpload("kpv-sampa.txt")), 
                  resume = FALSE, 
                  verbose = TRUE)

Up to this point everything works well. The main problem was that the orthography->phonology mapping didn't work very well when it was passed through Russian, as it took into account lots of phenomena, such as vowel reduction, that don't happen in Komi. The last step, getting the phoneme-level alignment, however, didn't work without specifying a language:

runBASwebservice_maus(handle = dbHandle,
                      canoAttributeDefinitionName = 'KAN',
                      language = 'rus-RU',
                      mausAttributeDefinitionName = 'MAUS',
                      chunkLevel = NULL,
                      turnChunkLevelIntoItemLevel = TRUE,
                      perspective = 'default',
                      resume = FALSE,
                      verbose = TRUE)

Specifying Russian at this stage is in principle OK; the result already looks pretty workable. One problem I have is that Komi has a central unrounded vowel, which I would usually mark with ə / @. It is written ӧ in Komi orthography, but I can't map it to @ in SAMPA, as the code above gives an error if there are phonemes not present in the Russian mapping. I was thinking I could come up with some way to map it so that it stays distinct but is passed on as just some other vowel, for example e. This is the error message:

INFO: Sending ping to webservices provider.
INFO: Running MAUS on emuDB containing 10 bundle(s)...
  |===========                                                                                                    |  10%
Error in bas_download(res, maufile, session, bundle) : 
  Unsuccessful webservice call in bundle kpv_izva19570000-290_3bz-09, session 0000:
<WebServiceResponseLink><success>false</success><downloadLink></downloadLink>
<output>ERROR (class java.lang.Exception): MAUS execution did not exit properly and exited
 with message:ERROR maus : something went wrong while reading the BPF input, probably
it contains a symbol that is not defined for this language - exiting: ERROR: unknown phoneme
 (E) in E k m 1 s s_j o</output><warnings></warnings></WebServiceResponseLink>

NPoe commented 6 years ago

I'm not sure from your description whether you have already tried the language-independent call runBASwebservice_maus(language="sampa", ...)? If so, what error message did that result in?

If you have to go via Russian, you will indeed have to map any non-Russian phonemes to Russian ones beforehand. In your error message, MAUS was saying that E is not in its Russian inventory. But I would first try to use language="sampa" if you haven't already. (And if you have and there was an error, please let us know also.)
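
For the "map beforehand" route, here is a minimal base R sketch of what such a pre-mapping could look like. Everything in it is an assumption made for illustration: the substitution table (@ to e, as suggested above), the inventory excerpt, and the helper names remap_phonemes and unknown_phonemes are hypothetical and not part of emuR or the BAS web services. It only shows the idea of replacing symbols in a space-separated SAMPA string and checking what is still missing from an inventory before MAUS is called.

```r
# Hypothetical substitution table (an assumption, not taken from BAS):
# symbol used in the Komi transcription -> stand-in symbol for rus-RU.
substitutions <- c("@" = "e")

# Hypothetical excerpt of an inventory; in practice, take the symbols from
# the inventory of the language that is passed to MAUS.
allowed <- c("e", "k", "m", "1", "s", "s_j", "o")

# Replace every symbol that has an entry in `table`; leave the rest as-is.
remap_phonemes <- function(transcription, table) {
  symbols <- strsplit(transcription, " ", fixed = TRUE)[[1]]
  hits <- symbols %in% names(table)
  symbols[hits] <- unname(table[symbols[hits]])
  paste(symbols, collapse = " ")
}

# Return the symbols of a transcription that are not in `inventory`.
unknown_phonemes <- function(transcription, inventory) {
  symbols <- strsplit(transcription, " ", fixed = TRUE)[[1]]
  setdiff(symbols, inventory)
}

remapped <- remap_phonemes("@ k m 1 s s_j o", substitutions)
unknown_phonemes(remapped, allowed)
```

Running unknown_phonemes over the canonical forms first would flag symbols like the E in the error message before the web service call fails on them.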

nikopartanen commented 6 years ago

I tried that, but then I got an error and didn't test further. I had made a mistake there, though: I also tried to specify the imap parameter at the same time, which this step doesn't seem to need. I have now tried it with only 'sampa' as the language. That works. The results seem to be quite different depending on whether the language is specified as 'rus-RU' or 'sampa'; I'll be looking into it more deeply in the coming days. Thanks a lot for the help!
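
For reference, a sketch of the call that is described as working here: the same arguments as the earlier runBASwebservice_maus call, with only language changed to 'sampa' and no imap parameter. The wrapper name run_maus_sampa is hypothetical and only there so that the snippet can be defined without an open database; dbHandle is assumed to be a handle to an emuDB that already has the 'KAN' attribute from the g2p steps.

```r
# Hypothetical wrapper around the MAUS call from this thread.
# Assumes emuR is installed and dbHandle comes from emuR::load_emuDB().
run_maus_sampa <- function(dbHandle) {
  emuR::runBASwebservice_maus(handle = dbHandle,
                              canoAttributeDefinitionName = 'KAN',
                              language = 'sampa',  # language-independent mode
                              mausAttributeDefinitionName = 'MAUS',
                              chunkLevel = NULL,
                              turnChunkLevelIntoItemLevel = TRUE,
                              perspective = 'default',
                              resume = FALSE,
                              verbose = TRUE)
}
```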

nikopartanen commented 6 years ago

Thanks for help, this clarified a lot! Is it possible to see the mapping files for other languages such as Russian somewhere online? Or a list of accepted SAMPA items?

NPoe commented 6 years ago

Sure! I think the easiest way is to use this URL:

https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetInventar?LANGUAGE=rus-RU

(replace rus-RU by the language that you are interested in). All phonemes in the first column (MAUS) are part of that language's SAMPA inventory.
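
A small sketch for building that URL for any language code in R. The helper name inventory_url is hypothetical; the base URL and the LANGUAGE parameter are taken from the link above. Downloading and parsing the response is left out, since its exact layout is not described in this thread.

```r
# Hypothetical helper: build the inventory URL for a given language code.
inventory_url <- function(language) {
  base <- "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetInventar"
  paste0(base, "?LANGUAGE=", utils::URLencode(language, reserved = TRUE))
}

url_rus <- inventory_url("rus-RU")
url_fin <- inventory_url("fin-FI")
# The listing could then be fetched with, for example, readLines(url_rus),
# which needs network access.
```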

nikopartanen commented 6 years ago

Thanks a lot! Wonderful! This issue can be closed, I got all the help I needed for now! :)