Closed: nikopartanen closed this issue 6 years ago
@NPoe could you maybe have a look at this?
Yes, with language="und" you have to define your own orthography->phonology mapping file. It should correspond to these specifications: http://www.bas.uni-muenchen.de/Bas/readme_g2p_mappingTable.txt
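For illustration only — the entries below are hypothetical Komi examples, and the exact file layout should be checked against the readme linked above (it describes a plain-text table with an orthographic symbol and its SAMPA counterpart per line):

```
а	a
т	t
ӧ	@
```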
Once you've done that, you upload it in the params list as follows:
library(RCurl) # you might need to install RCurl first

runBASwebservice_g2pForTokenization(handle, "transcription", language="und",
  orthoAttributeDefinitionName="orthography",
  params=list(imap=RCurl::fileUpload("../test.map")))

runBASwebservice_g2pForPronunciation(handle, "orthography", language="und",
  canoAttributeDefinitionName="cano",
  params=list(imap=RCurl::fileUpload("../test.map")))
I've just tested it with a dummy mapping, but I don't think this method has been tested with a real mapping file before. So definitely let us know how it goes.
Alternatively, you can also do the tokenization step with a known language such as Russian, and then do the pronunciation step with your mapping file:
runBASwebservice_g2pForTokenization(handle, "transcription", language="rus-RU",
  orthoAttributeDefinitionName="orthography")

runBASwebservice_g2pForPronunciation(handle, "orthography", language="und",
  canoAttributeDefinitionName="cano",
  params=list(imap=RCurl::fileUpload("../test.map")))
Thanks for the help! This seems to work very well. I ran it with the following code:
runBASwebservice_g2pForTokenization(handle = dbHandle,
transcriptionAttributeDefinitionName = 'orth', language = 'und',
orthoAttributeDefinitionName = 'ORT', resume = FALSE,
verbose = TRUE, params=list(imap=RCurl::fileUpload("kpv-sampa.txt")))
# This also works, don't know if there is a difference
# runBASwebservice_g2pForTokenization(handle = dbHandle,
# transcriptionAttributeDefinitionName = 'orth', language = 'rus-RU',
# orthoAttributeDefinitionName = 'ORT', resume = FALSE,
# verbose = TRUE)
runBASwebservice_g2pForPronunciation(handle = dbHandle,
orthoAttributeDefinitionName = 'ORT',
language = 'und',
canoAttributeDefinitionName = 'KAN',
params = list(embed = 'maus', imap=RCurl::fileUpload("kpv-sampa.txt")),
resume = FALSE,
verbose = TRUE)
Up to this point everything works well. The main problem was that the orthography->phonology mapping didn't work very well when it was passed through Russian, as it took into account many phenomena, such as vowel reduction, that don't occur in Komi. The last step, getting phoneme-level alignment, however, didn't work without specifying a language:
runBASwebservice_maus(handle = dbHandle,
canoAttributeDefinitionName = 'KAN',
language = 'rus-RU',
mausAttributeDefinitionName = 'MAUS',
chunkLevel = NULL,
turnChunkLevelIntoItemLevel = TRUE,
perspective = 'default',
resume = FALSE,
verbose = TRUE)
Specifying Russian at this stage is in principle OK; the result already looks quite workable. One problem I have is that Komi has a central unrounded vowel, which I would usually mark with ə / @. It is written with ӧ in Komi orthography, but I can't map it to @ in SAMPA, as the code above gives an error if there are phonemes not present in the Russian mapping. I was thinking I could come up with some way to map it so that it stays distinct but is passed as some other vowel, for example e. This is the error message:
INFO: Sending ping to webservices provider.
INFO: Running MAUS on emuDB containing 10 bundle(s)...
|=========== | 10%
Error in bas_download(res, maufile, session, bundle) :
Unsuccessful webservice call in bundle kpv_izva19570000-290_3bz-09, session 0000:
<WebServiceResponseLink><success>false</success><downloadLink></downloadLink>
<output>ERROR (class java.lang.Exception): MAUS execution did not exit properly and exited
with message:ERROR maus : something went wrong while reading the BPF input, probably
it contains a symbol that is not defined for this language - exiting: ERROR: unknown phoneme
(E) in E k m 1 s s_j o</output><warnings></warnings></WebServiceResponseLink>
I'm not sure from your description whether you have already tried the language-independent call runBASwebservice_maus(language="sampa", ...). If so, what error message did that result in?
If you have to go via Russian, you will indeed have to map any non-Russian phonemes to Russian ones beforehand. In your error message, MAUS was saying that E is not in its Russian inventory. But I would first try to use language="sampa" if you haven't already. (And if you have and there was an error, please let us know also.)
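Concretely, if you go via Russian, the stand-in idea from the earlier message can be implemented in the mapping file itself: instead of mapping ӧ to @ (which the Russian inventory rejects), map it to a vowel that Russian MAUS does know (e here is just one hypothetical choice):

```
ӧ	e
```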
I tried that, but I got an error and didn't test further. However, I had made a mistake there: I also tried to specify the imap parameter at the same time, which this step doesn't seem to need. I have now tried it with only 'sampa' as the language, and that works. The results seem to be quite different depending on whether the language is specified as 'rus-RU' or 'sampa'; I'll look into it more deeply in the coming days. Thanks a lot for the help!
Thanks for the help, this clarified a lot! Is it possible to see the mapping files for other languages, such as Russian, somewhere online? Or a list of accepted SAMPA symbols?
Sure! I think the easiest way is to use this URL:
https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetInventar?LANGUAGE=rus-RU
(replace rus-RU with the language you are interested in). All phonemes in the first column (MAUS) are part of that language's SAMPA inventory.
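If you want to check a mapping against an inventory programmatically, something like the following sketch should work; it assumes the endpoint returns a tab-separated table whose first column ("MAUS") holds the SAMPA symbols, as described above:

```r
# Hypothetical helper, assuming a tab-separated response with the
# SAMPA symbols in the first ("MAUS") column.
get_maus_inventory <- function(language = "rus-RU") {
  url <- paste0("https://clarin.phonetik.uni-muenchen.de/BASWebServices/",
                "services/runMAUSGetInventar?LANGUAGE=", language)
  inv <- read.delim(url, stringsAsFactors = FALSE)
  inv[[1]]  # the SAMPA inventory
}

# e.g. check which of your mapped symbols Russian MAUS does not know:
# setdiff(c("@", "E", "s_j"), get_maus_inventory("rus-RU"))
```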
Thanks a lot! Wonderful! This issue can be closed, I got all the help I needed for now! :)
According to the description of the BAS Webservices, it seems that some tools can be run without specifying a language. If I try to run:
I get an error, which is of course very clear:
Then I read that with und one could provide a mapping, but I don't know if this is possible with emuR. Basically, I have transcriptions of Komi-Zyrian that are aligned at the utterance level, and I've been passing them to the web service with language="rus-RU", which works surprisingly well but has some issues. So I was wondering, since I can also convert the text directly to SAMPA or IPA, whether there is a way to do the initial tokenization step through a language-independent model. At least I would like to test whether this improves on what I currently get. So my question is simply whether it is possible with emuR to get from utterance-aligned transcriptions to phoneme-aligned level without passing the data through any of the already defined languages.