ReadAlongs / Studio

Audiobook alignment for Indigenous languages
https://readalongs.github.io/Studio/

UnboundLocalError exception aligning UDHR fra #26

Closed: joanise closed this issue 4 years ago

joanise commented 4 years ago

Branches: Studio on dev.g2p, g2p on master, OpenSamples on master, all up to date as of now.

cd OpenSamples
readalongs align -i -s -f -l fra UDHR-Librivox/human_rights_un_frn-preamble.txt UDHR-Librivox/human_rights_un_frn_ezwa_64kb-preamble.mp3 output/UDHR-fra-preamble

outputs:

INFO - Server initialized for eventlet.
INFO - Words (<w>) not present; tokenizing
Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\readalongs-script.py", line 11, in <module>
    load_entry_point('readalongs', 'console_scripts', 'readalongs')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\studio\readalongs\cli.py", line 115, in align
    if kwargs['save_temps'] else None))
  File "c:\users\joanise\ras\studio\readalongs\align.py", line 109, in align_audio
    xml = convert_xml(xml)
  File "c:\users\joanise\ras\studio\readalongs\text\convert_xml.py", line 194, in convert_xml
    convert_words(xml_copy, word_unit, output_orthography)
  File "c:\users\joanise\ras\studio\readalongs\text\convert_xml.py", line 143, in convert_words
    all_indices = compose_tiers(indices)
  File "c:\users\joanise\ras\studio\readalongs\text\util.py", line 271, in compose_tiers
    reduced_indices = compose_indices(tiers[0], tiers[1])
  File "c:\users\joanise\ras\studio\readalongs\text\util.py", line 256, in compose_indices
    results.append((i1_in, highest_i2_found))
UnboundLocalError: local variable 'highest_i2_found' referenced before assignment

roedoejet commented 4 years ago

I think I fixed this here: 0aeedd5be38aed1e2b39da60a8135d0f65ad7813
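
For readers unfamiliar with this class of error, here is a minimal sketch of how an UnboundLocalError like the one in the traceback can arise. The actual body of compose_indices is not shown in this thread, so the functions below are hypothetical stand-ins, and the "safe" variant is only one possible remedy, not necessarily what the commit above does.

```python
def compose_pairs_buggy(i1, i2):
    """Hypothetical stand-in; NOT the real compose_indices from util.py."""
    results = []
    for i1_in, i1_out in i1:
        for i2_in, i2_out in i2:
            if i2_in <= i1_out:
                # Only bound if at least one pair in i2 matches.
                highest_i2_found = i2_out
        # Raises UnboundLocalError when nothing above ever matched.
        results.append((i1_in, highest_i2_found))
    return results


def compose_pairs_safe(i1, i2):
    """Same loop, but the variable is bound before it is ever read."""
    results = []
    highest_i2_found = 0  # assumed default for "nothing found yet"
    for i1_in, i1_out in i1:
        for i2_in, i2_out in i2:
            if i2_in <= i1_out:
                highest_i2_found = max(highest_i2_found, i2_out)
        results.append((i1_in, highest_i2_found))
    return results
```

With an empty second tier, the first function fails the same way the traceback does, while the second returns a result using the default.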

roedoejet commented 4 years ago

I ran the alignment, and it looks like a number of combining characters got through the g2p, such as \u0300 (combining grave) and \u0302 (combining circumflex), so the g2p probably needs some changes.
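
As an illustration of the symptom, the following sketch uses Python's standard unicodedata module to locate combining marks in a string and to show that NFC normalization folds them into precomposed letters. The helper name find_combining is hypothetical, and whether normalization is the right remedy for the g2p mapping is an assumption; the thread does not say how the mapping was changed.

```python
import unicodedata


def find_combining(text):
    """Return (index, code point, Unicode name) for each combining mark in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.combining(ch)
    ]


# 'e' followed by a combining grave accent (decomposed form).
decomposed = "e\u0300"
# NFC composes the pair into the single code point U+00E8.
composed = unicodedata.normalize("NFC", decomposed)
```

Running find_combining on the decomposed string reports the stray U+0300, while the NFC-composed string contains no combining marks.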

roedoejet commented 4 years ago

I believe I also just fixed this on the dev.fra branch of g2p here: https://github.com/roedoejet/g2p/commit/93d0781f22dfd0ed0fdc87cd97c089775fba6a6c

roedoejet commented 4 years ago

It's now doing the alignment, but I'm getting this error: "ERROR - Alignment produced a different number of segments and tokens, please examine dictionary and input audio and text."

joanise commented 4 years ago

Just tested with dev.fra branch on g2p, and I get the same error. Thanks for fixing my French g2p and the RAS bug.

joanise commented 4 years ago

I found the problem. Word <w>s</w> goes to nothing because of g2p rule s,,,\s|$. So the .dict file goes from token t0b0d0p10s0w42 to t0b0d0p10s0w44, skipping t0b0d0p10s0w43, which is empty.
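
To make the effect of that rule concrete, here is a rough regex rendering of it. This assumes the four comma-separated fields are input, output, preceding context, and following context, so that the rule deletes "s" before whitespace or end of string. The names below are hypothetical, and the narrowed pattern is only one conceivable way to spare a stand-alone "s", not the fix that was adopted.

```python
import re

# Assumed regex equivalent of the rule "s,,,\s|$":
# delete "s" when followed by whitespace or end of string.
ORIGINAL_RULE = re.compile(r"s(?=\s|$)")

# One possible narrowing (an assumption): also require a preceding
# word character, so a lone "s" is left untouched.
NARROWED_RULE = re.compile(r"(?<=\w)s(?=\s|$)")


def apply_rule(rule, word):
    """Apply a deletion rule to a single word and return the result."""
    return rule.sub("", word)
```

Under the original pattern the word "s" becomes the empty string, which is exactly the vanishing token described above; under the narrowed pattern a final "s" is still dropped from a longer word such as "droits", but "s" on its own survives.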

Two things here: 1) I should fix my rule so that it does not erase a stand-alone "s", especially since that's a real word in French, e.g., in "s'efforcent".

2) Studio should gracefully handle a word that vanishes, either with an explicit error message flagging it, if we don't want to support it, or with a way to align despite it if we do want to support it.
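
A sketch of the "explicit error message" option in point 2: compare the word IDs from the text against the IDs that made it into the dictionary and report any that vanished. The function name and the list-of-IDs inputs are hypothetical; Studio's actual data structures are not shown in this thread.

```python
def find_vanished_words(word_ids, dict_ids):
    """Return IDs of words in the text that have no entry in the dictionary."""
    present = set(dict_ids)
    return [wid for wid in word_ids if wid not in present]


# IDs taken from the example above: w43 is missing from the .dict file.
word_ids = [f"t0b0d0p10s0w{n}" for n in (42, 43, 44)]
dict_ids = ["t0b0d0p10s0w42", "t0b0d0p10s0w44"]

missing = find_vanished_words(word_ids, dict_ids)
if missing:
    # A caller could log this or raise instead of building a string.
    message = "No pronunciation was generated for: " + ", ".join(missing)
```

A check like this, run before alignment, would name the offending token directly instead of surfacing only as a segment/token count mismatch.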

roedoejet commented 4 years ago

Nice find! OK, if you push that change to g2p dev.fra I'll merge it with master. Will you turn point 2 into an issue?

joanise commented 4 years ago

Sure, but...

1) It might take me some more time to fix this. Go ahead and merge dev.fra now; I don't know when I'll succeed in fixing it, and I can work on master once I'm ready to figure it out. I've pushed two other unrelated small fixes there too.

2) Sure, I'll turn that into an issue.