aleksas / lt-norm-stress-dataset-gen

tensor2tensor for stressing text in Lithuanian

sequence labeling #1

Open aleksas opened 5 years ago

aleksas commented 5 years ago

Sorry to bother you, @martinpopel. I noticed your comment in a post, where you mentioned:

If this is really the case (and the whole task is just about inserting spaces), then it would be better to treat the task as sequence labeling (binary: split vs. no-split) rather than sequence-to-sequence translation

First of all, pardon my cluelessness; I'm new to tensor2tensor and to text processing with neural networks in general. My simple question is:

I've looked through the t2t Text2ClassProblem list and it seems they all assign classes to sentences. Is it possible to assign a class per position to get split vs. no-split as you mentioned, or does a lower-level Problem class need to be defined?

More complex question:

I was wondering whether sequence labeling would solve my problem better than approaching it as a translation problem. What I want to achieve is to automatically mark stress (accentuation) on words in Lithuanian. Stress is quite complex in Lithuanian: a word's meaning may differ depending on context, which may change which syllable is accented. I tried to train a transformer on this as a translation problem, but the results are not satisfactory: too many "translation" errors. It feels like shooting a bird with a cannon. All the model needs to provide is a single position within the word and the type of stress, if the word is stressed at all (it may be unstressed). I've looked through the t2t Text2ClassProblem list and it seems they all assign classes to sentences. Any suggestions on how to move forward?
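To make the "single position plus stress type" target concrete, here is a minimal sketch of how one stressed word could be turned into such a label. It assumes (this is not stated in the thread) that stress is written with the Unicode combining grave, acute, and tilde marks; the function name `word_to_label` and the label names are hypothetical.

```python
import unicodedata

# Assumption: stress is marked with these combining characters.
# Ordinary Lithuanian letters (ą, č, ė, ę, į, š, ų, ū, ž) use other marks.
STRESS_MARKS = {"\u0300": "grave", "\u0301": "acute", "\u0303": "tilde"}


def word_to_label(stressed_word):
    """Return (plain_word, stress_index, stress_type).

    stress_index is the position of the stressed letter in plain_word;
    both stress_index and stress_type are None for an unstressed word.
    """
    decomposed = unicodedata.normalize("NFD", stressed_word)
    kept = []
    base_count = 0  # base (non-combining) characters seen so far
    stress_index = None
    stress_type = None
    for ch in decomposed:
        if ch in STRESS_MARKS:
            # The mark belongs to the most recent base character.
            stress_index = base_count - 1
            stress_type = STRESS_MARKS[ch]
            continue
        if unicodedata.combining(ch) == 0:
            base_count += 1
        kept.append(ch)
    plain_word = unicodedata.normalize("NFC", "".join(kept))
    return plain_word, stress_index, stress_type


print(word_to_label("la\u0303bas"))
print(word_to_label("ir"))
```

With a label of this shape, the model only has to choose a position and a class per word instead of regenerating the whole sentence.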

martinpopel commented 5 years ago

Text2Class problems assign a single class to the whole sentence, so they are not usable here. If each word can have just two variants, stressed and unstressed, you could treat the task as sequence labeling with two classes and use just transformer_encoder followed by a softmax (and no decoder). I have never implemented this in T2T, so I cannot provide more hints. However, I guess Lithuanian stress is more complex. Maybe you could split the text into syllables (instead of subwords) and predict a class for each syllable (e.g. accent and pitch).
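The output side of such a labeling setup can be sketched without any T2T code. The example below assumes an encoder has already produced one logit vector per input position; it applies a softmax per position, picks the most likely class, and writes the predicted marks back into the text. Labeling per character rather than per syllable, the four-class label set, and all function names are assumptions made for illustration only.

```python
import math
import unicodedata

# Hypothetical label set: one class per input character.
LABELS = ["none", "grave", "acute", "tilde"]
MARKS = {"grave": "\u0300", "acute": "\u0301", "tilde": "\u0303"}


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_labels(per_position_logits):
    """Pick the most probable label independently at each position."""
    labels = []
    for logits in per_position_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(LABELS[best])
    return labels


def apply_labels(plain_word, labels):
    """Insert the predicted stress marks after the labeled characters."""
    out = []
    for ch, label in zip(plain_word, labels):
        out.append(ch)
        if label != "none":
            out.append(MARKS[label])
    return unicodedata.normalize("NFC", "".join(out))


# Made-up logits standing in for encoder outputs for the word "labas".
fake_logits = [
    [4.0, 0.1, 0.2, 0.1],
    [0.3, 0.2, 0.1, 3.5],
    [4.0, 0.1, 0.2, 0.1],
    [4.0, 0.1, 0.2, 0.1],
    [4.0, 0.1, 0.2, 0.1],
]
print(apply_labels("labas", decode_labels(fake_logits)))
```

Because the output always has exactly one label per input position, the model cannot drop, duplicate, or rewrite letters, which is the kind of "translation" error a decoder can introduce.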

Yet another option could be character-based T2T sequence2sequence. This should be easy to try.
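For the character-based variant, the training pairs could be built roughly as follows: the source is the text with stress marks removed and the target is the stressed text, both split into characters. This is only a sketch; it assumes stress is written with the combining grave, acute, and tilde marks, and `make_char_pair` is a hypothetical helper, not part of T2T.

```python
import unicodedata

# Assumption: these combining marks are used only for stress.
# Foreign words that use them as ordinary accents would be altered too.
STRESS_MARKS = {"\u0300", "\u0301", "\u0303"}


def make_char_pair(stressed_sentence):
    """Return (source_chars, target_chars) for character-level seq2seq."""
    decomposed = unicodedata.normalize("NFD", stressed_sentence)
    plain = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    source = list(unicodedata.normalize("NFC", plain))
    target = list(unicodedata.normalize("NFC", stressed_sentence))
    return source, target


source, target = make_char_pair("la\u0303bas ry\u0301tas")
print("".join(source))
print("".join(target))
```

At the character level the source and target differ only at the stressed letters, so the vocabulary stays tiny compared with a subword vocabulary.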

aleksas commented 5 years ago

Thanks for your prompt response. I'm still experimenting with translation, but with smaller vocabularies, just out of curiosity. Over several iterations of reducing the vocabulary size, eval precision has increased steadily.

I will try char-level translation. I just realized I already have it defined, since I adapted the en-de problem.

EDIT Just to save for the future: