[data reading] Qumin always re-segments wordforms, even in the presence of spaces

XachaB commented 4 weeks ago

The behaviour is that of Qumin V.1: always re-segment. However, it might be better to make it possible to respect the given segmentation. Unfortunately, Qumin has quirks regarding phonology, and the segmentation needed tends to be different from that which I use in later datasets (where I write tiered information, such as length, tones, stress, on the syllable's vowel).

Note that all Paralex datasets now MUST have space separation:

"The value of the phon_form MUST be a sequence of space-separated segments, e.g. not "dominoːrum" but "d o m i n oː r u m"."

I think we need the following cases:

There are no spaces => throw an error, not Paralex compliant (although we do have the means to still parse things... should we?)
There are spaces: by default, split on spaces
There are spaces, but a user-defined config parameter asks to re-split: re-split (current behaviour)

@JPapir: what do you think? Is that reasonable?

The line doing the splitting is here:

https://github.com/XachaB/Qumin/blob/2ea1782cfebd8f662b8075770ebbf279cc36e6f3/src/qumin/representations/segments.py#L44

XachaB commented 4 weeks ago

Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment.
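Combining the three cases listed earlier with this amendment, the decision could be sketched roughly as follows. This is only an illustration: `segment_forms`, `resegment` and `resegmenter` are hypothetical names, not Qumin's actual API, and the choice to check the column as a whole (because a one-segment form legitimately contains no space) is an assumption of the sketch.

```python
def segment_forms(forms, resegment=False, resegmenter=None):
    """Turn a column of phon_form strings into lists of segments.

    All names are hypothetical. `resegmenter` stands for whatever
    inventory-based splitter is available (a callable str -> list of str).
    """
    forms = list(forms)
    if resegment:
        if resegmenter is None:
            raise ValueError("resegment=True needs a resegmenter")
        # Explicit opt-in: ignore the given segmentation, spaces or not.
        return [resegmenter(form.replace(" ", "")) for form in forms]
    if forms and not any(" " in form.strip() for form in forms):
        # No space anywhere in the column: not Paralex compliant.
        raise ValueError(
            "phon_form values are not space-separated; "
            "pass resegment=True to re-segment them"
        )
    # Default: respect the segmentation given in the dataset.
    return [form.split() for form in forms]
```

With this ordering, unsegmented data is only accepted when the user has asked for re-segmentation explicitly.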

Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.
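A strict check instead of silent skipping could look something like the sketch below. `check_known` is a hypothetical helper, and the inventory is assumed to be a plain collection of segment strings, which may not match how Qumin stores it.

```python
def check_known(segments, inventory):
    """Return the segments unchanged, or raise if any is missing from the inventory.

    Hypothetical helper: `inventory` is assumed to be an iterable of strings.
    """
    unknown = sorted(set(segments) - set(inventory))
    if unknown:
        raise ValueError(f"Unknown segments (not in the sound inventory): {unknown}")
    return list(segments)
```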

JPapir commented 4 weeks ago

> Currently, the behaviour also ignores any unknown segments silently, which looks like a good way to get bugs (unless one has forced a segcheck). I think I also need to remove that.

Normally, Paralex already ensures that this doesn't happen, but maybe it would be better to throw an exception.

> There are no spaces => throw an error, not Paralex compliant (although we do have the means to still parse things... should we?)

We usually throw an error when a dataset is not Paralex compliant, so we should probably be consistent with that. But, if we do...

> Actually, when there is no space, as long as resegment=True, I think we shouldn't complain and just re-segment.

...Then I agree that it should be done only if an explicit keyword is given. That way, the user will always be aware of what is going on behind the scenes.

> There are spaces: by default, split on spaces

I hadn't noticed that this wasn't the case, but we should definitely do that.

> There are spaces, but a user-defined config parameter asks to re-split: re-split (current behaviour)

Why not. I do not see a use case, but since this is already implemented, it is probably easy to keep.

XachaB commented 4 weeks ago

The use case is this: in recent datasets, I mark any non-segmentals (tone, stress, length) directly on a segment (not separated by spaces), but my sound inventory neatly defines these on separate rows. More recent software is able to parse this format, e.g. "b aː b a" with an inventory of three sounds "b", "a", and "ː" (non-segmental). But Qumin would only work with an inventory of "b", "a", "aː" (and then a long version of every sound). However, with the first inventory, Qumin can indeed work if it re-parses "baːba" using the list of sounds "b", "a", "ː", and finds "b a ː b a".
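To make the "baːba" example concrete, here is a minimal sketch of inventory-driven re-segmentation using a longest-match regular expression. It is not a copy of what segments.py does; `make_resegmenter` is a hypothetical name, and the inventory is assumed to be a plain collection of strings.

```python
import re

def make_resegmenter(inventory):
    """Build a function that splits a form into sounds from `inventory`.

    Hypothetical helper, not Qumin's API.
    """
    # Longest sounds first, so that "aː" wins over "a" when both are listed.
    ordered = sorted((s for s in inventory if s), key=len, reverse=True)
    if not ordered:
        raise ValueError("the sound inventory is empty")
    pattern = re.compile("|".join(re.escape(s) for s in ordered))

    def resegment(form):
        compact = form.replace(" ", "")  # discard the given segmentation
        segments, pos = [], 0
        while pos < len(compact):
            match = pattern.match(compact, pos)
            if match is None:
                # Fail loudly instead of skipping unknown material.
                raise ValueError(f"unknown segment at position {pos} in {compact!r}")
            segments.append(match.group())
            pos = match.end()
        return segments

    return resegment

tiered = make_resegmenter({"b", "a", "ː"})
print(" ".join(tiered("b aː b a")))  # b a ː b a

flat = make_resegmenter({"b", "a", "aː"})
print(" ".join(flat("baːba")))       # b aː b a
```

The same input therefore yields different segmentations depending on which inventory is supplied, which is the point of keeping the re-split option.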

JPapir commented 4 weeks ago

All right, I see the link with the first post now. Then we can try this. I am not very familiar with this part of Qumin though. If I understand correctly, this will only work if the non-segmental information is written in the phon_form in a way that looks like a segment (which is the case for ː, but maybe not for everything, for instance if someone writes a tone with a diacritic: â or á).
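Whether a diacritic such as the one on â can be split off depends partly on how the string is encoded. The sketch below uses Unicode NFD normalization to show the difference between a precomposed character and a base letter followed by a combining mark. It assumes the dataset uses precomposed characters and that the inventory would list the combining mark on its own row; `decompose` is a hypothetical helper. Note that NFD also decomposes other precomposed letters (for instance nasal vowels written with a tilde), which may or may not be wanted.

```python
import unicodedata

def decompose(form):
    """Return the form in NFD, so precomposed diacritics become separate code points."""
    return unicodedata.normalize("NFD", form)

precomposed = "b\u00e2"  # "bâ", with â stored as the single code point U+00E2
decomposed = decompose(precomposed)

# Precomposed: two code points, the tone cannot be matched separately.
print([hex(ord(c)) for c in precomposed])  # ['0x62', '0xe2']
# Decomposed: "a" followed by the combining circumflex U+0302.
print([hex(ord(c)) for c in decomposed])   # ['0x62', '0x61', '0x302']

# The length mark is a spacing modifier letter, so it is already separate.
print(unicodedata.combining("\u02d0"))     # 0
```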

JPapir commented 4 weeks ago

I guess that the best solution would be to make Qumin itself tier-compatible, but this would probably be way too much (useless) work on the pattern module.

XachaB commented 4 weeks ago

At least, that's not a short term goal :)

XachaB/Qumin, issue #35