UAlbertaALTLab / recording-validation-interface

Maskwacîs recordings validation interface
https://speech-db.altlab.app/

Merging phrases in speech-db #457

Open fbanados opened 2 months ago

fbanados commented 2 months ago

Sometimes there are multiple entries for the same phrase. There should be an interface to merge them, and a script to automatically merge entries whose transcription and translation are the same. (Split from #444)

fbanados commented 2 months ago

Example of multiple entries: https://speech-db.altlab.app/maskwacis/search/?query=namôya+ê-ayamihât

fbanados commented 2 months ago

This example raises interesting issues, as the analysis and translations differ between entries. [Screenshot 2024-09-05 at 4 24 19 PM]

fbanados commented 2 months ago

Question: What is the full set of fields that defines two phrases as "the same" here, in the sense that it would be OK to merge them automatically and keep any one of them? I'm wondering in particular about the fields in the database beyond transcription and translation.

I'm currently requiring all fields to be the same, but that is too strict in some cases. I am inclined to disregard differences in field_transcription (which arise, e.g., when a change in the transcription has made two entries identical), modifier (the person who last touched the entry), and semantic class (RW needs to be regenerated anyway). For the other fields I don't know; this would require a linguist's decision.

aarppe commented 2 months ago

I had originally been thinking that if the transcription (in its latest state, so not necessarily the field transcription) and the translation (excluding spaces at the edges) are exactly the same, then the entries could be merged. For the other fields, if a value occurs for only one entry and not the other, or is exactly the same for both entries, then one could use that common value. For other fields whose values do not match, one could combine them in the merged entry.

But would this result in ambiguous cases?
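The rule described above can be sketched in plain Python to see where it might become ambiguous. This is only an illustration under stated assumptions: entries are represented as dictionaries, and the field names (`transcription`, `translation`, and the extra fields in the usage note) are hypothetical rather than taken from the speech-db schema. "Combining" mismatched values is modelled here as keeping both in a list, which is one possible reading of the proposal.

```python
from typing import Any

# Hypothetical key fields; the real speech-db schema may differ.
KEY_FIELDS = ("transcription", "translation")


def same_phrase(a: dict[str, Any], b: dict[str, Any]) -> bool:
    """Exact match on transcription and on the edge-trimmed translation."""
    return (
        a["transcription"] == b["transcription"]
        and a["translation"].strip() == b["translation"].strip()
    )


def merge_entries(a: dict[str, Any], b: dict[str, Any]) -> dict[str, Any]:
    """Merge two entries that represent the same phrase."""
    if not same_phrase(a, b):
        raise ValueError("entries differ in transcription or translation")

    merged: dict[str, Any] = {
        "transcription": a["transcription"],
        "translation": a["translation"].strip(),
    }
    for field in sorted((a.keys() | b.keys()) - set(KEY_FIELDS)):
        va, vb = a.get(field), b.get(field)
        if va in (None, ""):
            merged[field] = vb          # only b has a value (or neither does)
        elif vb in (None, "") or va == vb:
            merged[field] = va          # only a has a value, or both agree
        else:
            merged[field] = [va, vb]    # conflict: keep both values
    return merged
```

For example, `merge_entries({"transcription": "t", "translation": " g ", "comment": "x"}, {"transcription": "t", "translation": "g", "comment": "y", "source": "s"})` keeps `"source": "s"` from the second entry and records the conflicting comments as `["x", "y"]`. Fields that end up as lists are the ones a linguist might still want to review.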

fbanados commented 2 months ago

I don't think it results in ambiguous cases. I was designing an interface for automatically listing all possible candidates, but that will be unnecessary once all the ambiguous cases are dealt with. I would also not be surprised if listing every candidate made the interface take too long to load and time out, so I think it's better to do the automatic merging separately. In general, automatic merging can be done with a manage.py command, and we can keep the interface just for search and merge.
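As a rough sketch of the candidate-finding step such an automatic pass would need, the function below groups entries by exact transcription and edge-trimmed translation and reports the groups with more than one member. It is not the code from this issue: the dictionary representation and the `id`, `transcription`, and `translation` keys are assumptions, and in the actual project the same grouping would presumably run over a Django queryset inside the management command.

```python
from collections import defaultdict
from typing import Any


def find_merge_candidates(entries: list[dict[str, Any]]) -> list[list[int]]:
    """Return lists of entry ids that share the same transcription and translation."""
    groups: dict[tuple[str, str], list[int]] = defaultdict(list)
    for entry in entries:
        # Translation is compared without leading/trailing spaces.
        key = (entry["transcription"], entry["translation"].strip())
        groups[key].append(entry["id"])
    # Only groups with at least two entries are merge candidates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Running this offline keeps the expensive scan out of the web request, which is the concern about load time raised above.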

fbanados commented 2 months ago

Code is ready. Whether and when to run the Django command on the server will be discussed via email.

aarppe commented 2 months ago

Adding the linguist-administrator role is needed for linguists to undertake merging of individual items. That would be useful for checking the behavior with individual entries before, or instead of, running the merge wholesale computationally.

aarppe commented 2 weeks ago

@fbanados I don't think we need to delay this any further, as I've been able to observe that the merging of individual entries has worked properly - so we can proceed with computationally merging all the entries for which the transcriptions and translations are exactly the same.

fbanados commented 2 weeks ago

I will make a database backup before merging.

fbanados commented 2 weeks ago

Entries merged.