Open fbanados opened 2 months ago
Example of multiple entries: https://speech-db.altlab.app/maskwacis/search/?query=namôya+ê-ayamihât
This example raises interesting issues, as analysis and translations are different:
Question: What is the full set of fields that define two phrases as "the same" here, as in "it would be OK to automatically merge them and pick either of them"? I'm wondering in particular about the fields in the database beyond transcription and translation:
field_transcription
analysis
comment
status
semantic class (RW)
modifier
I'm currently asking for all fields to be the same, but that is too strict in some cases. I am inclined to disregard differences in field_transcription (which arise, e.g., when there's been a change in the transcription that makes them now the same), modifier (the person who last touched the entry), and semantic class (RW needs to be regenerated anyway). For the others I don't know; this would require a linguist's decision.
I had originally been thinking that if the transcription (in its latest state, so not necessarily the field transcription) and the translation (excluding spaces at the edges) are exactly the same, then the entries could be merged. For the other fields, if a value occurs for only one entry and not the other, or is exactly the same for both entries, then one could use that common value. For other fields that do not match, one could combine the values in the merged entry.
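To make the proposed rule concrete, here is a minimal sketch in plain Python. It is not the code from this repository: entries are modelled as dicts, the key names (`transcription`, `translation`, `semantic_class`, and so on) are assumptions based on the field list above, and joining conflicting values with a separator is only one possible reading of "combine them".

```python
# Hypothetical field names, loosely based on the list in this thread.
KEY_FIELDS = ("transcription", "translation")
# Fields whose differences are disregarded; they are left out of the merged
# result here on the assumption that they are regenerated or set later.
IGNORED_FIELDS = {"field_transcription", "modifier", "semantic_class"}


def can_auto_merge(a: dict, b: dict) -> bool:
    """True if the transcriptions match exactly and the translations match
    after stripping whitespace at the edges."""
    return (
        a["transcription"] == b["transcription"]
        and a["translation"].strip() == b["translation"].strip()
    )


def merge_entries(a: dict, b: dict, separator: str = " | ") -> dict:
    """Merge two entries following the rule sketched in the comment above."""
    if not can_auto_merge(a, b):
        raise ValueError("entries differ in transcription or translation")

    merged = {
        "transcription": a["transcription"],
        "translation": a["translation"].strip(),
    }
    other_fields = (set(a) | set(b)) - set(KEY_FIELDS) - IGNORED_FIELDS
    for field in sorted(other_fields):
        left = a.get(field) or ""
        right = b.get(field) or ""
        if not left or not right or left == right:
            # Present on only one side, or identical: use that value.
            merged[field] = left or right
        else:
            # Conflicting values: keep both (assumed combination strategy).
            merged[field] = left + separator + right
    return merged
```

With this rule every pair that passes `can_auto_merge` produces exactly one result, so no case is left undecided; the open question is only whether a concatenated value such as two different analyses is acceptable to linguists.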
But would this result in ambiguous cases?
I don't think it results in ambiguous cases. I was designing an interface for automatically listing all possible candidates, but that will be unnecessary once all the ambiguous cases are dealt with. I also would not be surprised if listing every candidate made the interface take too long to load and time out, so I think it's better to have the automatic merging done separately. In general, automatic merging can be done with a manage.py command, and we can keep the interface just for search and merge.
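As a rough illustration of the batch step, the sketch below finds groups of entries that qualify for automatic merging. It is written in plain Python so it can be read without the project's models; the dict keys `id`, `transcription` and `translation` are assumed names, and in an actual Django management command this logic would presumably be called from the command's `handle` method over a queryset.

```python
from collections import defaultdict


def find_merge_groups(entries):
    """Group entry ids by (transcription, stripped translation) and return
    only the groups with more than one member, i.e. the merge candidates."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["transcription"], entry["translation"].strip())
        groups[key].append(entry["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Running this once offline avoids computing all candidates on every page load, which is the timeout concern raised above.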
Code is ready. Whether and when to run the Django command on the server is to be discussed via email.
Adding the linguist-administrator role is needed for linguists to undertake merging of individual items. That would be useful for checking the behavior with individual entries, before or instead of merging wholesale computationally.
@fbanados I don't think we need to delay this any further, as I've been able to observe that the merging of individual entries has worked properly - so we can proceed with computationally merging all the entries for which the transcriptions and translations are exactly the same.
I will make a database backup before merging.
Entries merged.
Sometimes there are multiple entries for the same phrase. There should be an interface to merge them, and a script to automatically merge entries whose transcription and translation are the same. (Split from #444)