CambridgeSemiticsLab / nena

The North Eastern Neo-Aramaic Database Site
https://nena.ames.cam.ac.uk

Cleaning: schwa #82

Closed Paul32N closed 8 months ago

Paul32N commented 3 years ago

There are near-identical symbols in the data entries.

ǝ Latin small turned E U+01DD e.g. https://nena.ames.cam.ac.uk/dialects/73/feature/5.3.7./edit

ə Latin small schwa U+0259 e.g. https://nena.ames.cam.ac.uk/dialects/109/feature/5.3.7./edit

These are currently treated as two separate symbols, but they should be one and the same. The second one is the correct one, so the small turned E should be treated in the same way as the small schwa. I don't know what the best way for you to proceed is: either have the system recognize small turned E as an instance of schwa, or automatically find and replace the turned E with the small schwa.
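For reference, the two characters are distinct code points even though they render almost identically. A minimal Python sketch (standard library only, not part of the site's code) that prints their Unicode names and checks whether Unicode normalization alone would merge them:

```python
import unicodedata

TURNED_E = "\u01dd"  # ǝ
SCHWA = "\u0259"     # ə

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def survives_normalization(ch: str) -> bool:
    """True if none of the four Unicode normalization forms changes the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)

print(describe(TURNED_E))
print(describe(SCHWA))
print(survives_normalization(TURNED_E))
```

Because normalization leaves the turned E untouched, the two symbols can only be unified by an explicit replacement or by comparison logic that treats them as equivalent.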

Thanks once again!

codykingham commented 3 years ago

Just FYI for @jamespstrachan, we also encounter this issue in the text corpus on occasion. This substitution should be in that subs list I sent a while back. An example can be seen here:

https://github.com/CambridgeSemiticsLab/nena_corpus/blob/b15e3cf681abe42ffa462256e8598d8f2be71917/sources/msdoc2html/convert.py#L36

jamespstrachan commented 3 years ago

I have implemented all of the non-regex substitutions from Cody's link above, which includes the turned-e to schwa replacement. The correction is applied to the text when it is loaded into the editor and again on every keystroke; the results are saved only when the user hits save on the page. Given that every one of these texts requires cleaning and/or alignment, and the majority are still to be input, I hope this is the appropriate place to enforce the rules.
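The idea of a fixed, non-regex substitution table applied on load and on every edit can be sketched in Python as follows. This is an illustration only, not the editor's actual code: the names `SUBSTITUTIONS` and `clean` are hypothetical, and the table contains only the one pair mentioned in this thread rather than Cody's full list.

```python
# Hypothetical substitution table; only the turned-e entry comes from this thread.
SUBSTITUTIONS = {
    "\u01dd": "\u0259",  # ǝ (turned e) -> ə (schwa)
}

# Precompute a translation table so each call is a single pass over the text.
_TABLE = str.maketrans(SUBSTITUTIONS)

def clean(text: str) -> str:
    """Apply every single-character substitution to the text."""
    return text.translate(_TABLE)

raw = "k\u01ddtwa"
print(clean(raw) == "k\u0259twa")
```

Since the replacement is idempotent, running it repeatedly (for example on each keystroke) gives the same result as running it once.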

jamespstrachan commented 3 years ago

And I have just realised that Paul was talking about the grammar entries rather than the text corpus!

It's possible to try to bulk-replace these characters in the database, but I'd rather avoid such a sweeping action if possible. Are the occurrences of intended-schwa chars predictable enough that the find-and-replace tool from #89 could be used to clean?

Paul32N commented 3 years ago

Thanks. I was indeed talking about the grammar tool. It is difficult to spot the wrong symbol yourself, so only a character replacement function would make sure you get rid of all of them. The occurrences are predictable only insofar as large numbers are presumably to be found in particular dialects (not in particular features as such). Since the find-and-replace function operates only in the feature-list/comparative view, one would have to run it for every feature where the wrong symbol could potentially occur. Perhaps a less laborious and less invasive way to resolve this is a sweeping replacement action per dialect. Would you be able to set parameters for a replacement action, so that you limit it to one dialect at a time? I may be able to find out which dialects are likely to need replacement.
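A replacement scoped to one dialect at a time could look roughly like the sketch below. It runs against an in-memory SQLite database with a simplified, hypothetical `entries` table that carries a `dialect_id` column directly; the real schema, database engine, and the way entries are linked to dialects are not shown in this thread and will differ.

```python
import sqlite3

TURNED_E, SCHWA = "\u01dd", "\u0259"

def replace_for_dialect(conn: sqlite3.Connection, dialect_id: int) -> int:
    """Replace turned e with schwa in one dialect's entries; return rows changed."""
    cur = conn.execute(
        "UPDATE entries SET entry = replace(entry, ?, ?) "
        "WHERE dialect_id = ? AND instr(entry, ?) > 0",
        (TURNED_E, SCHWA, dialect_id, TURNED_E),
    )
    conn.commit()
    return cur.rowcount

# Hypothetical demo data: two dialects, three entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, dialect_id INTEGER, entry TEXT)"
)
conn.executemany(
    "INSERT INTO entries (dialect_id, entry) VALUES (?, ?)",
    [(73, "\u02beaw\u01ddn"), (73, "\u02beaw\u0259n"), (109, "\u02beaw\u01ddn")],
)

changed = replace_for_dialect(conn, 73)
print(changed)
```

Returning the number of changed rows gives a quick sanity check per dialect before moving on to the next one.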

jamespstrachan commented 3 years ago

I have drafted some queries which broadly replace all instances of turned-e with schwa in the following tables and fields:

update dialects_dialectfeatureentry set entry = replace(entry,"ǝ","ə") where entry like "%ǝ%";
update dialects_dialectfeatureentry set comment = replace(comment,"ǝ","ə") where comment like "%ǝ%";
update dialects_dialectfeature set comment = replace(comment,"ǝ","ə") where comment like "%ǝ%";
update dialects_dialectfeatureexample set example = replace(example,"ǝ","ə") where example like "%ǝ%";
update audio_audio set transcription = replace(transcription,"ǝ","ə") where transcription like "%ǝ%";
update audio_audio set transcript = replace(transcript,"ǝ","ə") where transcript like "%ǝ%";

I have run these against the staging database, I believe successfully. Please could you check that they're no longer cropping up? Once you're happy that no data has been damaged by the update, let me know and I'll apply the same changes to production.
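One way to check that the character is no longer cropping up is to count the remaining rows per table and field. The sketch below uses the table and column names from the queries above, but runs against an in-memory SQLite stand-in with made-up data; connecting to the actual staging database, and whether `instr` is the right function for its engine, are assumptions left open here.

```python
import sqlite3

TURNED_E = "\u01dd"

# (table, column) pairs taken from the update queries in this comment.
TARGETS = [
    ("dialects_dialectfeatureentry", "entry"),
    ("dialects_dialectfeatureentry", "comment"),
    ("dialects_dialectfeature", "comment"),
    ("dialects_dialectfeatureexample", "example"),
    ("audio_audio", "transcription"),
    ("audio_audio", "transcript"),
]

def remaining(conn, targets=TARGETS):
    """Count rows per (table, column) that still contain a turned e."""
    counts = {}
    for table, column in targets:
        # Identifiers come from the fixed list above, never from user input.
        sql = f"SELECT count(*) FROM {table} WHERE instr({column}, ?) > 0"
        counts[(table, column)] = conn.execute(sql, (TURNED_E,)).fetchone()[0]
    return counts

# Stand-in database with one deliberately uncleaned entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dialects_dialectfeatureentry (entry TEXT, comment TEXT)")
conn.execute("CREATE TABLE dialects_dialectfeature (comment TEXT)")
conn.execute("CREATE TABLE dialects_dialectfeatureexample (example TEXT)")
conn.execute("CREATE TABLE audio_audio (transcription TEXT, transcript TEXT)")
conn.execute(
    "INSERT INTO dialects_dialectfeatureentry VALUES (?, ?)",
    ("\u02beaw\u01ddn", "ok"),
)

leftovers = {key: n for key, n in remaining(conn).items() if n}
print(leftovers)
```

An empty `leftovers` dictionary would mean none of the listed fields still contains the turned E.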

Of course, with the exception of the audio transcription interface, there's nothing to prevent future additions from using the wrong character. I suspect putting validation or value-coercion into all the possible editable fields is beyond the scope of this ticket!

Paul32N commented 1 year ago

Starting point: https://nena.ames.cam.ac.uk/grammar/features/44 (screenshots attached)

If you check the data for the feature ("Dialects with 4.1.1. 3ms") and go to the "Summary of full table" at the bottom, some entries that should be considered identical are still listed as distinct, which must be due to the use of different symbols. Did we ultimately implement the conversion of inverted E to schwa or not?

In the picture you can see that inverted E still occurs sporadically. But it seems to me that all of the cases listed as <ʾawən> contain a schwa, and yet they're listed as distinct, so something else may also be involved.
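When entries look identical but are counted as distinct, listing their code points usually reveals the cause. The Python sketch below is a generic diagnostic, not a statement about what is actually wrong in this table; the two sample strings are hypothetical and differ only in their first character (a right half ring versus a modifier apostrophe), which is one kind of look-alike that could produce this effect. Trailing whitespace is another.

```python
import unicodedata

def codepoints(text: str) -> list:
    """List each character as 'U+XXXX NAME' so look-alikes become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def first_difference(a: str, b: str):
    """Return the index of the first differing position, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one string has extra trailing characters
    return None

# Hypothetical pair that renders almost identically.
a = "\u02beaw\u0259n"  # starts with MODIFIER LETTER RIGHT HALF RING
b = "\u02bcaw\u0259n"  # starts with MODIFIER LETTER APOSTROPHE

i = first_difference(a, b)
print(i, codepoints(a)[i], "vs", codepoints(b)[i])
```

Running this on two entries copied from the summary table would show whether they differ in the schwa, in another character, or in invisible whitespace.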

jamespstrachan commented 1 year ago

From my old comment in this thread:

I have run these against the staging database, I believe successfully. Please could you check that they're no longer cropping up? Once you're happy that no data has been damaged by the update, let me know and I'll apply the same changes to production.

I suspect that the staging database has not been overwritten since that time so you should still be able to check the effectiveness of grammar entry schwa conversion there. I'm pretty sure this has never been applied to production due to lack of confirmation. Easy enough to run if you're happy it's not breaking data.

This bit from the same comment still applies:

Of course, with the exception of the audio transcription interface, there's nothing to prevent future additions from using the wrong character. I suspect putting validation or value-coercion into all the possible editable fields is beyond the scope of this ticket!

jamespstrachan commented 8 months ago

This has now been done.