susanodd commented 1 year ago

Make CSV Import more parametric. The Python code is very long-winded. Add parameters so the user can, for example, toggle whether fields should be erased or ignored when the corresponding CSV cell is empty.

susanodd commented 1 year ago

import-csv-update-notes-toggle

susanodd commented 1 year ago

Okay, I just recalled that it's the Tags that need to be this way. It probably needs to be a list of toggles so they can be set independently.

susanodd commented 1 year ago

TO DO: Since the field choice / model translations / semantic fields / derivation history / handshape modifications, these fields use foreign keys to their own models rather than field choice foreign keys. The Import CSV Update code for updating these fields has not yet been modified to work with the new models; they used to all be field choices. So at the moment, if the user has the columns 'Semantic Field', 'Derivation history', 'Strong Hand', 'Weak Hand' in their CSV file, these are skipped. Normally the user removes all columns except those to update. But if these columns have not been removed, they are ignored so as not to generate an error.

[This is a side effect of #658. Nothing to do with being hard-coded, everything to do with the introduction of new models.]

susanodd commented 1 year ago

@ocrasborn I have implemented your wish to be able to remove tags and add notes.

I added some toggles to choose the semantics of an empty cell, as well as whether a value in the notes column should replace existing notes, or be added to existing notes.

Here are screenshots of the toggles and the identified changes. In this example, the settings and the CSV remove the old tag and add a new note to the existing notes.

settings-csv-update-notes-tags

update_csv_remove_tags_add_notes
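The toggle semantics described above can be sketched roughly as follows. This is only an illustration, not the code on the branch: the function names, the parameter names (`erase_when_empty`, `replace_existing`) and the ", " separator used when appending notes are all assumptions.

```python
def resolve_tags(existing, cell, erase_when_empty):
    """Return the tag list a gloss should end up with, given one CSV cell."""
    new_tags = [tag.strip() for tag in cell.split(",") if tag.strip()]
    if not new_tags:
        # Empty cell: either wipe the tags or leave them untouched.
        return [] if erase_when_empty else list(existing)
    return new_tags


def resolve_notes(existing, cell, erase_when_empty, replace_existing):
    """Return the notes text a gloss should end up with, given one CSV cell."""
    text = cell.strip()
    if not text:
        # Empty cell: either wipe the notes or leave them untouched.
        return "" if erase_when_empty else existing
    if replace_existing or not existing:
        return text
    # Append mode; the ", " separator is an assumption of this sketch.
    return existing + ", " + text
```

With these hypothetical helpers, removing an old tag while adding to the existing notes corresponds to `resolve_tags(old_tags, "", erase_when_empty=True)` together with `resolve_notes(old_notes, new_note, erase_when_empty=False, replace_existing=False)`.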

susanodd commented 1 year ago

This has been pushed to branch csv_interface

The code is very intricate but it works. I'm not sure whether anybody will want to read the code to review it. It doesn't affect other code and is self-contained. I believe it can be merged without breaking anything.

In order to "parse" the new notes and compare them to the existing notes to determine whether they have been updated, the notes name is first mapped to the notes machine value, then the notes are parsed, then the machine value is mapped back to the name. Here is the code which splits the notes of the csv cell: https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1035-L1054 This was necessary because of the text field in notes. The text that users have written also contains punctuation, which messes up simple parsing. Some of the note names contain parentheses.
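As a rough illustration of the map / parse / map-back idea (not the linked code itself), here is a minimal sketch. The cell syntax `Label: (text), Label: (text)`, the `NOTE_TYPES` table and its machine values, and the function name `parse_notes` are all assumptions made for the example.

```python
import re

# Hypothetical label -> machine value table; the real one lives in the database.
NOTE_TYPES = {"Note": "1", "Project Note": "7", "Remark (phonology)": "3"}


def parse_notes(cell, note_types=NOTE_TYPES):
    """Split a notes cell into (label, text) pairs."""
    by_value = {value: label for label, value in note_types.items()}

    # Step 1: replace each label by a marker holding its machine value.
    # Longer labels go first so "Project Note" is not mistaken for "Note".
    marked = cell
    for label in sorted(note_types, key=len, reverse=True):
        marked = marked.replace(label + ": (", "\x00" + note_types[label] + "\x00")

    # Step 2: split on the markers; punctuation in the text is now harmless.
    parts = re.split(r"\x00(\d+)\x00", marked)

    # Step 3: map each machine value back to its label.
    notes = []
    for value, body in zip(parts[1::2], parts[2::2]):
        body = body.strip().rstrip(",").rstrip()
        if body.endswith(")"):
            body = body[:-1]  # drop only the closing parenthesis of the entry
        notes.append((by_value[value], body))
    return notes
```

Because the split happens on markers rather than on commas, colons or parentheses, a note text such as `see 12 (old)` survives intact.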

The code has been tested by exporting all of the NGT glosses to CSV and then importing the file again as an update. To ensure that the "new" notes syntax matches the "original" notes syntax of the CSV, the export to CSV was modified. Namely, the sorting was causing problems. Now the export to CSV uses the same sorting and tuple reordering as the Import CSV Update.

There are also some glitches because the note names include both "Note" and "Project Note", which caused problems because one is a substring of the other. There is a reverse sort on the name field of a NoteType field in order to avoid a wrong match. https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1009-L1010 https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1108
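The substring problem can be reproduced in a few lines. This is a sketch with a hypothetical `match_label` helper, not the code linked above; it only shows why the order in which names are tried matters.

```python
def match_label(text, labels):
    """Return the first label from `labels` that occurs in `text`, or None."""
    for label in labels:
        if label in text:
            return label
    return None


labels = ["Note", "Project Note"]
cell = "Project Note: needs review"

# Ascending order tries "Note" first and matches it inside "Project Note".
wrong = match_label(cell, sorted(labels))

# Reverse order tries "Project Note" first, so the longer name wins.
right = match_label(cell, sorted(labels, reverse=True))
```

Here `wrong` is `"Note"` and `right` is `"Project Note"`. A reverse alphabetical sort works for these two names; sorting by length (longest first) would be a more general variant, but that is a suggestion of this sketch rather than what the linked code does.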

susanodd commented 1 year ago

@ocrasborn The recent changes for Notes and Tags as shown above are live now.

susanodd commented 1 year ago

I'm busy revising the CSV import code. Some irrelevant obsolete code has been removed, as well as extra (now unnecessary) things related to field choice fields.

To assist in debugging the code, I exported the NGT dataset (signs older than 2017, because I was initially interested in missing language fields, since the default is now defined as English; see fragments below).

https://github.com/Signbank/Global-signbank/blob/fbaabbd3ed80e8612c32523e169698cfde8cfd12/signbank/tools.py#L1622-L1628

https://github.com/Signbank/Global-signbank/blob/fbaabbd3ed80e8612c32523e169698cfde8cfd12/signbank/settings/server_specific/default.py#L82

But the above is only used on Annotation fields, not on Lemma translations (which are sometimes empty for English for NGT).

So it works to import the original CSV file now without bugs. (There were some bugs related to Django and Python upgrades!! Those have been fixed -- not on master yet)

Here are two things that the import of the original file (as an update) gives:


Import CSV Update

- WARNING: For gloss STUDENT-B (4018), new Sequential Morphology value 3819 is duplicate.
- WARNING: For gloss wesseltest4 (3283), new Sequential Morphology value 2317 is duplicate.

Indeed, the first one is a sequential morphology where both components are L. The current code (export and import) exports only the gloss ids of the sequential morphology, not the role (component). Since the export only exported the gloss ids, the import detects this as duplicate rather than sequential (the components of both of those glosses have different labels).
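To make the cause of the warning concrete, here is a small sketch of a duplicate check. The `find_duplicates` helper and the role numbers 1 and 2 are hypothetical; only the gloss id 3819 comes from the warning above.

```python
def find_duplicates(items):
    """Return the items that occur more than once, in order of reappearance."""
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates


# What the export currently carries: gloss ids only.
ids_only = [3819, 3819]

# What would distinguish the components: (role, gloss id) pairs.
with_roles = [(1, 3819), (2, 3819)]

flagged = find_duplicates(ids_only)
not_flagged = find_duplicates(with_roles)
```

With ids only, the second component is flagged as a duplicate; once the role is part of the compared value, nothing is flagged.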

susanodd commented 1 year ago

Related to this, I found this issue:

#351

susanodd commented 1 year ago

I found a gnarly gloss glitch while debugging code.

The gloss ELAN has a tab (\t) at the end of its English annotation field.

susanodd commented 1 year ago

I fixed various bugs in the csv gloss export/import/update routines:

- A field choice bug in sequential morphology creation!!! (This only occurred if sequential morphology was updated. But the error was being caught, so it was not visible.)
- A Django/Python upgrade omission on HTML unescape. (The library does not exist anymore, but because the error was being caught it did not show up. This was being called on every line of input in the CSV.)
- Bugs in the parsing of Notes due to parentheses in the notes label and the appearance of notes labels in the notes text. Parsing of notes is done in three steps: first the label is identified and replaced with the machine value, then the notes are parsed, then the labels are put back. This solution was needed in order to accommodate complex text in the notes. Users put gloss ids, colons, parentheses, numbers, as well as tabs in the notes.
- Stripping of annotation and translation text. A tab character was found at the end of one of the annotation fields. This caused misaligned columns in the CSV export.
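Two of the fixes listed above (HTML unescaping and stripping of stray whitespace) can be sketched with the standard library. The helper names `clean_cell` and `write_rows` and the sample row are hypothetical, and using `html.unescape` as the replacement for the removed library is an assumption of this sketch, not a statement about what the fix actually uses.

```python
import csv
import html
import io


def clean_cell(value):
    """Unescape HTML entities and strip surrounding whitespace, tabs included."""
    return html.unescape(value).strip()


def write_rows(rows):
    """Write rows to CSV text, cleaning every string cell first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(
            [clean_cell(cell) if isinstance(cell, str) else cell for cell in row]
        )
    return buffer.getvalue()


# A trailing tab like the one found on the ELAN annotation; "42" is made up.
output = write_rows([["ELAN\t", "42"]])
```

After cleaning, the tab no longer reaches the exported file, so it cannot shift the columns when the file is opened again.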

susanodd commented 1 year ago

I fixed various bugs in Import CSV Update related to Handshape and FieldChoice fields.

The Handshape fields were actually being ignored (glossed over) during input. Apparently, when we were busy modifying field choices and handshapes to have model translations, we just commented out some code that would need to be updated. The FieldChoice fields were also not being "dereferenced" from just a character "machine value" to an actual FieldChoice object reference. In both cases, this caused the actual "do changes" step to not update the fields.

The changes have been deployed. This still needs to be revised for the SemanticField and DerivationHistory fields (foreign key to model, multiselect). Those are still being skipped.
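The "dereferencing" step mentioned above amounts to turning the string from the CSV into the object the field should point to. A minimal sketch without Django follows; the `Choice` dataclass, the `CHOICES` table with its values, and the `dereference` function are hypothetical stand-ins for the real models and lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    """Hypothetical stand-in for a FieldChoice row."""
    field: str
    machine_value: int
    name: str


# Made-up sample data, keyed by (field, machine value).
CHOICES = {
    (choice.field, choice.machine_value): choice
    for choice in [
        Choice("handedness", 1, "One-handed"),
        Choice("handedness", 2, "Two-handed"),
    ]
}


def dereference(field, raw_value):
    """Map the character machine value from the CSV to a Choice object."""
    try:
        key = (field, int(raw_value))
    except ValueError:
        return None  # not a machine value at all
    return CHOICES.get(key)
```

Assigning the result of `dereference("handedness", "2")` rather than the raw string `"2"` is what lets the update actually change a foreign key field.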

Signbank / Global-signbank

CSV Import #897
