Signbank / Global-signbank

An online sign dictionary and sign database management system for research purposes. Developed originally by Steve Cassidy. This repo is a fork for the Dutch version, previously called 'NGT-Signbank'.
http://signbank.cls.ru.nl
BSD 3-Clause "New" or "Revised" License

CSV Import #897

Open susanodd opened 1 year ago

susanodd commented 1 year ago

Make CSV Import more parametric. The Python code is very long-winded. Add parameters so the user can, e.g., toggle whether fields should be erased or ignored when the corresponding cell in the CSV is empty.

susanodd commented 1 year ago

import-csv-update-notes-toggle

susanodd commented 1 year ago

Okay, I just recalled that it's the Tags that need to be this way. It probably needs to be a list of toggles so they can be set independently.

susanodd commented 1 year ago

TO DO: After the modifications to field choices, model translations, semantic fields, derivation history, and handshapes to make use of foreign keys rather than field choice foreign keys, the Import CSV Update code for updating these fields has not yet been modified to work with the new models. They all used to be field choices. So at the moment, if the user has the columns 'Semantic Field', 'Derivation history', 'Strong Hand', 'Weak Hand' in their CSV file, these are skipped. Normally the user removes all columns except those to update. But if these columns have not been removed, they are ignored so as not to generate an error.
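As a rough illustration of the skipping behaviour described above (not the actual code in `signbank/tools.py`), the column filter could look like the following sketch. The names `SKIPPED_COLUMNS` and `split_columns` are hypothetical, and the example header is made up.

```python
# Columns that the update step currently ignores (taken from the list above).
SKIPPED_COLUMNS = {"Semantic Field", "Derivation history", "Strong Hand", "Weak Hand"}


def split_columns(header):
    """Split a CSV header into columns to process and columns to skip.

    Hypothetical helper: skipped columns are dropped silently instead of
    raising an error, mirroring the behaviour described in the comment.
    """
    to_update = [column for column in header if column not in SKIPPED_COLUMNS]
    skipped = [column for column in header if column in SKIPPED_COLUMNS]
    return to_update, skipped


# Example header (made up for illustration).
to_update, skipped = split_columns(["Signbank ID", "Strong Hand", "Tags", "Notes"])
print(to_update)  # ['Signbank ID', 'Tags', 'Notes']
print(skipped)    # ['Strong Hand']
```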

[This is a side effect of #658. Nothing to do with being hard-coded, everything to do with the introduction of new models.]

susanodd commented 1 year ago

@ocrasborn I have implemented your wish to be able to remove tags and add notes.

I added some toggles to choose the semantics of an empty cell, as well as whether a value in the notes column should replace existing notes, or be added to existing notes.
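To make the semantics of the toggles concrete, here is a minimal sketch of how they could be applied to a single cell. The function and parameter names (`resolve_cell`, `merge_notes`, `erase_when_empty`, `replace`) are assumptions for illustration; the real implementation on the branch is considerably more involved.

```python
def resolve_cell(existing, new_cell, erase_when_empty):
    """Return the value a field should have after the update.

    An empty CSV cell either erases the existing value or leaves it
    untouched, depending on the toggle.
    """
    new_value = new_cell.strip()
    if not new_value:
        return "" if erase_when_empty else existing
    return new_value


def merge_notes(existing_notes, new_notes, replace):
    """Either replace the existing notes or append the new ones to them."""
    if replace:
        return list(new_notes)
    # Append only notes that are not already present.
    return list(existing_notes) + [n for n in new_notes if n not in existing_notes]


print(resolve_cell("old-tag", "", erase_when_empty=True))    # prints an empty line
print(resolve_cell("old-tag", "", erase_when_empty=False))   # old-tag
print(merge_notes(["first note"], ["second note"], replace=False))
# ['first note', 'second note']
```

Keeping one toggle per field type (tags, notes, and so on) lets each be set independently.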

Here are screenshots of the toggles and the identified changes. In this example, the settings and the CSV make it possible to remove the old tag and add a new note to the existing notes.

settings-csv-update-notes-tags

update_csv_remove_tags_add_notes

susanodd commented 1 year ago

This has been pushed to branch csv_interface

The code is very intricate but it works. I'm not sure whether anybody will want to read the code to review it. It doesn't affect other code and is self-contained. I believe it can be merged without breaking anything.

In order to "parse" the new notes and compare them to the existing notes to determine whether they have been updated, the note name is first mapped to the note machine value, then the notes are parsed, and then the machine value is mapped back to the name. Here is the code which splits the notes of the csv cell: https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1035-L1054 This was necessary because of the text field in notes. The text that users have written also contains punctuation, which messes up simple parsing. Some of the note names contain parentheses.

The code has been tested by exporting all of the NGT glosses to csv and then importing the file again as an update. To ensure that the "new" notes syntax matches the "original" notes syntax of the csv, the export to csv was modified to make it work correctly. Namely, the sorting was causing problems. Now the export to csv uses the same sorting and tuple reordering as the Import CSV Update.

There are also some glitches because the note names include both "Note" and "Project Note", which caused problems because one is a substring of the other. There is a reverse sort on the name field of a NoteType field in order to avoid a wrong match. https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1009-L1010 https://github.com/Signbank/Global-signbank/blob/146000dd0649e9bacb570c6f697c1e56d099aa88/signbank/tools.py#L1108
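The substring problem can be reproduced in a few lines. This is only a sketch: the machine values, the `name: text` cell syntax, and the function name are made up for illustration and do not reflect the real NoteType data or the csv notes format.

```python
# Hypothetical mapping of note type names to machine values.
NOTE_TYPES = {"Note": "1", "Project Note": "2"}


def names_to_machine_values(cell, note_types, reverse=True):
    """Replace each note type name in the cell by its machine value.

    With reverse=True the names are processed in reverse sorted order,
    so "Project Note" is handled before its substring "Note".
    """
    for name in sorted(note_types, reverse=reverse):
        cell = cell.replace(name + ":", note_types[name] + ":")
    return cell


cell = "Project Note: check video, Note: old form"
print(names_to_machine_values(cell, NOTE_TYPES))
# 2: check video, 1: old form
print(names_to_machine_values(cell, NOTE_TYPES, reverse=False))
# Project 1: check video, 1: old form   (the wrong match)
```

A reverse sort is enough for these two names. Sorting by length, longest first, would be a more general variant, but that is not what the linked code does.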

susanodd commented 1 year ago

@ocrasborn The recent changes for Notes and Tags as shown above are live now.

susanodd commented 1 year ago

I'm busy revising the CSV import code. Some irrelevant obsolete code has been removed, along with extra (now unnecessary) things related to field choice fields.

To assist in debugging the code, I exported the NGT dataset (signs older than 2017, because I was initially interested in missing language fields, since the default is now defined as English; see fragments below).

https://github.com/Signbank/Global-signbank/blob/fbaabbd3ed80e8612c32523e169698cfde8cfd12/signbank/tools.py#L1622-L1628

https://github.com/Signbank/Global-signbank/blob/fbaabbd3ed80e8612c32523e169698cfde8cfd12/signbank/settings/server_specific/default.py#L82

But the above is only used on Annotation fields, not on Lemma translations (which are sometimes empty for English for NGT).

So importing the original CSV file now works without bugs. (There were some bugs related to the Django and Python upgrades! Those have been fixed, but the fixes are not on master yet.)

Here are two things that the import of the original file (as an update) gives:


Import CSV Update

    WARNING: For gloss STUDENT-B (4018), new Sequential Morphology value 3819 is duplicate.

    WARNING: For gloss wesseltest4 (3283), new Sequential Morphology value 2317 is duplicate.

Indeed, the first one is a sequential morphology where both components are L. The current code (export and import) handles only the gloss ids of the sequential morphology, not the role (component). Since the export only exported the gloss ids, the import detects this as a duplicate rather than as a sequence (the components of both of those glosses have different labels).
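The false duplicate can be illustrated as follows. The role labels `"1"` and `"2"` and the helper name are assumptions; only the gloss id 3819 comes from the warning above.

```python
from collections import Counter


def duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


# Two components pointing at the same gloss, but in different roles.
# The role labels are hypothetical.
components = [("1", 3819), ("2", 3819)]

# What the export currently writes: gloss ids only, so the role is lost.
ids_only = [gloss_id for _role, gloss_id in components]

print(duplicates(ids_only))    # [3819]  -> reported as duplicate
print(duplicates(components))  # []      -> distinct once the role is included
```

Exporting the role together with the gloss id would let the import tell these two components apart.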

susanodd commented 1 year ago

Related to this, I found this issue:

#351

susanodd commented 1 year ago

I found a gnarly gloss glitch while debugging code.

The gloss ELAN has a tab (\t) at the end of its English annotation field.
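A glitch like this is easy to miss because the tab is invisible in most views. A minimal sketch of normalising a cell before comparing it (the helper name `clean_cell` is hypothetical, not an existing function in the code base):

```python
def clean_cell(value):
    """Strip leading and trailing whitespace, including tabs and newlines."""
    return value.strip()


stored = "ELAN\t"
print(repr(stored))              # 'ELAN\t'
print(repr(clean_cell(stored)))  # 'ELAN'
print(stored == "ELAN")              # False
print(clean_cell(stored) == "ELAN")  # True
```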

susanodd commented 1 year ago

I fixed various bugs in the csv gloss export/import/update routines:

susanodd commented 1 year ago

I fixed various bugs in Import CSV Update related to Handshape and FieldChoice fields.

The Handshape fields were actually being ignored (glossed over) during import. Apparently, when we were busy modifying field choices and handshapes to have model translations, we just commented out some code that would need to be updated. The FieldChoice fields were also not being "dereferenced" from just a character "machine value" to an actual FieldChoice object reference. In both of those cases, this caused the actual "do changes" step to not update the fields.
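The "dereferencing" step amounts to a lookup from the string in the CSV cell to the object it stands for. The sketch below uses a plain dataclass as a stand-in for the Django FieldChoice model; the class, the field names, and the example choices are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Choice:
    """Stand-in for a FieldChoice row (not the real Django model)."""
    field: str
    machine_value: int
    name: str


# Hypothetical choices for one field.
CHOICES = [
    Choice("handedness", 1, "One-handed"),
    Choice("handedness", 2, "Two-handed"),
]


def dereference(field, raw_value, choices) -> Optional[Choice]:
    """Turn the machine value read from the CSV (a string) into a Choice object."""
    try:
        machine_value = int(raw_value.strip())
    except ValueError:
        return None
    for choice in choices:
        if choice.field == field and choice.machine_value == machine_value:
            return choice
    return None


print(dereference("handedness", "2", CHOICES))
# Choice(field='handedness', machine_value=2, name='Two-handed')
```

Assigning the raw string instead of the looked-up object is the kind of mistake that makes an update step silently do nothing.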

The changes have been deployed. This still needs to be revised for the SemanticField and DerivationHistory fields (multi-select foreign keys to a model). Those are still being skipped.