import CSV to create new lemma's

uklomp commented 9 months ago

Hi, I tried to import a CSV file to create 100 new lemma's at the same time in the NGT dataset. I clicked "Import CSV create" (in the Gebaren menu) and followed the instructions from the manual and the instructions that are shown there, so: to import a CSV file with certain headers (dataset, gloss ID, lemma ID). I then get the error "Incorrect Delimiter: comma." - what am I doing wrong? I'm attaching the file so you can see if something's going wrong in the specifications there. themalijst 23-12-20.csv

susanodd commented 9 months ago

(On my mac, it gives a different error, it can't find the headers.) I put a print statement in the code to debug the function that is looking for the headers.

It looks like the original file has semi-colons separating the columns!

It's necessary to use "Export As" instead of "Save As" so it converts it to a csv if necessary.

susanodd commented 9 months ago

@uklomp If you set it to use Semicolon, it imports okay. It reports two duplicates on row 61.

susanodd commented 9 months ago

@uklomp I see in the code that the Update import CSV functions report back if the wrong delimiter was selected. This code isn't in the Create functions. I'll repair that!

It does work to choose Semicolon.

But it doesn't inform the user that a different delimiter should be used. (This will be available when we put the Django bump version live on 11 Jan.)

uklomp commented 9 months ago

OK. Thanks for your help. Then two follow-up questions:

The comma -file will be useable after the Django update, is that correct?
How to make sure it's separated by semicolons? These are my options if I chose the 'export' function (and not 'save as'):

susanodd commented 9 months ago

@uklomp I guess I didn't phrase my sentence well.

Your file uses Semicolons. That's fine. But you need to choose Semicolon in the radio buttons below where you select the file.

(The default is comma. This is only a problem if you use punctuation inside the cells. Like for sentences or senses. Some extra formatting is needed in your cells in that case.)

Feedback about "the wrong delimiter was selected from the radio buttons" will be shown in the new version.

From the screenshot for export, it looks like you just choose CSV export. (The CSV stands for comma separated.)

Comma separated is fine. Your current file works, but you need to select the Radio button for Semicolon.

uklomp commented 9 months ago

OK, thanks. I did that, and then I saw the same message as you did, about line 61. I don't know where it comes from though, as there is no gloss Radioloog in the NGT dataset, and this is what this error/message means, right? Then I adapted line 61 and apparently saved it the wrong way again because I can't upload it anymore. It's pretty frustrating for me because the instruction seemed so clear, and yet I have spent more than one hour already trying to upload these words, in all kinds of different CSV files with different radio buttons clicked. Can we maybe see if there is an easier way after the Django update you mentioned?

susanodd commented 9 months ago

Yes, agreed! I was confused myself with the non-helpful feedback on the site about the delimiter.

I suspect the duplicate complaint is a duplicate among the new rows. The import routines save data as they progress through the rows in order to be able to create the new glosses and lemmas. If there are "bugs" then this is reported. Otherwise a list of proposed changes is reported, and a button to go ahead and create them.

@ocrasborn is kind of the main user of the CSV features.

uklomp commented 9 months ago

OK... that's good to know. Would it be possible for you to upload the file you had, without this line 61, and that we find out in January how to make this process more easy? Because now I have the right file and the right button and no line 61 but now the column 'dataset' isn't recognized and I don't know why...

susanodd commented 9 months ago

@uklomp Okay, I found the bug (after numerous print statements in the code.)

It's reporting the wrong line as being a duplicate!

The duplicate is Verloskundige It's in row 17 and row 62

susanodd commented 9 months ago

That it reported the wrong row may have something to do with the mysterious semicolons.

If you use LibreOffice, it can work with all files and it allows you to choose various separators upon opening a file. We've been using this because of utf-8 characters. Unfortunately, it's not possible to grab its parsing routines and put them in signbank. If you have trouble on some files, it usually works to open them in LibreOffice and then export them again. If you just use the csv extension, it asks if you want it to be in csv (or some other of its own format). themalijst.23-12-20-fix.csv

The linked file was edited in LibreOffice to remove the row, then saved with ".csv" added as an extension, and then to confirm to use the CSV format. It now has commas instead of semicolons. (Thanks to LibreOffice)

susanodd commented 9 months ago

@uklomp you may want to import the above file yourself, as you can see in the screenshot on my local computer, I am the user that created the glosses.....

csv-created-glosses

susanodd commented 9 months ago

@uklomp I modified the feedback so it gives the duplicate text instead of the row number (This will be on the new version.)

uklomp commented 9 months ago

@susanodd thank you very much for your help! I uploaded the csv you gave me and it worked now. I will try again in January with new lists and hopefully it runs smoothly then :).

uklomp commented 9 months ago

Hi @susanodd - I made a mistake with this file: all the glosses should have been in capital letters. I now tried to update it with 'Import CSV update', I uploaded a CSV file and ticked the radio button 'semicollon', had the right headers etc, and now I get the following error (?), in a red field: Attempt to update Lemma ID Gloss translations for Signbank ID. This error occurs for every gloss that I tried to update, followed by: Row 1: Attempt to update Lemma ID Gloss translations for Signbank ID (45917). Row 2: Attempt to update Lemma ID Gloss translations for Signbank ID (45918). Row 3: Attempt to update Lemma ID Gloss translations for Signbank ID (45919). Row 4: Attempt to update Lemma ID Gloss translations for Signbank ID (45920). etc (for the whole file). What does this mean? It doesn't seem to be doing anything, so not updating...?

susanodd commented 9 months ago

@uklomp It's actually programmed this way, it's not allowed to update the lemmas via the gloss update.

I went through the code and it's intended that the pulldown command Import CSV Lemma is used to change the lemmas, not via the Gloss update. This is because glosses can share lemmas. Updating a lemma during update of the gloss would have a side effect on other glosses that share the lemma.

I exported the lemmas and updated one of them to caps. It works okay.

To accomplish what you what to do,

First, export the lemmas, remove the count of glosses column, update the lemmas you want to change to caps, then Import CSV Lemma to update them.
After that, export the glosses again, then update the annotations to caps.

It seems roundabout, but it makes sure that all glosses sharing the lemma also get the caps.

susanodd commented 9 months ago

I modified the feedback to inform the user to update lemmas using Import CSV Lemma. (The feedback improvement is not deployed yet.)

The lemmas have their own ID that is in the export CSV for lemmas.

It's not possible to switch which lemma a gloss has via the Update CSV, because the lemmas are also objects. (I'll check this to be sure.) The Dataset of a gloss is actually stored in the Lemma it is associated with. (@vanlummelhuizen did the Lemma/Dataset models.)

There are checks to look up the lemmas in the lemma columns. But this code is quite old because of all the error checking on the input.

The hover help texts probably need to explain this better as well as the manual.

uklomp commented 9 months ago

@uklomp It's actually programmed this way, it's not allowed to update the lemmas via the gloss update.

I went through the code and it's intended that the pulldown command Import CSV Lemma is used to change the lemmas, not via the Gloss update. This is because glosses can share lemmas. Updating a lemma during update of the gloss would have a side effect on other glosses that share the lemma.

Ok, that makes sense. Should maybe the function 'Import CSV Update' then be specified in that it only updates Glosses?

I exported the lemmas and updated one of them to caps. It works okay.

To accomplish what you what to do,

First, export the lemmas, remove the count of glosses column, update the lemmas you want to change to caps, then Import CSV Lemma to update them.

After that, export the glosses again, then update the annotations to caps.

It seems roundabout, but it makes sure that all glosses sharing the lemma also get the caps.

Ok, I see that. The problem is that I again can't upload my files because the headers seem to be missing (while they're not). Would it be possible to extend the possibilities to accept files not created by LibreOffice? Or should we start using LibreOffice?

susanodd commented 9 months ago

I'll modify the name of the command. It is confusing. Also to me as I hadn't used it for a while.

It would probably be best to start using LibreOffice until we can revise the import code. There are sometimes options to include headers. (There is actually some Python code that detects the header row. Sometimes it ends up not the first row.) But there are also export options that end up putting "worksheet" headers in the file. (From Excel.) Over the years I've had various files to try to debug what is not working. (We ought to collect those for testing purposes.)

(I don't know the export options for Excel, but make sure it doesn't put any worksheet headers in the csv file. There's some proprietary MS stuff in the files sometimes to support "normal" file sharing.)

You could also export to just "tab separated text files". Those you can edit in a text editor. (Or just copy paste the entire worksheet.)

Although at home, testing things, I have been using Apple Numbers on a quite old laptop. I use Export CSV to save the files. Numbers used to have limits to the number of rows, which used to cause problems. But for testing stuff it's okay.

We have been using LibreOffice because of UTF-8 and other alphabets.

uklomp commented 9 months ago

Hi Susan, my export options are shown here: https://github.com/Signbank/Global-signbank/issues/1108#issuecomment-1865916068 I tried the Text (tab delimited) option now, it is not recognized. I understand your point about the headers, I think the CSV file indeed has workbook headers, but SignBank doesnt recognize the file either with or without these headers, so probably another style of headers is necessary and I wouldn't know which one.

Conclusion: I will download LibreOffice now. We will probably try another system of importing new items later, our colleague Gomer is working on that.

susanodd commented 9 months ago

@uklomp you could try to select all the cells including the header row and then copy and paste them into a new file. (Using the same editor or into LibreOffice.) That will remove any extra garbage your software put into the file. Then when you first save the new file, use a filename dot csv. (Doing this in LibreOffice, it asks if you want CSV format.)

The changes I made to the feedback are on signbank-dev. You could try to import your file there to see if it works better.

The import functions check consistency and types of all the input. Usually you remove all the columns that are not relevant to the import. If you export it to CSV, it should work with the comma delimiter. That is the default.

If it reports that it didn't find the headers, then the delimiter is set wrong. (It read past the end of the file looking for them. The update to report this sooner is on signbank-dev.)

susanodd commented 9 months ago

Some screenshots.

Show all lemmas for the dataset. Export to CSV Open in LibreOffice. Remove the column with number of glosses. Use menu command Opmaak -> Tekst -> HOOFDLETTERS on selected columns.

dataset-oefen-show-all-lemmas

open-exported-lemmas-csv-in-libreoffice

remove-column-number-of-signs

text-converted-to-capitals-menu-opmaak-tekst-hoofdletters

susanodd commented 9 months ago

Here is the resulting lemmas csv. (This can be imported and applied.)

dictionary-export-lemmas-oefen-global.csv

Here is what it looks like (locally) after the lemmas are updated:

show_all_signs_oefen_with_updated_lemmas

susanodd commented 9 months ago

Now export the signs to CSV and make the annotations be in caps.

Here, a LOCAL csv with updated lemmas looks like this. Remove all the columns to the right of the annotations.

csv-export-signs-LOCAL-all-columns

csv-oefen-signs-columns-removed

csv-local-oefen-caps-annotations

susanodd commented 9 months ago

Now the glosses CSV with everything in caps can be imported to update them.

Here is a LOCAL screenshot.

feedback-csv-update-glosses

susanodd commented 9 months ago

I modified the names of the commands in the pull-down menu so that it's clear which is for updating lemmas and which for updating glosses.

uklomp commented 9 months ago

Thanks a lot @susanodd - this is really helpful for understanding the steps. I'm also happy that you modified the names of the commands. There is, however, another problem, which is that it turns out we can't download LibreOffice without permission of ICT. We would need to ask the dedicated ICT person if we can have additional software blabla, and there is the possibility that they refuse our request because they feel it isn't necessary. Before I go this way, I would like to know when you think the import code will be revised (with which I mean: revised so that all CSV files and/or maybe excel sheets are recognized)?

susanodd commented 9 months ago

@uklomp did you try from your software the Change File Type, Save As, CSV?

That ought to make a CSV file. (Comma separated file.)

I showed it as LibreOffice because I use Ubuntu and it's free / open source. I don't have Excel because it doesn't work on Ubuntu.

If you start out with an "export CSV" file made by Signbank, then it is a CSV file. If you open / edit that file in your software, it should remain a CSV file as long as it has the suffix CSV.

On Apple, if I open / edit a CSV file, it saves it as "filename.numbers" but then from that file, I can do Export to CSV. (Not Save As.)

If this doesn't work, can you post or send the problem file? We have the opposite problem on Ubuntu, because we can only use Open Source / free code on it. And on Apple, the Numbers app is included in the OS.

susanodd commented 9 months ago

In the screenshots, I showed removing the columns to the right of the annotations. (Since you are only updating the annotations.) On some files, if there are tabs or weird spaces in the Senses, then this can make the columns messed up and text goes into other columns. So it's best to only leave the columns you are updating in the CSV file.

If in your Spreadsheet software you are using special fonts, then the Save As CSV will make it just normal text without fonts and without "special non-printable white space used in typesetting". (Those are things Microsoft does to files to justify printing. Those need to be removed by Save As CSV.)

The option to Save As Text says in will use tab separaters. That will also work, but when you import the file to Signbank, you need to select Tab as the delimiter. (The tabs are easier to read if you want to use a basic "text" (ascii) editor to inspect the file, instead of a word processor.) (If you get to EAF files, the meta data is in a tab-separated file showing the speaker data. That's totally unreadable if it's comma-separated.)

susanodd commented 9 months ago

We can't make it so excel sheets are recognized. There is proprietary data inside of them. (We can't parse stuff that is say binary.)

susanodd commented 9 months ago

@uklomp we plan to deploy the most recent software on global signbank this afternoon by 4pm. It's only on signbank-dev.

susanodd commented 8 months ago

@uklomp do you need a command to "change the lemma of a gloss to a different lemma" ? At the moment, this needs to be done in the signbank interface in Detail View Edit.

I checked and this is not possible via the Import CSV Gloss Update. (This is because the dataset of a gloss is stored in its lemma. There are checks to make sure users do not move the glosses into a different dataset.)

Such a functionality would need to be a separate import-update command, probably using the Lemma ID of the target lemma.

susanodd commented 8 months ago

@uklomp I went ahead and changed the lemma and annotation texts to caps on global, as described above.

Jetske commented 8 months ago

@uklomp I just tried in Excel by opening an exported dataset CSV file via Data -> 'from text/CSV' and apparently when then saving that in format 'CSV UTF-8 (Comma delimited) (*.csv)' ,the comma's are changed into semicolons. Confusing! The reason for that is, I think, that it takes the default delimiter of Windows, which is indeed a semicolon (depending on your region), instead of Excel's default comma delimiter. That would explain what happened. Other than that, Excel seems to be ok to use.

Before uploading a CSV file to Signbank, you can open it in a simple text editor such as notepad/kladblok to check which delimiter is actually used.

uklomp commented 8 months ago

Thanks for trying as well, Jetske, but I keep on getting the same error, which is that the headers are missing. I'm attaching the file. themalijst zorg csv met titels.csv

uklomp commented 8 months ago

@uklomp I went ahead and changed the lemma and annotation texts to caps on global, as described above.

for the lemma's in the oefenset you mean? thanks so much :) I did want to do that as well, but the glosses I wanted to change at the moment were the glosses that you and I tried to add in batch a few weeks ago, to the NGT dataset.

susanodd commented 8 months ago

@uklomp I changed the annotations of those glosses as well. I see there is a new issue for moving the glosses.

As I understand it, you want the glosses moved? Or do you want them copied to NGT? (Copying would create new IDs for all the objects.)

We can work out both ways. We haven't done this before. Although @ocrasborn has done huge batch things in the past.

susanodd commented 8 months ago

@uklomp I see that there are more replies from you in the email rather than on github.

The file themalijst.zorg.csv.met.titles.csv

Needs to be done in two batches. The lemmas need to be modified separately with the "Lemma ID" column (obtained via lemma export).

susanodd commented 8 months ago

@uklomp I modified the glosses for Zorg to be in caps, with a few exceptions.

They are visible in Recently Added Glosses.

A few were duplicats and it was not possible to toggle the case, such as Avatar, which already appear. (So I removed it from the csv. It's still there with lower case.) And for a couple, I added the word (ZORG) to the end. (Such as Administer. ADMINISTER (ZORG).)

If you look at Recently Added you can see what they look like now. (And the ones that are not caps are probably duplicates. But I did not delete them.) I checked, the Autism is a duplicate, so you want to delete the new one that has no phonology. The original is in caps.

susanodd commented 8 months ago

[SPREADSHEETS]

In doing the above, I came across the "changing to capitals" glitch. With LibreOffice, it converted the values to caps. But on Apple Numbers, it merely changed the "appearance" so when it was exported, they were still lower case. I had to copy the contents of each column to a (simple basic) text editor to convert the text to caps, then copy paste it back in the column.

So beware if your caps in your spreadsheet are merely "appearence" rather than caps.

susanodd commented 8 months ago

The original NGT gloss AUTISME only has a Dutch annotation. The newer gloss (now in caps) has both Dutch and English but no phonology. It's not possible to add English to the original gloss because that text (in caps) is now being used by the newer gloss. (@vanlummelhuizen designed it to be this way.)

UPDATE: I went ahead and deleted the newer AUTISME (the one without phonology). Then added English to the original glosses Lemma. That was easier said that done. First the newer gloss has to be deleted. Then search on Aut (Dutch) to find the lemma (that then had no glosses) and delete the lemma. Then add English to the original gloss.

uklomp commented 8 months ago

Hi Susan, thank you very much for going ahead and changing it all to caps!

susanodd commented 8 months ago

Hi, I tried to import a CSV file to create 100 new lemma's at the same time in the NGT dataset. I clicked "Import CSV create" (in the Gebaren menu) and followed the instructions from the manual and the instructions that are shown there, so: to import a CSV file with certain headers (dataset, gloss ID, lemma ID). I then get the error "Incorrect Delimiter: comma." - what am I doing wrong? I'm attaching the file so you can see if something's going wrong in the specifications there. themalijst 23-12-20.csv

@uklomp I just tried this on signbank-dev and here is the result of importing your original file (themalijst.23-12-20.csv) using Import CSV Create New Glosses:

excel-exported-csv-feedback-thema-zorg

susanodd commented 8 months ago

The delimiters are automatically detected.

susanodd commented 8 months ago

@uklomp This is completed and is live on Signbank.

However, there is an ambiguity. There only exists CSV Create New Glosses command. Via this command, new lemmas are created. [ CODE DETAIL It is necessary because the Dataset is stored in the gloss's lemma So it needs to get a lemma at the same time as creation. In the CSV creation, it gets a new lemma this way. If you need a different lemma, it needs to be set elsewhere to an existing lemma.]

There is a CSV Update Lemmas command. Via this command the text of the lemma can be changed. [CODE DETAIL The lemmas can be shared by glosses in the same dataset so this will have a side effect on other glosses in the same lemma group.]

I modified the CSV Update Lemmas command to also accept the Signbank ID (of the gloss) in order to make it more convenient (See the new command on Signbank and its new description of required columns.) Because, if you first tried the CSV Update Glosses command and got an error that it was not allowed to update the lemma fields. now you can upload that same file using the CSV Update Lemmas command.

(We probably need better names in the pulldown menus for these commands. That can be changed in the Admin, also for different languages.)

susanodd commented 8 months ago

@uklomp Now for the ambiguity. As mentioned above it is not possible to "change" the lemma of the gloss via the CSV. It's only possible to update the lemma text fields.

Probably what is needed is a kind of toggle in the CSV Lemma Update: When a Signbank ID and/or a Lemma ID is provided:

allow to "move" the gloss to a different lemma by updating the Lemma ID (both Signbank ID and Lemma ID are needed)
update the text of the lemma by updating the text fields (only one of the fields is needed)

The difference would be that by 2, the gloss stays in the same lemma group, but the lemma group text is changed. By 1, the gloss would be moved to a different lemma group.

At the moment, only 2 is done. (The lemma is updated, not its glosses.)

Any ideas how best to describe and offer this to the user?

It seems necessary to give feedback when one updates a lemma on which other glosses will be affected.

uklomp commented 7 months ago

Hmm, I'm thinking but I don't have a solution yet.

I was thinking about coming to Nijmegen every once in a while so we could collaborate "live". I think this issue would be suitable for such a live meeting. We could talk about the names in the pulldown menu like you suggested, the explanations/instructions, etc... Do you agree? Maybe we can also make a label 'live collaboration'? :)

susanodd commented 7 months ago

Yes, that's a good idea!

uklomp commented 3 months ago

Book12.csv Book1.csv

Woseseltops commented 3 months ago

As just discussed, there seem to be 2 problems when importing exported CSVs:

MS Excel by default exports Windows-1252, instead of UTF-8. Let's add a try except EncodingError in case loading the default utf-8 fails. Alternatively, we could use the chardet library to detect the encoding before importing. This problem can be reproduced with the file Book1.csv in the previous comment
Because these CSVs really have all columns, all columsn should work correctly. We already saw in the error message that there seems to be a problem for dialects, but there might be more. This problem can be reproduced with the file Book12.csv in the previous comment.

susanodd commented 2 months ago

As just discussed, there seem to be 2 problems when importing exported CSVs:

* MS Excel by default exports `Windows-1252`, instead of `UTF-8`. Let's add a `try except EncodingError` in case loading the default `utf-8` fails. Alternatively, we could use the `chardet` library to detect the encoding before importing. This problem can be reproduced with the file `Book1.csv` in the previous comment

* Because these CSVs really have all columns, all columsn should work correctly. We already saw in the error message that there seems to be a problem for dialects, but there might be more. This problem can be reproduced with the file `Book12.csv` in the previous comment.

I repaired the error for the second file involving Dialects. The upload routine correctly reports feedback for errors now as expected.
For the first one, I modified the code by removing code that attempted to detect the format using Python Magic. (That code was broken. Now the code merely reports that the format is not recognised and does not crash.
I can't work on any Windows formats because I have no way of working with those files. (Libre Office corrects them.)

Signbank / Global-signbank

import CSV to create new lemma's #1108