anki-geo / ultimate-geography

Geography flashcard deck for Anki
https://ankiweb.net/shared/info/2109889812
Other
800 stars 79 forks source link

Normalise use of quotes in data.csv #129

Closed axelboc closed 5 years ago

axelboc commented 5 years ago

A side question that became a much larger and interesting discussion! 😄

The core of the issue is that, on one hand, saving data.csv with a spreadsheet editor like Microsoft Excel or LibreOffice Calc leads to double quotes being removed and/or added to fields (most likely depending on versions and settings). On the other hand, Anki Deck Manager, through the composer index command, doesn't wrap all fields in quotes but ends up wrapping a lot more than is technically needed by the CSV format.

Contrary to your experience, @ohare93, I found that Excel Pro 2013 doesn't wrap all fields in quotes. On my machine, it keeps the bare minimum amount of quotes, just like LibreOffice as per @aplaice's experience. Which version of Excel are you using?

If we find that most editors behave consistently by removing all unneeded quotes, the problem lies with AnkiDM. If so, I'll open an issue in the project's repo and then commit a normalised version of data.csv with most quotes removed.

ohare93 commented 5 years ago

@axelboc hmm, I am using the latest version of Excel, however I now find that I am getting a different result depending on which computer I use.

I think the most important thing surrounding this issue is to make sure that we each do not put in pull requests with a bunch of useless changes, such as changing which fields are covered by quotes. Such useless changes will serve to hide the real changes, and make comprehension of each other's work more difficult due to the noise in the diffs. This is the reason why I complained about this in the first place, as at the time I thought it was the only way I could use Excel/LibreOffice, by converting the quotes, and I did not want to do so 😅

Now that we know that AnkiDM is what sets the quotes to their current setup, this is not technically an issue. Though it is indeed a chore to run composer index each time after making a change, and it would be prefered to have a more standard method 👍 I don't know how active AnkiDM's creator will be about "fixing" this, but he tells me he is working on an AnkiDM successor called DataMincer just now 😉 Though it is technically a much larger program, in which AnkiDM is merely a plugin, so perhaps he can still help 👍

aplaice commented 5 years ago

(In brief: php's fputcsv is badly designed.)

Some digging (to the extent that I understand PHP, which is low) suggests that these lines and in particular the fputcsvs, are "to blame" for anki-dm's overzealous (or under-zealous, depending on your point of view) enclosing of fields.

(For a simple demonstration, you could put the following in some php file, say test_fputcsv.php:

<?php
$file = fopen("test.csv","w");

fputcsv($file, array ("a","b","c d"));

fclose($file);
?>

and execute it with php -f test_fputcsv.php.)

Unfortunately, fputcsv is a built-in function and there doesn't seem to be a straightforward way to override its behaviour, one way or another.

There are some workarounds here:

https://stackoverflow.com/questions/2489553/forcing-fputcsv-to-use-enclosure-for-all-fields

and possibly here:

https://stackoverflow.com/questions/2514597/php-fputcsv-and-enclosing-fields

(No idea how feasible they are, as I have no experience with PHP.)


Obviously, before we request for changes to anki-dm, we'd need to decide on which convention is preferred (or whether to stick with the current situation where you edit manually or re-run composer index after editing with a spreadsheet).

axelboc commented 5 years ago

Arf, I've noticed another problem with using Excel... By default, it doesn't understand UTF-8 encoding in CSV files and doesn't save them back to UTF-8, so it sucks for German characters! I found some workarounds, but nothing practical...

aplaice commented 5 years ago

By default, it doesn't understand UTF-8 encoding in CSV files

Probably adding a byte-order mark 0xEF 0xBB 0xBF to the start of the CSV file would solve this, though I don't have excel to test this.

This might be among the impractical workarounds you mentioned, since it does have many side-effects — many editors will eat the BOM upon re-saving, as will libreoffice, according to my test just now. anki-dm at least, could be patched to prepend the BOM "manually". (I know that python has the utf-8-sig encoding which automatically prepends the BOM, but I don't think that PHP does.)

FWIW LibreOffice doesn't suffer from this problem, though.


I'm not sure if it's useful, but I've written a python test that can detect whether a CSV is "correctly" normalised/quoted. It can currently check for compliance with any of the options (depending on what, if anything, we settle on):

  1. The current mostly-followed convention (fields containing a comma, space, quotation mark or new line need to be quoted).

  2. Minimal quoting (only fields containing a comma, quotation mark or new line need to be quoted).

  3. Full quoting (all fields need to be quoted).

I had an idea that there might be a Travis "build" that would check whether src/data.csv is still normalised after a pull request. I'm not sure whether this wouldn't be over-kill and whether it might not just scare-off contributors (seeing a "failed build" message is rather discouraging, and even current master would fail option 1., because the tags in the Kazakhstan row aren't quoted...).

aplaice commented 5 years ago

I've also written a hackish "proof-of-concept" patch for anki-dm to use minimal quoting (option 2. above) when indexing and creating data.csv. Adapting it to option 3., instead, would be pretty straightforward.

I haven't opened it as a pull-request upstream at https://github.com/OnkelTem/anki-dm since I'm not sure whether we will actually want to switch quoting type, but at least changes to anki-dm shouldn't be a major road-block if we do decide to switch. (Well, unless OnkelTem decides that my PHP code is completely abominable, which is quite possible, since I've never really written PHP.)

ohare93 commented 5 years ago

@axelboc the latest versions of Excel have UTF-8, at least. Which version are you using? I am using the version bundled with Office 365.

image

If anyone accidentally saves a file without UTF-8 they should notice from the change log, that some of the German characters change. Right? So people can use whatever software suits them 👍

axelboc commented 5 years ago

I'm using Excel 2013. I managed to open the file as UTF-8 with the BOM character trick, but I don't have the option to save it back to UTF-8 in this version, so the BOM character gets removed and all the German characters get the wrong encoding. Anyway, not a big deal, my version of Excel clearly sucks, so I'll use LibreOffice, a VSCode extension or some other CSV editor in the future.

When it comes to the quotes, I did think of adding a CI pipeline, but yeah, it would suck for a new contributor to see a failed build on their first pull request...

I think we should move forward with requesting a change to Anki-DM. My vote goes to minimal quoting, but if you both prefer full quoting, that's fine by me too. Either way, it's better than the current PHP-way of quoting.

Erim24 commented 5 years ago

I noticed the problem with the quoting as well and to avoid noise in the diffs (as mentioned earlier) I removed them by hand. A quick test with my LibreOffice though showed that there is an option to enclose every field in quotes, which would make it easiest to switch to full quoting

axelboc commented 5 years ago

I've added a section to the contributing guide about running composer index before committing. Let's see what happens on AnkiDM's side.