Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

problems with renaming .csv file extension to .tsv #2278

Open alanfgh opened 4 years ago

alanfgh commented 4 years ago

While the .tsv file extension describes the contents of a tab-separated value file more accurately than the .csv file extension, it raises other issues. For instance, Excel does not recognize the .tsv extension under either "All Excel Files" or "Text Files", so if you look at a directory that contains .tsv files, none of them will show up unless you choose "All Files":

[Screenshot: tsv_invisible]

By default, OpenOffice Calc does not recognize .tsv, either:

https://wiki.openoffice.org/wiki/Documentation/OOoAuthors_User_Manual/Getting_Started/File_formats#Opening_spreadsheets

However, when you open a .tsv file with Calc, it must analyze the file because it displays a wizard that allows you to choose a delimiter and encoding other than the default:

[Screenshot: tsv_in_calc]

jiru commented 4 years ago

What a strange world. :disappointed: I get your point, but I don’t like the idea of sacrificing the transparency of file formats just to satisfy the needs of crappy implementations.

In my last implementation, my intention was to provide TXT as a format suitable for non-technical users who just want to print out sentences, while TSV can be used by power users who want to process data using an algorithm.

Maybe we can work on providing a proper spreadsheet file format like ODS. What is the use case for a user to open the file in a spreadsheet program?

alanfgh commented 4 years ago

ODS is a compressed XML file format, which feels a little heavyweight for this purpose. With tab-separated text, you have the choice of looking at the file in a text editor or in a spreadsheet editor. I suppose you could uncompress an ODS file and look at it in an XML editor, but that would not be nearly as easy.
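To illustrate the "uncompress and look at the XML" route: an ODS file is a ZIP package whose sheet data lives in a member named content.xml. The following is a minimal Python sketch using only the standard library. The function name `read_content_xml` is hypothetical, and the stand-in archive built here contains just enough to exercise the function; it is not a valid ODS document.

```python
import io
import zipfile

def read_content_xml(ods_bytes):
    """Return the content.xml member of an ODS package as text."""
    with zipfile.ZipFile(io.BytesIO(ods_bytes)) as package:
        return package.read("content.xml").decode("utf-8")

# Hypothetical stand-in package: a ZIP with only content.xml,
# so the sketch runs without a real spreadsheet on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("content.xml", "<text:p>Hello.</text:p>")

xml_text = read_content_xml(buffer.getvalue())
print(xml_text)
```

Even in this tiny case the sentence comes back wrapped in markup, which is the readability cost compared with plain tab-separated text.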

I generally use a spreadsheet program to remove duplicate sentences from a list. Let's say I've downloaded a list of 100 Russian sentences to learn. Some of them are going to be linked to multiple English sentences, so I want to choose the English translation I want to keep and delete the other ones (by deleting the rows from the file). Presumably I could do this with a text editor, too, but it's a lot easier to read with a spreadsheet editor because I don't have to look at all the delimiters, and because the contents are aligned by column (though these advantages are slightly offset by the fact that long strings may originally be truncated, requiring me to play with the column widths).
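For comparison, the same cleanup can be roughed out in a few lines of Python. This is only a sketch under assumptions: it assumes each line holds a source sentence and one translation separated by a tab (the real export may have more columns), and it blindly keeps the first translation seen, whereas in the spreadsheet the choice is made by hand. The function name `keep_first_translation` and the sample rows are made up for illustration.

```python
def keep_first_translation(lines):
    """Keep only the first line seen for each source sentence (first column)."""
    seen = set()
    kept = []
    for line in lines:
        # Everything before the first tab is treated as the source sentence.
        source = line.split("\t", 1)[0]
        if source not in seen:
            seen.add(source)
            kept.append(line)
    return kept

# Hypothetical sample: one Russian sentence linked to two English translations.
rows = [
    "Привет.\tHello.",
    "Привет.\tHi.",
    "Спасибо.\tThank you.",
]
result = keep_first_translation(rows)
for line in result:
    print(line)
```

A script like this cannot judge which translation is the better one to keep, which is why the manual, column-aligned view in a spreadsheet is still useful.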

Now that it's possible to download only English sentences, and to load the list into a spreadsheet that fits in memory (though it takes a very long time to load), I have experimented with looking for error patterns throughout the English corpus. But I haven't done that very much yet.

I think it was correct to add the "regression" label because, based on the fact that popular spreadsheets have no conception of a .tsv file extension, it's better to use .csv for these files, even if the name is misleading.

AndiPersti commented 4 years ago

How about using .txt as the extension?

As I understand it:

alanfgh commented 4 years ago

That certainly sounds reasonable. Unfortunately, as jiru noted, it's a strange world, so reasonable doesn't always work. :) If you start Excel, then choose "Open" and go to a directory containing .txt files, they won't show up in the list any more than the .tsv files will. Only .csv files appear.

By contrast, if you start OpenOffice Calc, then choose "Open" and go to a directory containing .txt files, they will show up. But if you try to open one, it will open in OpenOffice Writer, not OpenOffice Calc.

In this imperfect world, there may not be a better file extension than ".csv". And since Tatoeba's download window calls it a "Tab-separated file", at least we're doing our part to let users know what the delimiters are.

agrodet commented 4 years ago

Hold on, hold on. You're going too fast in your analysis. Regression is a big word... Let me go through it point by point:

For instance, Excel does not recognize the .tsv extension under either "All Excel Files" or "Text Files", so if you look at a directory that contains .tsv files, none of them will show up unless you choose "All Files"

That is absolutely correct, since it is designed that way. The fact that Excel puts .csv under "All text files", but not .tsv, is a design choice of the software. It's not the TSV format's fault. If I were to be picky, I would ask whether you can tell me what a .prn file is, since it seems to be a text file... So I don't see a problem with the .tsv extension being under "All files". Pretty logical.

By default, OpenOffice Calc does not recognize .tsv, either. However, when you open a .tsv file with Calc, it must analyze the file because it displays a wizard that allows you to choose a delimiter and encoding other than the default:

Concerning the second part, the same goes for a .csv file. I don't know how your LibreOffice (not OpenOffice :P) is configured, but if I double-click on a .csv file, the Text Import dialog opens before the file does and lets me choose the separator, just as your screenshot shows. Since it is a tab-separated CSV, if you don't choose "Tab" as the separator, things will be messed up.
Concerning the first part, my explanation above holds.

I don't know how people handle CSV files in Excel, but by default, if you try to open a tab-separated CSV file, Excel will get things wrong. The same happens if you select "comma" as the separator in the LibreOffice Text Import dialog. The fact that it "recognizes" the format doesn't matter. You will have to go through the Text Import Wizard and change your settings.
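The effect of a wrong separator is easy to reproduce outside any spreadsheet. Here is a small Python sketch using the standard csv module; the sample row is made up and does not reflect the exact column layout of the Tatoeba export.

```python
import csv
import io

# Hypothetical tab-separated row whose second field contains a comma.
sample = "1\tHello, world.\tBonjour.\n"

# Parsed with the csv module's default comma delimiter: split in the wrong place.
as_comma = next(csv.reader(io.StringIO(sample)))

# Parsed with an explicit tab delimiter: three clean fields.
as_tab = next(csv.reader(io.StringIO(sample), delimiter="\t"))

print(as_comma)
print(as_tab)
```

With the comma delimiter, the row is cut inside the English sentence and the tabs stay embedded in the fields; with the tab delimiter, the three columns come out intact. The file extension plays no part in either result.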

Finally, "recognize" has no real meaning. Everything is "default" and "configured" stuff. Most of the time we forget what we configured and what the default is. Take your .tsv file, for example. Right-click on it > Open with > Choose to always open with Excel, and magically Excel will open it, and properly (unlike a tab-separated CSV). The same works with LibreOffice, although you might need to browse to the program's .exe, depending on which programs Windows suggests.

If you start Excel, then choose "Open" and go to a directory containing .txt files, they won't show up in the list any more than the .tsv files will. Only .csv files appear.

I think .txt files will appear if you select "All text files", no? If you choose to open it, the Text Import wizard will appear to ask you what delimiter you want to use. The same holds for LibreOffice.

To summarize my ideas:
I can see the difference of burden between a fake CSV and a TSV.

there may not be a better file extension than ".csv".

There is: .tsv :P By providing both .tsv and .txt, I cannot see a case we do not cover, except "This software doesn't open the file when I double-click it". Well, that's configuration; there's nothing we can do. If I don't know how to handle a TSV file, I use the .txt file, or I ask around. If I'm on a Mac and I want to open a .tsv file with the default "Excel-like" software, I guess I will use Numbers. Numbers opens the .tsv file correctly. It doesn't even need a "Text Import" dialog (but it lets you display one in case it guessed wrong). So if I were a hardcore Mac supporter (LOL), I would tell you to switch to Mac! (Thank god, I still have my full reasoning capabilities...)

And since Tatoeba's download window calls it a "Tab-separated file", at least we're doing our part to let users know what the delimiters are.

It looks amateurish. We're not the client who does not know anything about file formats or how computers work. We are the provider that is supposed to know what it's doing. We provide the correct solution. If the incorrect solution has better value, this needs to be demonstrated clearly and without a doubt. I think I gave enough explanation(s) above about why the point is not clearly demonstrated. If I'm mistaken, of course, I will take back the erroneous argument(s).

jiru commented 4 years ago

@alanfgh Thanks for explaining your use cases.

ODS is a compressed XML file format, which feels a little heavyweight for this purpose. With tab-separated text, you have the choice of looking at the file in a text editor or in a spreadsheet editor.

But this comes at the expense of file type misrecognitions, which is the very topic of this issue. That’s why I think we should avoid this approach of trying to provide a one-size-fits-all file format. Note that I’m not talking about replacing CSV/TSV with ODS, but making ODS available as an additional download format.

I generally use a spreadsheet program to remove duplicate sentences from a list. Let's say I've downloaded a list of 100 Russian sentences to learn. Some of them are going to be linked to multiple English sentences, so I want to choose the English translation I want to keep and delete the other ones (by deleting the rows from the file). Presumably I could do this with a text editor, too, but it's a lot easier to read with a spreadsheet editor because I don't have to look at all the delimiters, and because the contents are aligned by column (though these advantages are slightly offset by the fact that long strings may originally be truncated, requiring me to play with the column widths).

Note that a well-styled spreadsheet file could likely remove the need to adjust the column widths initially. It should be possible to set the column widths to a reasonable value and to make the text wrap onto multiple lines within its cell (so that it’s never truncated). So +1 for the spreadsheet file answering your use case.

May I ask what you do with the file after you've removed the duplicates? For example: do you print it out, import it into Anki or similar, or anything else?

Now that it's possible to download only English sentences, and to load the list into a spreadsheet that fits in memory (though it takes a very long time to load), I have experimented with looking for error patterns throughout the English corpus. But I haven't done that very much yet.

As opposed to list downloads, that export file is tailored for developers and tech-savvy users, so I think it’s out of the scope of this issue. Nevertheless, I’d like to ask you: what’s the advantage of this method compared to a regular search on the website?