Input musn't be finnicky #7

Closed timdiels closed 3 years ago

In GitLab by @timdiels on Apr 3, 2019, 16:32

Column names with " in the name, a row with fewer columns than the rest, strange line endings, trailing tabs, consecutive tabs, UTF16-LE BOM encoding, dos line endings. All these weird cases should either be supported or yield an error which tells the user exactly what's wrong and how to fix it. Plus it shouldn't look like a traceback, users don't read those because that's not how you show errors to users, that's just for devs.

So:

[ ] Handle various encodings
[ ] Handle strange line endings
[ ] Whitespace is insignificant, i.e. it can't break things
[ ] User friendly error messages for common input errors

In GitLab by @timdiels on Apr 3, 2019, 17:02

Python's utf-8-sig encoding (to be used on open) may solve BOM problems, not sure whether it solves UTF16-LE though. dos2unix does convert UTF16-LE to BOM though.

But perhaps simply using YAML will support UTF16-LE etc out of the box, or at least provide helpful errors. \t as separator was a mistake, it should be replaced by something users can see when reviewing their input for errors.

It probably also helps to be explicit about rows, e.g. a likely format for a matrix is

[[null,   'col1', 'col2'],
 ['row1', 123.45, 123.45],
]

This format seems easiest for a user to spot errors in. Realistically the columns won't be aligned though, and perhaps we should have a header: ['col1', 'col2'] to avoid the initial null. YAML is also easier to create files for programatically.

E.g. this is harder to give a user-friendly error for:

[null, 'col1', 'col2',
 'row1', 123.45,
 'row2', 123.45,
 'row3', 123.45,
]

A column is missing but instead it is interpreted as:

[null, 'col1', 'col2',
 'row1', 123.45, 'row2',
 123.45, 'row3', 123.45,
]

and likely it will complain about 'row2' not being a float, which is more puzzling than 'row 1 should have 3 columns, it has 2'. Or worse we might not catch it at all and some random error is triggered in the middle of the program.

In GitLab by @timdiels on Apr 3, 2019, 17:05

Some input errors can be fixed with a FAQ that step by step helps users fix the problem, ideally though we rely more so on our error messages giving them the steps to fix the issue.

Note that people often start from xls or csv and that is actually the cause of UTF16 shenanigans and whitespace issues. The FAQ should cover how to go from xls to YAML.

In GitLab by @timdiels on Apr 3, 2019, 18:34

Sometimes integers are used as gene names, or the name starts with a number. This too should be possible; or the FAQ should mention to prefix those numbers with a letter. Sometimes pandas tries to be smart and loads a gene/row name or condition/column name as a number whereas we always want to treat them as strings. In YAML it's also possible that these will be loaded as numbers instead of as strings, we could convert these to str explicitly, similarly we could convert values explicitly to numbers; but better be explicit and force it upon the user to properly quote or not quote values as otherwise this automatic conversion might mask real errors (does it though?). Edit: we could convert with warning.

In GitLab by @timdiels on Apr 3, 2019, 18:35

Do we allow for an index name? If so then [['gene', 'col1', 'col2'],... would make sense instead of having a separate header var for it.

In GitLab by @timdiels on Apr 3, 2019, 18:44

Try each of the following input sets and see if something goes wrong, they caused trouble in coexpnetviz v2.0.0.dev2: look for the old_coexpnetviz_bugs dir, see the website wiki for the full path.

In GitLab by @timdiels on Apr 4, 2019, 11:09

open() can handle any newlines thanks to universal newline mode which is on by default. Newlines to our code will always appear as \n.

open() defaults to using the default encoding for the platform, we should specify an encoding explicitly for consistent behaviour across platforms. pyyaml does not deal with encoding, it only accepts a stream (a file object) or str. utf-8-sig turned out to be unnecessary and it still can only read utf8; in my test utf8 can read files with a BOM just fine, in fact utf-8-sig can cause trouble when writing to existing files, so avoid it unless you want to write with BOM.

Still, in the test battery open() and open(encoding='utf-8') will handle only utf8. By autodetecting it we can handle all reasonable encodings.

My test battery includes: unix line endings, dos line endings, dos and unix line endings mixed, whitespace (spaces/newlines), whitespace with tabs, utf16-le, utf16-le with bom, utf8, utf8 with bom, utf16-le with bom, utf16-le with bom nulchars and whitespace, nulchars (hex 00). Each contains a bit of yaml which I parse with yaml.load (pyyaml).

With the below final script, we correctly parsed: dos.yaml incorrect_line_endings.yaml latin1.yaml utf16_le_bom_dos.yaml utf16le_bom.yaml utf8_bom.yaml utf8.yaml whitespace_tabs.yaml whitespace.yaml. chardet does not support any utf16 without a BOM (it detects ascii), in fact even vim displays it as a garbled mess interpreting it as utf-8, so this is unlikely to form a problem in practice; technically utf16 without a bom is allowed but it's very difficult for a program to guess which encoding it is in without a bom unless it's utf-8.

Nulchars result in this error:

unacceptable character #x0000: control characters are not allowed
  in "nullchars.yaml", position 27

Simply add to the FAQ that in this example the 27th character in the file (not counting newlines) is not allowed in YAML files.

The script used for testing:

from yaml import load, CLoader
import sys
from pprint import pprint
from chardet import detect
for file_name in sys.argv[1:]:
    print(file_name)
    try:
        with open(file_name, 'rb') as f:
            encoding = detect(f.read())['encoding']
            print('Detected', encoding)
        with open(file_name, encoding=encoding) as f:
            pprint(load(f, CLoader))
    except Exception as ex:
        print(ex)
    print()

In GitLab by @timdiels on Apr 4, 2019, 11:20

What about error messages yielded by yaml? Are they friendly?

[['gene', 'col1', 'col2'],
 ['row1', 23.4, 12.4],
 ['row2', 56.8, 34.5],,
]

yields

while parsing a flow node
did not find expected node content
  in "invalid_yaml/valid.yaml", line 3, column 23

[['gene', 'col1', 'col2'],
 ['row1', 23,4, 12,4],
 ['row2', 56,8, 34,5
]

yields

while parsing a flow sequence
  in "invalid_yaml/valid.yaml", line 1, column 1
did not find expected ',' or ']'
  in "invalid_yaml/valid.yaml", line 5, column 1

While a user may not understand what a flow sequence is, errors point to the line and column at which the error happens, sometimes even with a helpful error such as not finding a ]. So by using yaml we do gain friendlier error messages.

More cases to test with the full application (these need to be handled by us, not yaml):

# using , instead of . as float separator
[['gene', 'col1', 'col2'],
 ['row1', 23,400, 12,400],
 ['row2', 56,800, 34,500],
]

# Numbers as col/row names
[['gene', 1, 2],
 [1, 23.4, 12.4],
 [2, 56.8, 34.5],
]

# Numeric strings as values
[['gene', 'col1', 'col2'],
 ['row1', '23.4', '12.4'],
 ['row2', '56.8', '34.5'],
]
# combined with incorrect float separator
[['gene', 'col1', 'col2'],
 ['row1', '23,400', '12,400'],
 ['row2', '56,800', '34,500'],
]

# Missing a column
[['gene', 'col1', 'col2'],
 ['row1', 23.4],
 ['row2', 56.8],
]

# Extra column
[['gene', 'col1', 'col2'],
 ['row1', 23.4, 12.4, 36.3],
 ['row2', 56.8, 34.5, 23.2],
]

# One row missing a column
[['gene', 'col1', 'col2'],
 ['row1', 23.4],
 ['row2', 56.8, 34.5],
]

In GitLab by @timdiels on Apr 4, 2019, 11:28

We could support includes with just a bit of code. Syntax would be:

baits: !include baits.yaml
...

In GitLab by @timdiels on Apr 4, 2019, 11:37

Alternatively to the CLI options, just hand a config file:

baits: !include mybaits.yaml
matrices:
  - !include matrix1.yaml
  - !include matrix2.yaml

or inlined

baits:
  - gene1
  - gene2
matrices:
  - name: matrix1  # must be unique
    data: [['gene', 'col1', 'col2'],
           ['row1', 1, 2]]
  - name: matrix2
    data: [['gene', 'col1', 'col2'],
           ['row1', 1, 2]]

as well as allow other options which are specified on the command line currently.

For the syntax we can simply refer to the YAML docs, the primer which is the short intro.

In GitLab by @timdiels on Apr 4, 2019, 11:38

CLI is handy too though so keep supporting it but don't test it much, low prio, drop for now if it eats up time.

In GitLab by @timdiels on Apr 5, 2019, 14:00

changed the description

In GitLab by @timdiels on Apr 5, 2019, 14:07

Summary todo of all above:

[ ] Implement the YAML format, dropping support for the tsv format
[ ] FAQ: how to convert from xls or old format to YAML
[ ] Always treat row/col names as str, always treat values as float. Warn if the yaml type doesn't match the expected type, but do continue.
[ ] Test with old_coexpnetviz_bugs and our own test set

In GitLab by @timdiels on Apr 5, 2019, 16:34

Warning non-str row/col names should help with detecting the case where the user forgets either the header row or the column with the row names.

In GitLab by @timdiels on Oct 1, 2020, 23:13

removed milestone

In GitLab by @timdiels on Oct 2, 2020, 13:58

In #3 we mentioned supporting just a cytoscape plugin as interface is sufficient, so all we need to worry about is an invalid baits, gene families or expression matrix file. We can pass those as paths to the cli with a json or yaml file as config, which points to the baits, ... files.

[ ] Write tests for valid/invalid xls, xlsx, ods, csv. Ensure they all show a nice error of some sorts that explains the user sufficiently how to fix the problem.
[ ] Allow numbers as header, or names which start with a number
[ ] Allow both integers and floats as expression values
[ ] Drop the tsv format, it's too tricky for users to get right and tricky for us to support all bom, encoding, line endings, ... and all the possible errors such as double tab, spaces instead of tab, ...

In GitLab by @timdiels on Oct 2, 2020, 16:37

moved to varbio#3

CoExpNetViz / coexpnetviz-python

Input musn't be finnicky #7