Should unspecified optional strings be the empty string or NaN?

dilpath commented 4 years ago

At the moment, there can be NaNs (after pd.read_csv) in optional PEtab string columns, such as observableNames, that, if interpreted as a string, are converted to the string literal 'nan'.

>>> import numpy as np
>>> str(np.nan)
'nan'

An issue can occur in the AMICI plotting functions. This issue can be fixed by replacing

elif model.getObservableNames()[iy] != '':

with

elif model.getObservableNames()[iy] in ['', 'nan']:

to correctly identify unspecified observable names. However, testing for the string 'nan' seems unintuitive, and this fix might cause another issue if an observable is named 'nan'.

Here's a solution, which could be implemented in PEtab, and might resolve the issue in AMICI.

$ cat test_str.csv
observableId    observableName
a_id    a_name
b_id

>>> import pandas as pd
>>> df1 = pd.read_csv('test_str.csv', sep='\t')
>>> df2 = pd.read_csv('test_str.csv', sep='\t')
>>> df2['observableName'] = df2['observableName'].fillna('')
>>> df1
  observableId observableName
0         a_id         a_name
1         b_id            NaN
>>> df2
  observableId observableName
0         a_id         a_name
1         b_id

yannikschaelte commented 4 years ago

in general, all of '', 'nan' (, 'NaN', ...) should be allowed as input in the dataframe and interpreted as "Nothing". I am not sure why the observable name is read in as a NaN here, I would have thought the column is interpreted as string. But apparently not ... . The question would be where to put this conversion. At the moment, we perform little value checking when reading in the csv files in petab, so one could do it in AMICI, or here but then also for other potentially problematic columns ... preferences @dweindl @LeonardSchmiester ?

dilpath commented 4 years ago

I would have thought the column is interpreted as string

The columns are correctly identified as object type, as opposed to the int64 or float64 types of other columns, but missing values are still replaced with NaN in the pandas dataframe. Specifying the column type in the read_csv command also doesn't stop the insertion of NaNs.

>>> df1.dtypes
observableId      object
observableName    object
dtype: object
>>> df3 = pd.read_csv('test_str.csv', sep='\t', dtype={'observableName': str})
>>> df3.dtypes
observableId      object
observableName    object
dtype: object
>>> df3
  observableId observableName
0         a_id         a_name
1         b_id            NaN

The question would be where to put this conversion.

One way would be to replace

PARAMETER_DF_OPTIONAL_COLS = [
    PARAMETER_NAME, NOMINAL_VALUE,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]

at https://github.com/PEtab-dev/PEtab/blob/master/petab/C.py#L78 with

PARAMETER_DF_OPTIONAL_COLS_STR = [
    PARAMETER_NAME,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]
PARAMETER_DF_OPTIONAL_COLS += [NOMINAL_VALUE]

then insert

parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR] = \
    parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR].fillna('')

here https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L37 and here https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L41

Similarly for observables/other tables.

dweindl commented 4 years ago

One option would be to give explicit data types for each column in read_csv. That may also speed up loading larger files. Not sure if this works if some columns are not present. Another problem is, that some columns may be float or object, depending on the specific file (e.g. condition table with all numeric parameters vs. parameter IDs).

Otherwise, checking everything with pandas.isna instead of comparing to '' should also be okay.

FFroehlich commented 3 years ago

fixed in https://github.com/AMICI-dev/AMICI/pull/1296/commits/6dfac95762d4eab46ee9ab9c9657d60f35d1d386 although I really don't think this should be fixed in amici.

Ensuring that names are strings would be a good start and shouldn't be too much trouble.

PEtab-dev / libpetab-python

Should unspecified optional strings be the empty string or NaN? #25