PEtab-dev / libpetab-python

Python package for working with PEtab files
https://libpetab-python.readthedocs.io
MIT License

writer functions do not perform sanity checks + use misleading options for writing tsv #26

Open FFroehlich opened 3 years ago

FFroehlich commented 3 years ago

When having dedicated writer functions such as petab.conditions.write_condition_df etc., I would actually expect them to perform some kind of sanity check on the output. Having pretty bare wrappers around pd.to_csv doesn't seem too helpful, especially since the writers use index=True, which is not consistent with the provided spec and, since the readers don't use index_col=0, leaves an "Unnamed: 0" column in the imported DataFrame.
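A minimal sketch of the round-trip problem with plain pandas (file name and column values are illustrative):

```python
import pandas as pd

# A condition table as a user would typically build it:
# conditionId is an ordinary column, the index is a plain RangeIndex.
condition_df = pd.DataFrame({
    "conditionId": ["condition1", "condition2"],
    "conditionName": ["Condition 1", "Condition 2"],
    "fixedParameter1": [1.0, 2.0],
})

# Writing with index=True (as the current writers do) also stores the
# RangeIndex as an unnamed first column ...
condition_df.to_csv("conditions.tsv", sep="\t", index=True)

# ... so reading without index_col=0 yields a spurious "Unnamed: 0" column.
reread = pd.read_csv("conditions.tsv", sep="\t")
print(reread.columns.tolist())
# ['Unnamed: 0', 'conditionId', 'conditionName', 'fixedParameter1']
```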

dweindl commented 3 years ago

The index mess shouldn't be there, yeah.

> Having pretty bare wrappers around pd.to_csv doesn't seem too helpful

They are there so they can be filled as soon as somebody motivated comes in to do so.

Regarding output checking, I am not sure whether we want to do that by default. I would find it more meaningful to check the complete set of files afterwards, not individual ones. But whether this should happen by default is also up for discussion.

dweindl commented 3 years ago

> since the readers don't use index_col=0

That is on purpose, as the order of columns is arbitrary. All readers should set the correct index by name.

FFroehlich commented 3 years ago

> > since the readers don't use index_col=0
>
> That is on purpose, as the order of columns is arbitrary. All readers should set the correct index by name.

They don't. Also, pd.to_csv will always put the index as the first column. If the readers set the index based on a column name, it doesn't make too much sense to write the index itself to file.

FFroehlich commented 3 years ago

> The index mess shouldn't be there, yeah.
>
> > Having pretty bare wrappers around pd.to_csv doesn't seem too helpful
>
> They are there so they can be filled as soon as somebody motivated comes in to do so.

It is one line of code. If one wants to save time and effort, why bother implementing them in the first place?

> Regarding output checking, I am not sure whether we want to do that by default. I would find it more meaningful to check the complete set of files afterwards, not individual ones. But whether this should happen by default is also up for discussion.

Running checks on the files individually only covers a subset of the checks, which is better than nothing IMHO. I think running a lint on the whole set also makes sense, for both input and output, but is currently done for neither.
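For illustration, a hedged sketch of what linting the whole set could look like via the existing petab.lint module (the YAML path is made up; exact signatures may differ between versions):

```python
import petab

# Load a complete PEtab problem from its YAML description (path is illustrative).
problem = petab.Problem.from_yaml("problem.yaml")

# Whole-problem lint: checks the individual tables and their cross-references;
# lint_problem returns True if any errors were found.
if petab.lint.lint_problem(problem):
    raise RuntimeError("PEtab problem contains errors")
```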

dweindl commented 3 years ago

> If the readers set the index based on a column name, it doesn't make too much sense to write the index itself to file.

Not sure I understand what you mean. If we don't write the index, conditionId will be lost.

FFroehlich commented 3 years ago

> > If the readers set the index based on a column name, it doesn't make too much sense to write the index itself to file.
>
> Not sure I understand what you mean. If we don't write the index, conditionId will be lost.

Okay, I think I am slowly understanding what the issue is: the condition table format used within the petab library and the format in the tsv file are inconsistent with the pandas conversion between the two, which ultimately leads to round-tripping issues between the two formats (resolvable ones, though) and a lot of confusion on my end.

I think there are two options to make this consistent (sketched right after this list):

1) Have conditionId as the index, force it to be the first column in the condition tsv file, write with pandas.to_csv(..., index=True) and load with pandas.read_csv(..., index_col=0)

2) Have conditionId as an ordinary column, allow an arbitrary column position in the condition tsv file, write with pandas.to_csv(..., index=False) and load with pandas.read_csv(...)
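A small pandas-only sketch contrasting the two options (file names are illustrative); each option round-trips on its own, the problem is the current mix:

```python
import pandas as pd

condition_df = pd.DataFrame({
    "conditionId": ["condition1", "condition2"],
    "fixedParameter1": [1.0, 2.0],
})

# Option 1: conditionId as index, written as the first column,
# restored on reading via index_col=0.
condition_df.set_index("conditionId").to_csv("conditions1.tsv", sep="\t", index=True)
option1 = pd.read_csv("conditions1.tsv", sep="\t", index_col=0)

# Option 2: conditionId as an ordinary column, the index is never written,
# column order in the file is arbitrary.
condition_df.to_csv("conditions2.tsv", sep="\t", index=False)
option2 = pd.read_csv("conditions2.tsv", sep="\t")
```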

What is currently happening is a mix of the two approaches, which means that the internal pandas.DataFrame representation of the condition table, where conditionId is the index, differs from the external pandas.DataFrame representation, where conditionId is a column. Accordingly, separate writers/readers are necessary for the internal/external -> internal/external conversions. Currently, petab.conditions.get_condition_df implements a reader for external/internal -> internal, but petab.conditions.write_condition_df implements a writer for internal -> internal (which imho isn't helpful at all). Accordingly, the two do not round-trip, which is what I would have expected.

Even though it is confusing, I do understand the reasons to keep the distinction between the two formats. For that to make sense, I think one should implement a public writer that writes external -> external and a private writer that writes internal -> external and round-trips with the reader. However, I (i) think this should be clearly documented somewhere (docstrings of the readers/writers?) and (ii) bet my ass that someone is going to introduce a bug by using the public instead of the private writer.
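A rough sketch of such a split; the private helper name _write_condition_df is made up here and not part of the actual API:

```python
import pandas as pd


def write_condition_df(df: pd.DataFrame, filename: str) -> None:
    """Public writer sketch: external format in (conditionId as a column),
    external format out (index never written)."""
    df.to_csv(filename, sep="\t", index=False)


def _write_condition_df(df: pd.DataFrame, filename: str) -> None:
    """Hypothetical private writer: internal format in (conditionId as the
    index), external format out; round-trips with the reader."""
    df.to_csv(filename, sep="\t", index=True)
```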

dweindl commented 3 years ago

I am still not completely sure in which case these problems occur. We do test round-tripping for reading/writing in https://github.com/PEtab-dev/PEtab/blob/768eb5b8c145a5de2ee6e054e61120d140de4be8/tests/test_conditions.py#L71-L84

The problem occurs if you pass some self-created DataFrame through petab.get_condition_df? With conditionId set as index or not?

FFroehlich commented 3 years ago

> I am still not completely sure in which case these problems occur. We do test round-tripping for reading/writing in
> https://github.com/PEtab-dev/PEtab/blob/768eb5b8c145a5de2ee6e054e61120d140de4be8/tests/test_conditions.py#L71-L84
>
> The problem occurs if you pass some self-created DataFrame through petab.get_condition_df? With conditionId set as index or not?

No, petab.get_condition_df is fine; the issue is with petab.write_condition_df:

```python
condition_df = pd.DataFrame(data={
    CONDITION_ID: ['condition1', 'condition2'],
    CONDITION_NAME: ['Condition 1', 'Condition 2'],
    'fixedParameter1': [1.0, 2.0],
})
```

Does not round-trip, and that's what I would expect a user to pass to this function, not

```python
condition_df = pd.DataFrame(data={
    CONDITION_ID: ['condition1', 'condition2'],
    CONDITION_NAME: ['Condition 1', 'Condition 2'],
    'fixedParameter1': [1.0, 2.0],
}).set_index(CONDITION_ID)
```

i.e. you would have to use petab.write_condition_df(petab.get_condition_df(condition_df), ...) in order to write in the correct format.
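Spelled out with a target file, that workaround would look roughly like this (file name illustrative):

```python
import petab

# Normalize the user-built table to the internal representation
# (conditionId as index) before writing, so the file matches the spec.
petab.write_condition_df(petab.get_condition_df(condition_df), "conditions.tsv")
```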

dweindl commented 3 years ago

Alright. That means changing petab.write_condition_df to use df.to_csv(..., index=not isinstance(df.index, pd.RangeIndex)). That should address the main issue. I'd discuss when and where to validate separately.
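A minimal sketch of that change, keeping the rest of the writer untouched:

```python
import pandas as pd


def write_condition_df(df: pd.DataFrame, filename: str) -> None:
    """Sketch of the proposed fix: write the index only when it carries
    information, i.e. when it is not a plain RangeIndex."""
    with_index = not isinstance(df.index, pd.RangeIndex)
    df.to_csv(filename, sep="\t", index=with_index)
```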

dweindl commented 3 years ago

Or maybe, more generally, resetting the index and using index=False. Also fine.
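As a sketch, that variant could look like:

```python
import pandas as pd


def write_condition_df(df: pd.DataFrame, filename: str) -> None:
    """Sketch of the reset-index variant: move a named index back into the
    columns and never write the pandas index itself."""
    if not isinstance(df.index, pd.RangeIndex):
        df = df.reset_index()
    df.to_csv(filename, sep="\t", index=False)
```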

dilpath commented 2 years ago

Possibly resolved by #126