PEtab-dev / PEtab

PEtab - an SBML and TSV based data format for parameter estimation problems in systems biology
https://petab.readthedocs.io
MIT License
59 stars 12 forks source link

False positive invalid condition ID. #477

Open FFroehlich opened 3 years ago

FFroehlich commented 3 years ago

Which problem would you like to address? Please describe. Linting complains about invalid condition IDs. I believe the error is due to the fact that the condition id is numeric only. Ideally the error would be a bit more informative what the actual issue is with the condition ID.

Describe the solution you would like Change the spec to be more accurate about identifier restrictions or change linting to be more permissive about identifiers.

Describe alternatives you have considered Don't use numbers in identifiers.

Additional context Can be checked via check_condition_df(pd.DataFrame({petab.CONDITION_ID: ['0']}).set_index(petab.CONDITION_ID), None).

Not sure how much sense it makes to print line numbers when operating with a DataFrame.

Typehints in check_condition_df are inconsistent with the function signature, i.e., sbml_model is not and optional argument.

dweindl commented 3 years ago

Linting complains about invalid condition IDs. I believe the error is due to the fact that the condition id is numeric only.

Correct.

Change the spec to be more accurate about identifier restrictions or change linting to be more permissive about identifiers.

https://petab.readthedocs.io/en/latest/documentation_data_format.html#overview says:

All identifiers must consist only of upper and lower case letters, digits and underscores, and must not start with a digit.

But it seems to be overlooked to easily in the preamble. So maybe better to include that sentence with for every ID column.

Not sure how much sense it makes to print line numbers when operating with a DataFrame.

Dataframe row index preferred? Not really sure which one is more helpful...

Typehints in check_condition_df are inconsistent with the function signature, i.e., sbml_model is not and optional argument.

:+1:

FFroehlich commented 3 years ago

Not sure how much sense it makes to print line numbers when operating with a DataFrame.

Dataframe row index preferred? Not really sure which one is more helpful...

I think printing the regex it needs to match would be more helpful than any line numbers. The row index would be equal to the offending ID here so that doesn't really add anything new. I don't think any further info is necessary and line numbers are misleading for non-file inputs.

dilpath commented 3 years ago

Line numbers were added in #467 after a couple of people had issues with whitespace-only lines being added to the end of their PEtab TSV files (possibly after exporting from Excel.. not sure), which then caused errors (e.g. #466).

If line numbers are removed, then a check for whitespace-only lines in TSV files/rows in dataframes might be useful, as the ID in the error message may appear blank.

I guess the line numbers aren't always correct for files either e.g. when using multiple parameter subset files that are combined into a single dataframe.

FFroehlich commented 3 years ago

Line numbers were added in #467 after a couple of people had issues with whitespace-only lines being added to the end of their PEtab TSV files (possibly after exporting from Excel.. not sure), which then caused errors (e.g. #466).

If line numbers are removed, then a check for whitespace-only lines in TSV files/rows in dataframes might be useful, as the ID in the error message may appear blank.

I guess the line numbers aren't always correct for files either e.g. when using multiple parameter subset files that are combined into a single dataframe.

I think when importing from files, it makes sense to print line-numbers, when importing from DataFrames, not so much. Should be possible to parse the input type somewhere.

dweindl commented 3 years ago

I guess the line numbers aren't always correct for files either e.g. when using multiple parameter subset files that are combined into a single dataframe.

Good point.

So how shall we handle that? Drop line numbers? Just print the offending row? It's not too helpful for whitespace-only rows, but they should be relatively easy to spot anyways.

dilpath commented 3 years ago

So how shall we handle that? Drop line numbers? Just print the offending row? It's not too helpful for whitespace-only rows, but they should be relatively easy to spot anyways.

Yes, but if pd.isna() then add a hint in the error message to suggest that the user checks for whitespace-only lines, instead of printing the row, should be sufficient.