cemsbv / pygef

Parse soil measurement data.
https://cemsbv.github.io/pygef
MIT License
31 stars 19 forks source link

read functions should return insightful errors #342

Open tlukkezen opened 1 year ago

tlukkezen commented 1 year ago

The read_cpt and read_bore functions have some "automagical" logic that infers the content of the file argument. The user can provide an object of types io.BytesIO | Path | str and with "engine"="auto", the content type is inferred automatically. This can result in confusing errors when erroneous input is provided.

Some examples:

Providing a non-existing path results in XMLSyntaxError

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

The expectation is to get a FileNotFoundError

Providing a non-existing path and engine="gef" results in ValueError

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef", engine="gef")

Response:

ValueError: The selected gef file is not a cpt. Check the REPORTCODE or the PROCEDURECODE.

The expectation is to get a FileNotFoundError

Providing an erroneous gef file results in XMLSyntaxError while gef can be parsed when forced

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF", engine="gef")

Response:

CPTData(bro_id=None, research_report_date=None, ...

The expectation is to get an error that the gef file is invalid, and this response should be consistent no matter the value for engine.

tlukkezen commented 1 year ago

In the current implementation, we just try to parse the possible contents of file and if it fails we move on to the next. The last parsing option that fails (parsing as bro-xml by default) is the one that provides the error. This is random behaviour and is the reason that the error that is returned doesn't reflect the actual problem.

I see the following resolutions now:

RDWimmers commented 1 year ago

I see two options now, both breaking changes:

  • Require more content about the file argument (e.g. with a string_content=["path","file"] argument, but that would only have a meaning for file arguments of type str (kinda ugly)
  • As pandas.read_csv() does: Accept str | io.StringIO types for file and always infer the str type as a Path, and the io.StringIO as file content. This would require the user to convert a string to StringIO object

I like the second one. we can set the type of filepath_or_buffer to Path | io.StringIO. then we dont use strings and can use the Path.is_file() method.

tversteeg commented 1 year ago

When a string has the form of a path separated by slashes or even a single short word it can never be a valid XML or GEF file right? All XML files need to start with a < character, so that's easy, and all GEF files have the form of key: value, so I think a proper heuristic would be:

Is almost certainly path if:

  1. Is the length <= 255 characters
  2. Does it end with case-insensitive .gef or .xml
  3. Does it not contain any < or : characters
  4. Can it be opened from disk