read functions should return insightful errors

tlukkezen commented 1 year ago

The read_cpt and read_bore functions have some "automagical" logic that infers the content of the file argument. The user can provide an object of types io.BytesIO | Path | str and with "engine"="auto", the content type is inferred automatically. This can result in confusing errors when erroneous input is provided.

Some examples:

Providing a non-existing path results in `XMLSyntaxError`

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

The expectation is to get a FileNotFoundError

Providing a non-existing path and `engine`="gef" results in `ValueError`

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef", engine="gef")

Response:

ValueError: The selected gef file is not a cpt. Check the REPORTCODE or the PROCEDURECODE.

The expectation is to get a FileNotFoundError

Providing an erroneous gef file results in `XMLSyntaxError` while gef can be parsed when forced

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF", engine="gef")

Response:

CPTData(bro_id=None, research_report_date=None, ...

The expectation is to get an error that the gef file is invalid, and this response should be consistent no matter the value for engine.

tlukkezen commented 1 year ago

In the current implementation, we just try to parse the possible contents of file and if it fails we move on to the next. The last parsing option that fails (parsing as bro-xml by default) is the one that provides the error. This is random behaviour and is the reason that the error that is returned doesn't reflect the actual problem.

I see the following resolutions now:

Require more content about the file argument (e.g. with a string_content=["path","file"] argument, but that would only have a meaning for file arguments of type str (kinda ugly) (Breaking change)
As pandas.read_csv() does: Accept str | io.StringIO types for file and always infer the str type as a Path, and the io.StringIO as file content. This would require the user to convert a string to StringIO object. (Breaking change)
Return a general custom exception (e.g. CPTParsingError) when all parsing options have failed. Although this will still return a ValueError for an erroneous gef-file path.

RDWimmers commented 1 year ago

I see two options now, both breaking changes:

Require more content about the file argument (e.g. with a string_content=["path","file"] argument, but that would only have a meaning for file arguments of type str (kinda ugly)

As pandas.read_csv() does: Accept str | io.StringIO types for file and always infer the str type as a Path, and the io.StringIO as file content. This would require the user to convert a string to StringIO object

I like the second one. we can set the type of filepath_or_buffer to Path | io.StringIO. then we dont use strings and can use the Path.is_file() method.

tversteeg commented 1 year ago

When a string has the form of a path separated by slashes or even a single short word it can never be a valid XML or GEF file right? All XML files need to start with a < character, so that's easy, and all GEF files have the form of key: value, so I think a proper heuristic would be:

Is almost certainly path if:

Is the length <= 255 characters
Does it end with case-insensitive .gef or .xml
Does it not contain any < or : characters
Can it be opened from disk

cemsbv / pygef