Open tlukkezen opened 1 year ago
In the current implementation, we simply try each parser on the contents of file in turn, and if one fails we move on to the next. The last parser to fail (bro-xml by default) is the one whose error is raised. That choice is arbitrary, which is why the returned error often doesn't reflect the actual problem.
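To illustrate the effect, here is a minimal sketch of such a fallback chain. The parsers, their detection rules, and the names read_auto and ToyXMLSyntaxError are hypothetical stand-ins, not the library's actual code.

```python
class ToyXMLSyntaxError(Exception):
    """Hypothetical stand-in for the XML parser's syntax error."""


def parse_gef(content: str) -> dict:
    # Toy rule (assumption): treat anything starting with "#GEFID" as GEF.
    if not content.startswith("#GEFID"):
        raise ValueError("not a valid GEF file")
    return {"format": "gef"}


def parse_xml(content: str) -> dict:
    # Toy rule: XML content has to start with "<".
    if not content.lstrip().startswith("<"):
        raise ToyXMLSyntaxError("Start tag expected, '<' not found")
    return {"format": "xml"}


def read_auto(content: str) -> dict:
    """Try each parser in order and re-raise only the last failure."""
    last_error = None
    for parser in (parse_gef, parse_xml):
        try:
            return parser(content)
        except Exception as exc:
            # Earlier errors are overwritten, so their information is lost.
            last_error = exc
    raise last_error
```

With this structure, passing a mistyped path such as "no/such/file.gef" surfaces the XML error, because the XML parser happens to run last. Nothing in the message hints that the file does not exist.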
I see the following resolutions now:
- Require more information about the file argument (e.g. a string_content=["path","file"] argument). That would only be meaningful for file arguments of type str, which is kind of ugly. (Breaking change)
- As pandas.read_csv() does: accept str | io.StringIO types for file and always treat a str as a path and an io.StringIO as file content. This would require the user to wrap string content in a StringIO object. (Breaking change)
- Raise a general custom exception (e.g. CPTParsingError) when all parsing options have failed. Note that this would still raise a ValueError for an erroneous gef-file path.
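A rough sketch of the custom-exception idea: collect every parser's failure and raise one error that reports them all. Only the name CPTParsingError comes from the discussion. The errors attribute and the parse_with_fallback helper are assumptions made for illustration.

```python
from typing import Callable, Dict


class CPTParsingError(Exception):
    """Raised when every parsing option failed; keeps each parser's own error."""

    def __init__(self, errors: Dict[str, Exception]):
        self.errors = errors
        details = "; ".join(f"{name}: {exc!r}" for name, exc in errors.items())
        super().__init__(f"All parsing options failed ({details})")


def parse_with_fallback(content: str, parsers: Dict[str, Callable[[str], dict]]) -> dict:
    """Try the parsers in order and report all failures together."""
    errors: Dict[str, Exception] = {}
    for name, parser in parsers.items():
        try:
            return parser(content)
        except Exception as exc:
            errors[name] = exc  # remember the failure instead of overwriting it
    raise CPTParsingError(errors)
```

Because the individual errors are preserved, the caller can still inspect what each parser complained about, rather than only seeing whichever one ran last.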
I see two options now, both breaking changes:
- Require more information about the file argument (e.g. a string_content=["path","file"] argument). That would only be meaningful for file arguments of type str (kinda ugly).
- As pandas.read_csv() does: accept str | io.StringIO types for file and always treat a str as a path and an io.StringIO as file content. This would require the user to convert a string to a StringIO object.
I like the second one. We can set the type of filepath_or_buffer to Path | io.StringIO. Then we don't use strings at all and can use the Path.is_file() method.
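A minimal sketch of what that dispatch could look like. The function name load_text and the error messages are hypothetical. Only the Path | io.StringIO typing and the use of Path.is_file() come from the comment above.

```python
import io
from pathlib import Path
from typing import Union


def load_text(filepath_or_buffer: Union[Path, io.StringIO]) -> str:
    """Return the file content, failing early on missing files or wrong types."""
    if isinstance(filepath_or_buffer, io.StringIO):
        # A buffer is always treated as file content.
        return filepath_or_buffer.getvalue()
    if isinstance(filepath_or_buffer, Path):
        # A Path is always treated as a location on disk.
        if not filepath_or_buffer.is_file():
            raise FileNotFoundError(f"No such file: {filepath_or_buffer}")
        return filepath_or_buffer.read_text()
    raise TypeError(
        "Expected a Path or io.StringIO, "
        f"got {type(filepath_or_buffer).__name__}"
    )
```

Since the type alone decides how the argument is interpreted, a missing file raises FileNotFoundError before any parser runs, and a plain str is rejected outright instead of being guessed at.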
When a string has the form of a path separated by slashes, or is even a single short word, it can never be a valid XML or GEF file, right? All XML files need to start with a < character, so that's easy, and all GEF files have the form key: value, so I think a proper heuristic would be:
Is almost certainly a path if:
- it ends with .gef or .xml
- it contains no < or : characters
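A small sketch of this heuristic, assuming the rule reads as "ends with .gef or .xml, or contains neither < nor :" and that either condition is sufficient. The function name looks_like_path is hypothetical.

```python
def looks_like_path(value: str) -> bool:
    """Heuristic: True when `value` is almost certainly a path, not file content."""
    stripped = value.strip()
    # XML content starts with "<", so it is never treated as a path.
    if stripped.startswith("<"):
        return False
    # Rule 1: a known file extension marks the string as a path.
    if stripped.lower().endswith((".gef", ".xml")):
        return True
    # Rule 2: per the reasoning above, content contains "<" (XML)
    # or ":" (GEF key: value lines), so their absence suggests a path.
    return "<" not in stripped and ":" not in stripped
```

One caveat with the second rule: a Windows path such as C:\data\cpt contains a colon, so without a recognised extension it would be classified as content. The extension rule covers the common case, but the heuristic stays a guess.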
The read_cpt and read_bore functions have some "automagical" logic that infers the content of the file argument. The user can provide an object of type io.BytesIO | Path | str, and with engine="auto" the content type is inferred automatically. This can result in confusing errors when erroneous input is provided. Some examples:
Providing a non-existent path results in XMLSyntaxError
Input:
Response:
The expectation is to get a FileNotFoundError
Providing a non-existent path and engine="gef" results in ValueError
Input:
Response:
The expectation is to get a FileNotFoundError
Providing an erroneous gef file results in XMLSyntaxError, while it can be parsed as gef when the engine is forced
Input:
Response:
Input:
Response:
The expectation is to get an error that the gef file is invalid, and this response should be consistent no matter the value for engine.