Load from CSV - Githubissues

kwinkunks commented 3 years ago

Needs improving. For example:

Be less strict about column names, etc. Allow column selection, like usecols in NumPy/pandas
Allow loading from a URL
Handle encodings
Handle bad intervals (ignore or raise an error), and use nulls for missing zones

mtb-za commented 3 years ago

There are a number of cases that should probably be handled:

Only tops are given - bases are inferred to be the next top.
Only bases are given - tops are inferred to be the previous base.
Both bases and tops are given
Either bases or tops are given along with a thickness - the missing value is calculated using the thickness.

Currently we can handle the first and third cases. The second should not be too difficult, and once that is working, both variants of the fourth case become essentially analogous. This should probably happen when we build the list of intervals, rather than being something that the from_csv method handles specially. This will let other from_* methods do the same.
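The four cases above can be sketched as a single pass over the rows before any intervals are built. This is only an illustration, not striplog's implementation: the function name and the 'top', 'base' and 'thickness' keys are hypothetical, and it assumes the rows are ordered by increasing depth, so that base = top + thickness.

```python
def fill_tops_and_bases(rows):
    """Return a list of (top, base) pairs for depth-ordered rows.

    Each row is a dict that may hold 'top', 'base' and/or 'thickness'
    (hypothetical key names). Values that cannot be inferred stay None.
    """
    n = len(rows)
    tops = [r.get('top') for r in rows]
    bases = [r.get('base') for r in rows]
    thicks = [r.get('thickness') for r in rows]

    # Case 4: one of top/base plus a thickness gives the other value.
    for i in range(n):
        if thicks[i] is None:
            continue
        if tops[i] is not None and bases[i] is None:
            bases[i] = tops[i] + thicks[i]
        elif bases[i] is not None and tops[i] is None:
            tops[i] = bases[i] - thicks[i]

    # Case 1: a missing base is the top of the next interval down.
    for i in range(n - 1):
        if bases[i] is None and tops[i + 1] is not None:
            bases[i] = tops[i + 1]

    # Case 2: a missing top is the base of the previous interval.
    for i in range(1, n):
        if tops[i] is None and bases[i - 1] is not None:
            tops[i] = bases[i - 1]

    # Case 3 (both given) needs no work.
    return list(zip(tops, bases))


rows = [{'top': 0.0}, {'top': 10.0}, {'top': 25.0}]
print(fill_tops_and_bases(rows))  # [(0.0, 10.0), (10.0, 25.0), (25.0, None)]
```

Note that with tops only, the last base cannot be inferred, and with bases only, the first top cannot. The sketch leaves those as None so the caller can decide whether to ignore the interval or raise an error.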

One major change that I am making is to explicitly require a top, base and/or thickness column to be specified, unless names=True is passed, in which case it should find them automatically. We are still assuming that a component column or similar exists, which can be used to define the components for the interval.
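One way the automatic lookup might work is a case-insensitive match of the header names against a short list of aliases. The sketch below is an assumption for illustration only: the function name and the alias lists are hypothetical and are not taken from striplog.

```python
# Hypothetical aliases for each depth field; a real list would need discussion.
ALIASES = {
    'top': ('top', 'from', 'start', 'upper'),
    'base': ('base', 'bottom', 'to', 'stop', 'lower'),
    'thickness': ('thickness', 'thick'),
}


def find_depth_columns(header):
    """Map 'top', 'base' and 'thickness' to column indices in a header row.

    Matching ignores case and surrounding whitespace. Fields that are not
    found are simply absent from the result; the first match wins.
    """
    found = {}
    for i, name in enumerate(header):
        key = name.strip().lower()
        for field, aliases in ALIASES.items():
            if key in aliases and field not in found:
                found[field] = i
    return found


print(find_depth_columns(['Top', ' BASE ', 'Lithology']))  # {'top': 0, 'base': 1}
```

If the result contains neither 'top' nor 'base', the loader would presumably have to raise an error, since a thickness alone cannot place an interval.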

We still need to think about the possible things in an Interval object, to decide what we are going to give to the Interval constructor:

Top - self-evident
Base - self-evident -- if one of these is missing, we can infer it from the adjacent interval, or possibly from a thickness and the other value.
Component - A list of Component objects.
Description - Plain-text description that gets parsed into a list of Components. This is probably what we will mostly use when reading a CSV. Do we need to get a number of Components from somewhere? That feels like a tricky problem though.
Data - Anything else gets added to this dictionary.
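The split described in this list can be sketched as a small helper that pulls the known fields out of a row and leaves everything else in a data dictionary. The function name and the default column names are hypothetical, and the result is a plain dict rather than a call to the real Interval constructor, whose signature is not assumed here.

```python
def row_to_interval_kwargs(row, top='top', base='base', description='description'):
    """Split one CSV row (a dict) into top, base, description and data.

    The column names are parameters because the source CSV may use
    different headers. Anything not recognised ends up in 'data'.
    """
    rest = dict(row)  # Copy, so the caller's row is not modified.
    kwargs = {
        'top': rest.pop(top, None),
        'base': rest.pop(base, None),
        'description': rest.pop(description, ''),
    }
    kwargs['data'] = rest
    return kwargs


row = {'top': 0, 'base': 5, 'description': 'grey sandstone', 'porosity': 0.2}
print(row_to_interval_kwargs(row))
# {'top': 0, 'base': 5, 'description': 'grey sandstone', 'data': {'porosity': 0.2}}
```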

mtb-za commented 3 years ago

https://gist.github.com/mtb-za/3f94ffc426e804e7b2c778c2f0c6f051 has a couple of approaches, one using np.genfromtxt and one using csv.DictReader. Not sure if one appeals to you more than the other.

If we want to get input as something other than strings using either base approach, we need to cast the values. We can start with the most specific type: try int, then float, and finally leave the value as str. We might be able to handle other types, but it could be tricky to decide cleanly what they should be. genfromtxt allows for a sequence of dtypes; the same is probably possible with csv.DictReader, but more difficult.
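The int, then float, then str fallback could look like the following with csv.DictReader. This is a minimal sketch, not the code from the gist: cast and read_rows are hypothetical names, and the CSV is read from a string so that the example is self-contained.

```python
import csv
import io


def cast(value):
    """Try int, then float; otherwise return the value unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except (ValueError, TypeError):
            # TypeError covers None, which DictReader uses for short rows.
            pass
    return value


def read_rows(text):
    """Parse CSV text into a list of dicts with cast values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: cast(val) for key, val in row.items()} for row in reader]


text = "top,base,lithology\n0,10.5,sandstone\n10.5,25,shale\n"
print(read_rows(text))
# [{'top': 0, 'base': 10.5, 'lithology': 'sandstone'},
#  {'top': 10.5, 'base': 25, 'lithology': 'shale'}]
```

One consequence of casting per value rather than per column is that a single column can end up with mixed types, as 'top' does here (int 0, then float 10.5). Empty cells stay as empty strings, and strings such as 'nan' or 'inf' are accepted by float, which may or may not be what is wanted for null handling.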

agilescientific / striplog

Load from CSV #128