duncan-r / SHIP

Simple Hydraulic model Interface for Python
MIT License

Improvements to tuflow/datafiles package #15

Open duncan-r opened 7 years ago

duncan-r commented 7 years ago

These modules were never fully updated after the restructure of the rest of the tuflow package.

The datafileloader.py classes often fail while loading bcdbase/tmf/etc file contents for a range of reasons, usually due to misunderstandings of the flexibility that TUFLOW allows in these file configurations. This is particularly true for the subfile loaders (e.g. readMatSubfile()).

Some thoughts on the design of the loaders and potentially the ADataFileObject type classes are needed. The loaders in particular are overly complicated, hard to test and hard to maintain.

The biggest problem here is getting the data into a format that allows the files to be written back out again, rather than just read in. TUFLOW allows users either to order columns in a specific sequence with any header text, or to provide specific headers in any order. Detecting which convention is in use, while preserving the text of any original header, is complicated. Perhaps it would be better to just work out what the columns contain and always use default header text when writing them out?
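One hedged sketch of that "work out what the columns contain" idea: detect whether the header row names the canonical columns, and fall back to positional order otherwise. The canonical names below are invented for illustration, not TUFLOW's actual headers:

```python
import csv

# Illustrative canonical column names (NOT TUFLOW's real header text).
CANONICAL = ["name", "source", "column1", "column2"]

def normalise_rows(path):
    """Yield rows from a CSV in canonical column order, whether the
    file used recognisable header names or positional ordering."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        lowered = [h.strip().lower() for h in header]
        if set(CANONICAL) <= set(lowered):
            # Headers identify the columns; their order is irrelevant.
            index = [lowered.index(name) for name in CANONICAL]
        else:
            # Arbitrary header text: assume the documented column order.
            index = list(range(len(CANONICAL)))
        for row in reader:
            yield [row[i].strip() if i < len(row) else "" for i in index]
```

Writing back out would then always use the default header text, with the original headers optionally stored alongside for comparison.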

In addition to this, __event__-style variables need to be considered as well.

See TUFLOW Manual.2016-03-AA.pdf - section 7.5 on page 7-28 for more information on BC Database file configuration.

JamesRamm commented 7 years ago

I think that guaranteeing a correct output format is more important than preserving the input format. A quick look at the source suggests to me that perhaps there are too many custom classes & functions where the csv module and standard containers such as OrderedDict (or just a list) may be leveraged instead. It is dependent on what functionality is required for interacting with these CSV files (above and beyond what csv module can do), which isn't completely clear to me yet.
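A minimal sketch of that approach, using only the stdlib csv module and OrderedDict; the file layout and function names here are invented for illustration, not an actual SHIP API:

```python
import csv
from collections import OrderedDict

def read_datafile(path):
    """Read a CSV data file into a list of OrderedDicts keyed by the
    original header text, preserving column order."""
    with open(path, newline="") as f:
        return [OrderedDict(row) for row in csv.DictReader(f)]

def write_datafile(path, rows, fieldnames=None):
    """Write rows back out, defaulting to the field order of the
    first row if no explicit header order is given."""
    fieldnames = fieldnames or list(rows[0].keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Anything TUFLOW-specific (pipe-separated subfile references, event variables) would then sit in a thin layer on top of these plain containers rather than in bespoke loader classes.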

duncan-r commented 7 years ago

Completely agree. It's a bit of a mess. I also agree that standardising output is a good idea. I always tried to be able to return the files to the exact setup they were in when read, but this definitely adds complexity and may just be overcomplicating matters. We could even just store the original headers so the user can easily compare them if required.

Summary of the existing workflow:

The csv module is used a bit and I'm sure could be better utilised. The complexities come from the way some of the files are organised. A good example of the worst case is the materials.csv file (TUFLOW Manual.2016-03-AA.pdf - Table 6-13 on page 6-49), which can contain different types and different numbers of values in a single column, each of which needs to be handled differently. For example column 2 (Manning's):

    if len(splitval) == 1:
        if uuf.isNumeric(splitval[0]):
            new_row[1] = splitval[0]

        # Otherwise it's a filename. These can be further separated
        # into two column headers to read from the sub files.
        else:
            strsplit = splitval[0].split('|')
            if len(strsplit) == 1:
                subfile_details[strsplit[0].strip()] = []
                new_row[6] = strsplit[0].strip()
            elif len(strsplit) == 2:
                subfile_details[strsplit[0]] = [strsplit[1].strip()]
                new_row[6] = strsplit[0].strip()
                new_row[7] = strsplit[1].strip()
            else:
                subfile_details[strsplit[0]] = [strsplit[1].strip(), strsplit[2].strip()]
                new_row[6] = strsplit[0].strip()
                new_row[7] = strsplit[1].strip()
                new_row[8] = strsplit[2].strip()

    # If there's more than one value then it must be the Manning's
    # depth curve values (N1, Y1, N2, Y2).
    else:
        new_row[2] = splitval[0]
        new_row[3] = splitval[1]
        new_row[4] = splitval[2]
        new_row[5] = splitval[3]
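For comparison, the branching above could be flattened into a small helper that returns a tagged result instead of writing into positional slots. This is only a sketch; the return shape is invented, not SHIP's actual data structures:

```python
def parse_mannings(splitval):
    """Classify a materials.csv Manning's cell (already split on commas)
    as a single n value, a subfile reference, or a depth curve.
    Hypothetical helper for illustration only."""
    parts = [p.strip() for p in splitval]
    if len(parts) == 4:
        # Depth-varying curve: N1, Y1, N2, Y2.
        return {"curve": parts}
    if len(parts) == 1:
        try:
            return {"n": float(parts[0])}
        except ValueError:
            # A filename, optionally followed by one or two
            # pipe-separated column headers for the subfile.
            name, *headers = [s.strip() for s in parts[0].split("|")]
            return {"subfile": name, "headers": headers}
    raise ValueError("unexpected Manning's entry: %r" % (splitval,))
```

Keeping classification separate from row assembly like this would also make the subfile loaders much easier to unit test in isolation.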

The entries can also contain references to additional files that may, or may not, need to be loaded at that time.

In addition to this, any file, and some variables, can be referenced using the event data variables set in the control files. Therefore '..\mymanningsfile.csv' may actually appear as '__eventfile__' or similar in the data file.
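As a sketch, resolving those placeholders could be a single substitution pass once the event variables have been collected from the control files; the variable names and mapping below are made up for illustration:

```python
import re

# Matches TUFLOW-style placeholders such as __eventfile__.
EVENT_VAR = re.compile(r"__\w+?__")

def resolve_event_vars(text, event_vars):
    """Replace any __name__ placeholder found in text with its value
    from event_vars, leaving unknown placeholders untouched."""
    def substitute(match):
        return event_vars.get(match.group(0), match.group(0))
    return EVENT_VAR.sub(substitute, text)
```

Leaving unknown placeholders in place (rather than raising) would let the loader defer errors until the file is actually needed.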

The columns don't necessarily have to be in the order specified in the manual. In my experience they're often not, but TUFLOW seems to handle this fine. ...and I'm sure there's more I haven't worked out yet.