duncan-r / SHIP

Simple Hydraulic model Interface for Python
MIT License
6 stars 6 forks source link

Simplification of tuflowfactory #2

Open JamesRamm opened 7 years ago

JamesRamm commented 7 years ago

Hi Would you be amenable to a few changes aimed toward simplifying parsing the tuflow control files (specifically tuflowfactory.py)? My aim is to improve readability and ease of maintenance. The initial changes I'd like to make are:

duncan-r commented 7 years ago

Hi James,

In general yes. I've never been completely happy with the setup of the tuflow classes and am very amenable to any suggestions on improving the clarity/maintainability.

I see your point on this, particularly that confusing classmethod? I can't remember of the top of my head why it was made a class in the first place? It may have functioned slightly differently at some point and was half way through a refactor.

(I think some commands can have some extra info specified between brackets, which would need to be ignored...im not certain on this)

There are 1001 edge cases in what tuflow is prepared to accept and understand - very good for users and very frustrating when you want to parse the files! I would have to check the tuflow documentation to see exactly what the rules are, although I warn that there always seems to be an exception.

I also can't quite remember why it checks the filepart types first. I guess there was a reason, or some edge cases. I will go and look through the code again and work it out.

Long story short... yes, if you've got a better way to get the loading done then please make some changes and we'll try and get it merged in.

There are a few other things in here to consider as well - such as it not currently recognising a leading '\' or '\/' as '.\' or './' on file paths, which causes some problems and I've been meaning to fix. There is the added issue with this that tuflow is planning to support linux soon so they will not be interchangeable for long I expect. If you're interested I can put together an, I expect quite long, list of bits that definitely require attention.

Thanks,

duncan-r commented 7 years ago

A couple of things to look out for when parsing: Not all commands necessarily have an '==' e.g: ESTRY Control File Auto !looks for an .ecf file with the same name as the .tcf file Stop ! stops the simulation from continuing

Some command can have multiple arguments e.g.: Model Events == evnt1 | evnt3 | evnt7

I expect you've noticed that the loader falls back on storing sections of the file that it doesn't understand to be written out as-is later on. There may be a better way of handling this, but it has been useful while slowly building the capability up over time.

JamesRamm commented 7 years ago

@duncan-r Yes I am definitely interested. I have already run up against a filepath seperator problem when running the tests as I am working on linux. I have another PR which I am holding back til my other changes are merged/rejected, which addresses this (but only for the tests! - it justs uses '/' as windows fully accepts these as file path separators, which is OK for the SHIP tests, but I believe tuflow needs to be warned at least that you are using forward slashes)

I have been in touch with tuflow about linux support as I am working on a web based cloud/cluster computing solution for tuflow, so linux support v. important for this!

For parsing the commands without '==', I am thinking of a couple of options:

duncan-r commented 7 years ago

Yep the slash problem has had me scratching my head a few times. I was going to go with just handling it nicely and logging a warning for linux users, but I'm happy with whatever.

I've never looked at PLY before, but I will have a look at the docs you've linked to. Anything that can help take away the pain of parsing the files will be much appreciated. I've purposely avoided having too many additional packages where possible as we use this library in a lot of our internal software (which get compiled to an executable and I'd rather not having them get too big). So it will be a case of weighing the advantages up. At least for the time being everything needs to be both windows/linux and 2.7+ 3.3+ compliant as well.

One key thing to bear in mind is that the loader reads the files in recursively, dropping down into any control files it finds, before walking back up again. This way it can maintain the order of the commands and understand the priority of things like duplicated commands etc. It's not necessarily the only way to do it, just the way it does it at the moment.

I am basically not precious about any of the way it's set up. I'm happy to consider anything up to and including major design changes really. I've wrestled with and re-written large chunks of the tuflow component several times and it will be good to have someone else to discuss options with :)

duncan-r commented 7 years ago

I've setup Travis and configured a basic .yml file for it. Feel free to update with any lother checks you want etc.

A lot of the file path tests are failing (only on linux), as you mentioned. I'll leave them until you've finished playing with the configuration and have it set up and working. Let me know if you want to have a look at anything.

The integration tests are failing (again on linux). I'm happy to update these once you're finished if you want. They're not pretty, just thrown together as a sanity check.

JamesRamm commented 7 years ago

@duncan-r Great for setting up travis. Would also be a good idea to set up appveyor (windows)? I will try to push my test fixes before any further stuff (Getting a bit off topic of this issue now, but hey!)

JamesRamm commented 7 years ago

small update here. This is a basic solution using ply:

from ply import lex

class Statement(object):
    def __init__(self, command=None, parameter=None, comment=None):
        self.command = command
        self.comment = comment
        self.parameter = parameter

def parse(lexer):
    command = []
    parameter = []
    comment = None
    statements = []
    assignment = False
    while True:
        token = lexer.token()
        if not token:
            break
        if token.type == 'NEWLINE':
            if command or parameter:
                command = " ".join(command).casefold()
                parameter = " ".join(parameter)
                statements.append(Statement(command, parameter, comment))
                command, parameter, comment = [], [], None
                assignment = False
        elif token.type == 'COMMENT':
            comment = token.value
        elif token.type == 'ASSIGNMENT':
            assignment = True
        else:
            if assignment:
                parameter.append(token.value)
            else:
                command.append(token.value)

    return statements

tokens = (
    'NAME',
    'COMMENT',
    'NEWLINE',
    'ASSIGNMENT',
    'PATH',
    'OPTION'
)

t_ignore = " \t"
t_COMMENT = r'[#!].*'
t_ASSIGNMENT = r'\=='
t_OPTION = r'\|'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    return t

def t_PATH(t):
    r'[.]*?[<>~a-zA-Z0-9_/\\][a-zA-Z0-9_/\\.]*'
    return t

def t_NEWLINE(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")
    return t

lexer = lex.lex(debug=0)

Here, Statement is a do-nothing container which I'm using to represent the classes in tuflowfilepart.py

to use, you would pass the entire contents of a control file to lexer.input and pass that to parse:

with open('somefile.tcf', newline='') as fin:
    parts = parse(lexer.input(fin.read()))

It doesn't yet group logic blocks (treats them as individual commands), parse can most likely be improved and a few other things but is the direction I was thinking of for simplifying everything around getTuflowPart

duncan-r commented 7 years ago

This seems like a good approach. It would be nice to get rid of that horrifyingly nested if-else.

I suppose there's two considerations: 1) Trying to make sure that the regex's don't get too complicated - personally I find them a bit of a pain to debug when they go wrong when they get messy. 2) Making sure that we can maintain knowledge of things like: the logic block, as you mentioned, and piped commands etc.

Perhaps adding in some of the more complicated structures like in 2 would be good to see if it functions without getting too complex?

What are the requirements for PLY? And can't find any mention of any in the docs (I haven't had a go at just installing yet to see).

JamesRamm commented 7 years ago

It is written in pure python with no dependencies, so I imagine shouldnt be a problem for packaging?

I have added in parsing of if & define blocks & simplified the regexes. The parse function is now getting a little large (35 lines) so I'm looking at whether the yacc part of ply can be helpful. Ill push the changes to the refactor branch shortly so they can be played with.

duncan-r commented 7 years ago

That's fine, it's what I'd inferred.

Good new on the logic blocks. They've been a constant battle to get working properly, so they're probably a good test of the new loading approach.

JamesRamm commented 7 years ago

I just pushed 2bac27ab1316cb58bf5248f2ad316c0b96ce93d0 which adds the lexer I have been working on. It is totally separate from the rest of SHIP at the moment, so integrating would be the next step.

How it works

Next steps

duncan-r commented 7 years ago

Ok, there's a couple of things here for consideration I think.

The flow sounds fine, I'll try and find some time to have a look at everything later this evening and see if I can work out what's going on.

Pure comment lines have been ignored (inline comments are preserved though). It would be fairly simple to keep pure comments - this could be a switch?

This definitely needs to be included. A switch could be useful, but should probably default to keeping the comments.

It doesn't split out parameters with multiple options (this | that | other), nor commands with optional additions (The ESTRY command?).

We will definitely have to make sure this is handled and we should remain true to the setup of the original file. The library will need to have the path for the .ecf file in order to check paths etc, but any user would be frustrated if it wasn't turned back to 'AUTO' when written (at least by default), as they will have to start updating the path name every time they iterate a file version.

It should happily read commands or params like that, but just as 1 block.

lazily in the container class when requested.

Maybe this? We can then handle the best way to setup things up in the classes rather than putting too much heavy lifting into the lexer. I think we will probably want to have a TuflowLoader class still to call the lexer with so that we can maintain the single loader for all files in the library. Therefore any loading specific stuff that's needed post-parse could be done there (if not suitable in the actual TuflowPart) if that makes sense?

commands will be upper case, everything else lower case. (Since TCF files can be mixed case, original case cannot easily be preserved for the parser to work)

I think I'll have to have a chat with some colleagues about this one. If we do this we will have to be consistent - if someone creates a new TuflowPart in their script or changes a command we will need to manipulate it to match the others, probably while writing out. People can be a bit precious about formatting, but concessions will probably have to be made. I think that the best option would be to write out in the format proposed in the tuflow manual commands list; capitalised on first letter of a word for a command (left of '==), and capitalised instructions (right of '=='). Without the complexities of a lookup table or something I don't think we can know which words in the command should be all caps (IWL/XP NODES/etc), but I'm sure people could live with 'Xp Nodes' for the time being.

Example tuflow manual formats:

Grid Output Origin == AUTOMATIC
Read Grid IWL == ..\somegridfile.asc

I think I mentioned before that one of the uses for the library, for us, is to extract the files for a single model for delivery (which sometimes also involves removing any logic blocks (if it's for a specific scenario for example). The gains for this would be mitigated if we had to then go through and change the formatting of the file afterwards. Probably to the point that it would be better to do it by hand?

JamesRamm commented 7 years ago

We will definitely have to make sure this is handled and we should remain true to the setup of the original file. [..] but any user would be frustrated if it wasn't turned back to 'AUTO' when written (at least by default)

What I mean by this is the entire command / parameter will be read in (including AUTO if it is there) and therefore written back out. Just no attempt is (yet) made to parse the individual command options. (With the exception of full line comments, nothing is lost when writing back out)

if someone creates a new TuflowPart in their script or changes a command we will need to manipulate it to match the others, probably while writing out.

:+1: It is already the case that the conversion of commands to upper case happens on the way out. It doesnt have to be so by any stretch, the only thing that is not really possible is preserving any weird mixed case usage. (Commands and params could be capitalised, upper cased or lower cased easily, but if for whatever reason you decided that every 2nd letter should be upper case, or every occurrence of the word 'Map' should have a capital A or any other slightly odd stuff, that would be lost). A little extra work and it could be made configurable whether params, commands, comments and IF/DEFINE keywords should be any of those 3.

If there is a set of recommended formatting rules that can be gleaned from the manual, we could include those, if it is important that certain words/phrases have a certain case. The user could even pass these in in the form of a dictionary or something like that

JamesRamm commented 7 years ago

Just pushed 4f0a92108a1385bb4ecd800b7ddf578038a5fce0 which makes it possible to keep comment lines (and is set as the default)

JamesRamm commented 7 years ago

On case...it could (probably most likely) be a problem for filenames if the case is changed, so I need to have a think on that.

JamesRamm commented 7 years ago

Aha, as it turns out, case wasn't the problem I thought it was at all.

It is now case insensitive with no changes to what it can parse - what goes in is precisely what come out with one exception: 'IF', 'ELSE', 'DEFINE', 'END IF' and 'END DEFINE' statements are upper cased as it doesn't actually keep those keywords (just sets a flag according to the type of logic block and manually adds them on write)

JamesRamm commented 7 years ago

Here is a possible way forward. Please correct me if I have misunderstood any of the SHIP bits!

Obviously this will most likely change the public API so will again need thought to make sure its not too drastic.

duncan-r commented 7 years ago

https://github.com/duncan-r/SHIP/issues/2#issuecomment-311664881 This sounds perfect. It's exactly what the current behaviour is, which is nice.

https://github.com/duncan-r/SHIP/issues/2#issuecomment-311686795

TuflowPart is the base class for containing a line of the control file (COMMAND == PARAMETER !COMMENT)

Yes!

ControlStructure merges with/replaces some of ControlFile and TuflowLogic...

As with all of this I need to go in and look at what's going on and make some sense of it, but initial thoughts are that this sounds ok. Things to consider (that may be obvious):

tuflowfactory.py / the _readControlFile method of TuflowLoader is essentially replaced by the parse

I'm fine with this.

all the derived classes from TuflowPart are removed as previously discussed. It will need careful consideration as to what additional functionality the child classes provide ...

The unique functionality of these is pretty much down to

The import parts for these are that

I'm sure there's some things I've missed off the top of my head, but hopefully this gives the idea.

I'm going to raise the public API stuff on a separate issue.

JamesRamm commented 7 years ago

From https://github.com/duncan-r/SHIP/issues/20#issuecomment-312109731; I think the best thing is to continue developing the new functionality alongside the current functionality (rather than replacing/merging just yet), so that at some point we might be able to offer them both side by side with deprecation warnings on the current stuff to make it easier to migrate.

duncan-r commented 7 years ago

https://github.com/duncan-r/SHIP/issues/2#issuecomment-312110362 Yes I completely agree with this. I think that once we are happy with this one it may be best to actually tag the other one and split it off, like the release-0-2-5-beta branch. It can be maintained if needed from there. If this becomes the preferred approach I'd rather have it as the dev/master branch.

https://github.com/duncan-r/SHIP/issues/20#issuecomment-312109731 (continnuing implementation discussion here). I can see a big advantage for the tree structure for both the logic and the control files. In some ways we do have a - slightly strangely implemented - tree. There is the parent member of the parts. such as this from `TuflowFile.getRelativeRoots() which calls up through the parent to make sure that all relative path components are found when building the absolute path.

if not self.associates.parent is None:
    self.associates.parent.getRelativeRoots(roots)

But, it would be good to formalise any algorithm a little bit as it's kind of mixed with an observer at the moment!

JamesRamm commented 7 years ago

Commit 8f5ff3c creates the tree structure for the control files. I'm sure it can be simplified - for example, there is now both a node_type and a part_type for each node - I think these could probably be combined so there is just a part_type. It also adds the walk method, which will visit every node in the tree, which simplifies the filter method.

Some next steps:

duncan-r commented 7 years ago

Have the root node (or all nodes?) contain the root directory

The majority of files in a .tcf/.tgc/etc use relative paths. So it must be possible to build an absolute path from the .tcf root and any relative paths that "parent" files have. However they can have their own absolute path. This is actually quite common for result and check files. For instance we tend to build our models on our local machines, then transfer them/access them to/from dedicated modelling machines and have the paths write out to specific folders (this is also quite common with other models we get in too). Therefore they each need to be able to have their own absolute path if required and the path may not be findable when loading the model (to update something or whatever). I think that the lack of an absolute path on the part indicates that it will need to search up the tree for it?

Modify TuflowModel to contain the tree (we want to support the current api for now so this might have to be done as a separate class)

The majority of the public tuflow API is in the TuflowModel and ControlFile classes. I think that we should be able to replicate most of it in the new setup and add wrappers with deprecation's for the new functionality. (https://github.com/duncan-r/SHIP/issues/2#issuecomment-312110362) . We will need to have most of these methods covered anyway and I'm sure we can find a way to call the new functions with the API? (temporarily)

duncan-r commented 7 years ago

I like the idea of not holding the root locally when it uses the model root. It feels neater. It also means that you can change the root, change a control file's relative path, etc, with little effort. It will then be built on the fly when requested.

JamesRamm commented 7 years ago

For instance we tend to build our models on our local machines, then transfer them/access them to/from dedicated modelling machines and have the paths write out to specific folders (this is also quite common with other models we get in too).

This functionality I'll be using for distributed/cloud modelling. Setting the output/log paths needs to be handled automatically on the server (overriding any output paths that might already exist in the tcf). Likewise validating the input paths are relative & exist is another reason I was drawn to SHIP in the first place.

I like the idea of not holding the root locally when it uses the model root. It feels neater. It also means that you can change the root, change a control file's relative path, etc, with little effort. It will then be built on the fly when requested. I think that the lack of an absolute path on the part indicates that it will need to search up the tree for it? Yes. So, the root node will contain the root path and all children can just traverse up the tree till they find it, if needed.

I was thinking that the helper functions that were on individual parts (e.g. for constructing an absolute path) could be standalone functions...something like:

def getAbsolutePath(node):
    if node.type not in {...all the 'file' types}:
        raise TypeError(...)

    if os.path.isabs(node.parameter):
        return node.parameter
    else:
        path = node.root().model_path
        return os.path.normpath(os.path.join(path, node.parameter))
duncan-r commented 7 years ago

I was thinking that the helper functions that were on individual parts (e.g. for constructing an absolute path) could be standalone functions...

This could work ok, I think. The reason that that method is overridden so often is because there tends to be a lot file type specific behaviour. Such as for gis files you may need to know what other files should exist as well and make sure you grab them. The file line might read Read GIS Zline == ..\myfile.shp but you also need to know to search for myfile.shx and myfile.dbf as well. Some of it is to handle the piped files as well but that may be better handled now and no longer needed. I suppose I mean that the overridden methods are quite tightly coupled to the class (at the moment). Some of this functionality can also be useful elsewhere in the library as well, at least potentially. It's all just worth bearing in mind.

JamesRamm commented 7 years ago

The latest set of commits to the tuflow_refactor branch adds the basics of a new TuflowModel (just called Model) and support in the fileloader for using the new api:

from ship.utils.fileloaders.fileloader import FileLoader
model = FileLoader().loadFile(path, version=2)

Old api still accessible by omitting version (or setting it to 1). I am developing Model as an entirely new class since pretty much all the functions from TuflowModel would need to be rewritten or replaced. Nowhere near parity with original API features yet, but basics are there in (hopefully) a simpler and more succinct setup. In this branch then, I am just going to focus on making sure Model/ControlFileNode has feature parity with TuflowModel, ControlFile, and the various TuflowPart subclasses. Other API changes can then be tackled elsewhere/later.

duncan-r commented 7 years ago

This sounds really good. I'll see if I can pull it all down tonight and go through it properly. I had a bit of a look through the other day and it all seems to be looking good. I think I'll start hooking your branch up to a piece of software I have that reads lots of info in and tell you how it gets on. The version kwarg is a nice addition 👍 I'm drawn as to whether to keep the main class called TuflowModel or go with the Model choice. I partly think that all main models files (for any supported model in the API) could be called Model in it's own package for consistency. But it's also nice for users to be able to directly import the classes without clashes...i.e. keep it as TuflowModel. Any thoughts?

duncan-r commented 7 years ago

@JamesRamm A small thing at the moment, but would you mind fixing your docstring's up a little? Could you put the summary line next to the opening """ please?

You may have noticed but the library uses the google napolean style docstrings: http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html

Sphinx gets a little funny if they aren't consistent :)

JamesRamm commented 7 years ago

Yup I'll fix up the docstrings. I've continued api thoughts in #20

JamesRamm commented 7 years ago

@duncan-r I'm trying to consolidate some ControlFile functions with the new api. Currently looking at addLogicPart. If I understood correctly, I think append_statement is analogous to addLogicPart, although it deals with appending any basic command == parameter !comment node.

I don't see any useful way to maintain the same function signatures for any function for adding parts to the tree. Also, there probably needs to be a specific function for adding logic blocks, since they require a specific creation order of nodes, the user shouldnt deal with creating ControlFileNode instances themselves.

so something like appendLogicBlock, which would create the parent IF or DEFINE node with the initial statement and the END node. It would return the parent (IF/DEFINE) so you could then append the inner statements.

So a function signature like:

appendLogicBlock(logic_type, command, parameter, comment=None)

so you could do:

node = appendLogicBlock('IF', 'command', 'parameter', 'my comment')
node.appendStatement(command, parameter, comment)

Does this seem reasonable?

appendStatement I imagine is going to have to somehow handle whether to insert an ELSE IF, ELSE or nothing.

It looks like we will be able to maintain somewhat-similar functions at a Model level but it will be drastically different at the control file level

duncan-r commented 7 years ago

@JamesRamm sorry for the slow response. I've had a a few long days and no time to put to this. I think that what you're saying sounds reasonable. In general I very much doubt that this is used much in the public API. It will mainly be used internally to build and retrieve from the logic structures. My thoughts are that: if you can make it work will with your new setup and make the public API more understandable in the process it's a win-win.