GEMScienceTools / oq-mbtk

OpenQuake Model Building Toolkit, a suite of tools for building components and analysing models for PSHA.
https://gemsciencetools.github.io/oq-mbtk/index.html
GNU Affero General Public License v3.0

column names hardcoded in fault modeler #66

Open cossatot opened 6 years ago

cossatot commented 6 years ago

This issue pertains to work in progress on the fault modeler, in the 'fault_conversion_utils' branch that has not yet been merged into master.

@klunk386 @julgp @mmpagani please review:

The primary problem is that non-default geojson column names are being hardcoded into the fault_source_modeler.py script, in the param_map dictionary[^1].

The new library was written so that project-specific settings that are not default (things such as column/attribute names) are not hardcoded in, but are specified in a project configuration file. This was done because attribute names can and will change between projects. There is no reason to force everyone who uses this library to have a specific set of attribute names, or to prevent us from changing the names as the fault database project evolves.

Hardcoding these values in the fault_source_modeler.py script reintroduces the old problem of brittle code, which I re-wrote the library specifically to fix. I don't understand why it is necessary or desirable.


As a side note: my opinion is that fault_source_modeler.py should take as its primary argument a project configuration (.ini) file containing all non-default parameters (not only column/attribute names, but anything non-default: scaling relationships, seismogenic depths, rupture mesh spacings, temporal occurrence models, and so on). The GIS input and XML output files would also be specified there. This prevents all of these parameters from having to be either a) hardcoded into the fault modeler each time, or b) passed as command-line arguments each time.

It is very hard to keep track of all of these parameters, especially ones only passed on the command line, because you have to search through your bash history to find out what you did. Good luck with that after a few months or a few years! Also, with a config file, version control can be used so that the parameters are tracked along with the output file, which is impossible with the current system.

Also, it is no more work (less work, in fact) to write them once into a .ini file than to pass them on the command line.

An example .ini file is below

[config]
fault_gis_file = ./ne_asia_faults_rates.geojson
outfile = ne_asia_fault_model.xml

[param_map]
average_dip = ns_average_dip
name = ns_name
strike_slip_rate = ns_strike_slip_rate

[defaults]
rupture_mesh_spacing = 0.5
lower_seismogenic_depth = 15
name = 'no_name'

fault_source_modeler.py then reads this file and does all of the necessary work based on the parameters it contains. I have provided a preliminary working version of this script to @klunk386 and @julgp as an example.

Then, the XML file is built with fault_source_modeler.py config.ini instead of passing all of the arguments on the command line. I know this is different from what was done in the past, but it seems simpler to me.
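
For illustration, a minimal sketch of how the script could read such a file with Python's standard configparser module (the function name read_project_config is hypothetical, not part of the existing code, and numeric values would still need to be cast from strings):

import configparser
import sys

def read_project_config(ini_path):
    # Each section becomes a plain dict; configparser returns all values
    # as strings, so numeric defaults would still need casting.
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    config = dict(cfg['config']) if 'config' in cfg else {}
    param_map = dict(cfg['param_map']) if 'param_map' in cfg else {}
    defaults = dict(cfg['defaults']) if 'defaults' in cfg else {}
    return config, param_map, defaults

if __name__ == '__main__':
    # invoked as: fault_source_modeler.py config.ini
    config, param_map, defaults = read_project_config(sys.argv[1])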

Just to be clear, values only need to be written for the non-default parameters (plus the input GIS file name), not for every parameter a project requires.

If we want to discuss this further (and we should, as a group, come to a decision) I can make a separate issue for this topic.


[^1]: These column names are found in early versions of the fault database and are inherited from the old Faulted Earth project. They are not used in the current Global Active Faults database and are not in the current CCARA faults, for example. Because the old scripts used to make the .xml files were brittle and had these values hardcoded, I did not update the column names in the geojson files we were using for hazard modeling, since doing so would have broken the scripts. The column names have since been simplified to make them less confusing.

klunk386 commented 6 years ago

I'll start by replying to the second issue. Using an .ini file is certainly a good idea (indeed, I was going to include this possibility as well), but it should not be the only one. For those people (like me) who prefer to set the arguments programmatically (e.g. through a bash script, a Python notebook, or interactively in ipython), having to deal with an .ini file is impractical, so there should always be the option to pass all arguments as in-memory variables (which can be modified at run time, e.g. for sensitivity tests).
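
For example, a rough sketch of an entry point that accepts either style; the signature below is only an assumption, not the current build_fault_model interface:

def build_fault_model(ini_file=None, param_map=None, defaults=None):
    # Sketch only: if an .ini file is given, parse it into dicts
    # (e.g. with configparser) and merge it in; otherwise use the
    # in-memory arguments directly.
    param_map = dict(param_map or {})
    defaults = dict(defaults or {})
    if ini_file is not None:
        pass  # read the .ini and update param_map / defaults from it
    # ... model building would go here ...
    return param_map, defaults

# programmatic use, easy to modify at run time for sensitivity tests
build_fault_model(param_map={'average_dip': 'ns_average_dip'},
                  defaults={'rupture_mesh_spacing': 0.5})

# file-driven use from the shell: fault_source_modeler.py config.ini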

klunk386 commented 6 years ago

About the first comment, I fully agree that the code should be capable of handling non-standard header information. Indeed, I've also included the param_map variable in the class for parsing the geojson files (fault_database.import_from_geojson), and I will add it as well to the generic interfaces (build_fault_model and build_model_from_db) as an external argument or (as you suggested) as an .ini file. However, I think that this functionality should be implemented only in the part of the code that interfaces with the geojson, not in every single function as is done now. That is highly redundant, rather inefficient in practice, and probably useless for the standard user. If the idea is to give the code more flexibility, I think this is a bit of over-engineering, since the "header" translation of the fault_dict can be done just once using a separate function (or, as I said before, at parsing time). The individual functions would then simply use the default format internally. Cheers
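
A sketch of that idea, using a hypothetical helper (rename_fault_attributes does not exist in the library) that translates the project-specific column names back to the default names in a single pass:

def rename_fault_attributes(fault_dict, param_map):
    # param_map maps default attribute names to the project-specific
    # column names found in the geojson; translation happens once, so
    # downstream functions can assume the default names.
    renamed = dict(fault_dict)
    for default_name, project_name in param_map.items():
        if project_name in renamed:
            renamed[default_name] = renamed.pop(project_name)
    return renamed

param_map = {'average_dip': 'ns_average_dip', 'name': 'ns_name'}
fault = {'ns_average_dip': 45.0, 'ns_name': 'Example Fault'}
rename_fault_attributes(fault, param_map)
# -> {'average_dip': 45.0, 'name': 'Example Fault'}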

klunk386 commented 6 years ago

I have applied all of the above in the new version of fault_source_modeler.py (for the moment only as a module; I still need to modify the flags for the bash interface). There is just one issue I cannot solve: I would pass some parameters (e.g. tectonic_region_type, M_max) as defaults (like in the ini), while in Richard's implementation these are extra keys of the fault dictionary. I do understand Richard's motivation (these parameters might vary between faults), but for the generic interface I would prefer to pass them as defaults. Still working on this...

cossatot commented 6 years ago

However, I think that this functionality should be implemented only in the part of the code that interfaces with the geojson, not in every single function as is done now. That is highly redundant, rather inefficient in practice, and probably useless for the standard user. If the idea is to give the code more flexibility, I think this is a bit of over-engineering, since the "header" translation of the fault_dict can be done just once using a separate function (or, as I said before, at parsing time).

For the parameter mapping (not the defaults), I think it is fine to put the mapping/translation in an interface rather than in the library code. When I wrote the library code, I had to build a first version that incorporated this functionality directly into the library because there was no interface yet. I also wanted to make the interface as simple as possible and move all of the complexity into the library. But either way is fine with me.

I don't think it's computationally less efficient: this is exactly the same as looping through all of the parameters for all of the faults when the translation is done. Classes in Python are essentially big, layered dicts, so each time you access an attribute of a class you are doing dictionary lookups through a few nested dicts anyway. Regardless, it's a very quick operation (I just timed a dict lookup at 36 ns and an attribute lookup at 40 ns), so from an efficiency standpoint it really doesn't matter how we do it.
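
The comparison is easy to reproduce with the standard library's timeit module (the exact numbers will vary by machine); a minimal example:

import timeit

class Fault:
    def __init__(self):
        self.average_dip = 45.0

d = {'average_dip': 45.0}
f = Fault()

# timeit runs each lambda one million times and reports total seconds;
# multiplying the result by 1000 gives an approximate per-call time in
# nanoseconds (including the lambda-call overhead).
print(timeit.timeit(lambda: d['average_dip']))  # dict lookup
print(timeit.timeit(lambda: f.average_dip))     # attribute lookup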

However, we need to make sure that this is only done in one function somewhere. You mentioned having several interfaces in a comment above, and obviously you don't want to have to maintain a version of this code in each interface. This is one reason to keep it in the library, but if we are smart about how the interface code is written, it's not a problem to have it there.

cossatot commented 6 years ago

There is just one issue I cannot solve: I would pass some parameters (e.g. tectonic_region_type, M_max) as defaults (like in the ini), while in the implementation of Richard these are extra-keys of the fault dictionary. Actually, I do understand the motivation of Richard (these parameters might be variable between faults), but for the generic interface I would prefer to pass these as defaults. Still working on this....

As I wrote the code, it already did all of this.

I would pass some parameters (e.g. tectonic_region_type, M_max) as defaults (like in the ini), while in the implementation of Richard these are extra-keys of the fault dictionary.

This is incorrect. These are project default values and are only overridden for specific faults if those keys exist in the fault_dict's attributes. If the fault_dict doesn't contain those values, the default values are used. Just update the default dictionary with whatever you'd like and make sure to pass the dictionary to the construct_sfs_dict function. The default dictionary applies to everything in the project. It's not part of the fault_dict and is not specific to any fault.

For more information, see how the fetch_param_val function works: it checks the fault_dict dictionary, and if the key isn't present, it checks the defaults dictionary.
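
The lookup order amounts to something like the sketch below; the real fetch_param_val signature may differ, this only illustrates the logic described above:

def fetch_param_val(fault_dict, key, defaults):
    # fault-specific value wins; otherwise fall back to the project default
    if key in fault_dict:
        return fault_dict[key]
    return defaults[key]

defaults = {'tectonic_region_type': 'Active Shallow Crust', 'M_max': 8.0}
fault = {'name': 'Example Fault', 'M_max': 7.5}

fetch_param_val(fault, 'M_max', defaults)                 # 7.5, per-fault override
fetch_param_val(fault, 'tectonic_region_type', defaults)  # project-wide default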

If there are problems with this, it may be because you have changed the construct_sfs_dict function by using the default_update keyword argument instead of the default keyword argument. The way I wrote the library code, the default dictionary is updated once at the initialization of the project. It has to be passed back to the construct_sfs_dict function, even though it's a global variable, because of how variable scoping works inside modules.

klunk386 commented 6 years ago

As it was before my changes, passing a default dictionary with only one parameter to be updated caused the whole hardcoded dictionary to be overwritten, losing all the other parameters and creating an error. With my modification, only the user-provided key is replaced, while all other default settings are preserved. I suggest using different names for the hardcoded dictionary and the function argument. In my modification I changed the name of the argument, but you could alternatively change the name of the hardcoded dictionary to prevent overwriting, and then update it.
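
In plain-dict terms, the difference is between rebinding the name to the user's partial dictionary and updating a copy of the hardcoded one (the parameter values below are only illustrative):

HARDCODED_DEFAULTS = {'rupture_mesh_spacing': 2.0,
                      'lower_seismogenic_depth': 20.0,
                      'tectonic_region_type': 'Active Shallow Crust'}

user_defaults = {'rupture_mesh_spacing': 0.5}

# overwrite: every key the user did not supply is lost
defaults = user_defaults
'lower_seismogenic_depth' in defaults   # False, leads to a KeyError later

# update: only the user-provided key changes, the rest is preserved
defaults = dict(HARDCODED_DEFAULTS)
defaults.update(user_defaults)
defaults['lower_seismogenic_depth']     # 20.0
defaults['rupture_mesh_spacing']        # 0.5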

cossatot commented 6 years ago

As it was before my changes, passing a default dictionary with only one parameter to be updated caused the whole hardcoded dictionary to be overwritten, losing all the other parameters and creating an error.

This is also incorrect. In the script I provided that demonstrates how to use the library, both the defaults and param_map dictionaries are initially generated in the library/module, then imported into the script, updated with the new information from the .ini file, and passed back as arguments to the function. They are not overwritten.
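
Sketched with explicit imports instead of the star import, the intended workflow looks roughly like this (the module, dictionary, and keyword names are taken from this thread and should be treated as approximate):

import configparser
from fault_conversion_utils import defaults, param_map, construct_sfs_dict

cfg = configparser.ConfigParser()
cfg.read('config.ini')

# the module-level dicts are updated in place, not overwritten
param_map.update(cfg['param_map'])
defaults.update(cfg['defaults'])

# ... build each fault_dict from the GIS file, then pass the updated
# dicts back explicitly, e.g.:
# sfs = construct_sfs_dict(fault_dict, defaults=defaults, param_map=param_map)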

This may have been unclear in the example script because they were imported along with everything else (using from fault_conversion_utils import *) and I didn't comment all the lines in detail.

I apologize for not providing enough documentation and for writing code that was unclear.