JacquesCarette / Drasil

Generate all the things (focusing on research software)
https://jacquescarette.github.io/Drasil
BSD 2-Clause "Simplified" License
136 stars 25 forks source link

Generated code: Inputs & Derived Values should be validated before any usage #3829

Open B-rando1 opened 3 days ago

B-rando1 commented 3 days ago

I was looking at something with @balacij and @NoahCardoso, and we noticed something a bit off. In all of our current input patterns, we follow this pattern:

  1. Parse all inputs
  2. Calculate derived values (where derived values are intermediate values with constraints)
  3. Validate inputs and derived values
  4. Intermediate calculations
  5. Output important values

As @balacij pointed out, we should not be calculating derived values before validating inputs, for both security and performance reasons. This is actually also a problem with logging: we shouldn't print a value to the console or a file before we know it is a valid value.

There are two solutions to this problem:

The other change we'll need to make is to tie logging to constraint-checks rather than to parsing.

I'll try to make a follow-up with more technical details tomorrow, but I believe that most of the changes we need to make will be in Modules.hs, as well as the readData function in Import.hs.

As a side note, I wonder if we should separate 'safety' constraints from 'modelling' constraints - we need to be much more careful about when and where we check for malicious inputs than inputs that are harmless but incorrect.

JacquesCarette commented 2 days ago

Excellent observations! Strongly agree.

B-rando1 commented 2 days ago

As I mentioned, we want to move logging from when we parse inputs to when we validate them. I looked around a bit to see what this will take.

Currently, logging seems to be done in the readData function, specifically lines 647, 648, 652, 654, 657, and 668.

If a chunk has no constraints associated with it, then we can leave it there (technically every chunk should have security constraints, but our current examples don't even have security constraints, so we need to cover the possibility that there are none).

If a chunk does have constraints associated with it, we'll need to move its logging from this function to the constraints function. After looking around, I believe that the function we want to modify is chooseConstr.

I also noticed that we're printing out the value of inputs that have been proven to be invalid, using the constraintViolatedMsg function. I think this exemplifies what I said above that we might want to look at separating safety and model constraints. It makes sense and is very helpful to print out an input's value if it is out-of-bounds, but printing it out if it has been determined to be unsafe is a terrible idea.

I'm not exactly sure where we should start with this part of the issue. We can start moving validation closer to parsing, but it almost feels a bit too early to do some of this, when the safety constraints we're talking about are purely hypothetical.

I guess the change that does make sense either way is moving constraint checks for inputs above calculating derived values. If an input is out of bounds, you want to know as soon as possible, not after calculating some intermediate values that won't end up being used. This may be a more difficult change to implement, however, as it deals with adding functions and moving function calls around. I'll take another look at some point to try to get a grasp on what it'll take.

JacquesCarette commented 1 day ago

readData does way, way too much. One thing to consider is to break it up into smaller pieces, and push the decisions upstream.

balacij commented 9 hours ago

Thank you for writing this up @B-rando1 !

In readData, appendTemp and clearTemp can probably be removed in favour of using fmap. I'm sure there are other ways to make the code more readable too.