LimnoTech / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0

Rewrite readWDM.py to read by data group & block #21

Closed: aufdenkampe closed this issue 3 years ago

aufdenkampe commented 3 years ago

From Jack in a Feb 5 email:

I have been investigating bugs related to WDM data with variable timesteps. My review of the readWDM code that loops through WDM data Groups (after comment #Get timeseries data) with calls to 'getFloats' has led me to believe that 'getFloats' should be replaced by a new routine called something like 'getTimeseriesDataGroup'.

The new routine would do the following:

  1. check to see that the Group date word matches the expected date.
  2. loop through WDM Blocks within Group
    • if the Block timeunit/timestep does not match the HDF5 timeunit/timestep, adjust the number of values in the Block for use in filling the array to be written to HDF5, then process into the HDF5 data array like compressed WDM data.
    • if the Block is compressed, add multiple identical values to the HDF5 data array.
    • if the Block is uncompressed, march through the WDM record copying data values into the HDF5 data array.

There need to be robust, optional debug messages written to a debug file, as there can be subtle issues related to midnight, the new year, leap years, and the end of WDM records.

Code from atcWDMvb routine TimeseriesDataFromWdm can be used as an example of how the new routine might work.

Paul D. thinks: "This vb code seems to handle your complex WDM data correctly, so if we could just translate this vb to python we should be golden."

Lastly, @ptomasula created this EFDC Fortran binary reader in Python, which could be worth following as an example designed to efficiently create Pandas data frames. https://github.com/LimnoTech/LTPy/blob/master/ltpy/ltpy/model_tools/efdc/binary_reader.py

aufdenkampe commented 3 years ago

Connected to #20 for info, #22 for testing, and #23 for reading into Pandas data frames and writing outputs to Parquet.

aufdenkampe commented 3 years ago

Update: LimnoTech & RESPEC met Feb. 10 and outlined a plan to rewrite the getfloats function in readWDM.py by mimicking the TimeseriesDataFromWdm function in the BASINS atcWDMfile.vb script, which does not rely on a Fortran DLL file.

@jlkittle pointed out that this WDM debugger tool would be very helpful: https://github.com/respec/FORTRAN/blob/master/lib3.0/BIN/WDMRX.EXE.

@PaulDudaRESPEC shared the scanned "WDMProgrammersGuide.pdf", which I'll post somewhere in this repo.

ptomasula commented 3 years ago

Background

An initial rewrite of getfloats was implemented as the new function process_groups. The process_groups function is based on the logic in TimeseriesDataFromWdm from RESPEC's BASINS code, which processes timeseries as groups and blocks. Each group contains one or more blocks, which are structured data records. Each block begins with a control word (a 32-bit integer) that contains information about the block's structure, and then continues with one or more 32-bit float values based on the number-of-values parameter specified in the block control word. The logic was implemented in commit c5b2a1cdd6a55eccc0db67d7840ec3eaf904dcec and was able to successfully read and convert two testing WDM files: rpo772.wdm and the test10b file TEST.WDM. However, additional testing with RPO_SWMM48LINKS2017_wCBOD_June2019.wdm revealed two issues, discussed further below.
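For orientation, here is a minimal, hedged sketch of decoding a block control word. The shift amounts and masks are assumptions based on the bit layout described in Appendix B of the WDM Programmer's Guide, not a copy of the committed code, and should be verified against that document:

def parse_block_control_word(bcw: int):
    """Unpack a WDM block control word (a 32-bit integer); assumed bit layout."""
    nval = bcw >> 16            # number of values that follow the control word
    tstep = (bcw >> 10) & 0x3F  # timestep, in units given by tcode
    tcode = (bcw >> 7) & 0x7    # time-units code (seconds ... years)
    comp = (bcw >> 5) & 0x3     # nonzero: block stores one value repeated nval times
    qual = bcw & 0x1F           # quality code
    return nval, tstep, tcode, comp, qual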

Numpy Array IndexError

The irregular timestep means that the exact dimensions required for a numpy array to hold the converted timeseries could not be precisely determined prior to processing the timeseries groups. Allocating a numpy array requires a fixed size, and misjudging that size results in an IndexError (which also isn't handled by the current code) and causes file processing to fail. The current solution, which was committed as bad3771d4d6c06e0b2aefbd322273c990a025021, allocates the numpy array in chunks and expands the array when processing the next block would store values that exceed the current boundary of the array. However, this solution is not ideal. Arrays are not really extendable. Reallocating the array, even in chunks, means copying the previous array to a larger array in a new block of memory. This drastically impacts performance, and the impact of this approach will be felt even more with larger files.
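For reference, a minimal sketch of the chunked-growth idea described above (the names and chunk size are hypothetical, not the committed code):

import numpy as np

CHUNK = 100_000  # hypothetical chunk size

def append_value(values: np.ndarray, count: int, value: float):
    """Store value at position count, growing the array by CHUNK when it is full.

    The concatenate call allocates a new, larger array and copies the old
    contents into it, which is exactly the repeated-copy cost noted above;
    a plain Python list avoids that cost by amortizing its growth.
    """
    if count >= values.size:
        values = np.concatenate([values, np.zeros(CHUNK, dtype=values.dtype)])
    values[count] = value
    return values, count + 1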

Next Steps:

  1. Check with @PaulDudaRESPEC and @jlkittle to see if there is a method to determine the number of data values in a timeseries prior to processing the groups.
  2. If it is not possible to determine number of values prior to processing, consider replacing data_array & value_array with a more suitable data structure (e.g. python List)

Blocks Ending on 511th Element of a Record

Timeseries 134 of RPO_SWMM48LINKS2017_wCBOD_June2019.wdm contains a block which ends on the 511th element of a record. The previous logic tried to parse the 512th element of that record as a block control word; however, the 512th element was not a valid block control word. This would cause issues in processing the remaining blocks in the timeseries. Working on the assumption that blocks are always at least two words long, I implemented logic in commit 129a228922539c524fcfe7c90ec65b9b19291091, which moves to the next record in the timeseries chain of records when a block ends on either the 511th or 512th element of the record.
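A small illustration of that record-boundary rule (the function name is hypothetical; the 512-word record size comes from the WDM record layout discussed above):

def must_jump_to_next_record(word_index: int, words_per_record: int = 512,
                             min_block_words: int = 2) -> bool:
    """Return True when too few words remain in the current record for another block.

    Relies on the assumption stated above that a block is always at least two
    words long (one control word plus at least one value), so a block ending
    on the 511th or 512th word cannot be followed by another block in the
    same record.
    """
    return (words_per_record - word_index) < min_block_words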

Next Steps:

  1. Check with @PaulDudaRESPEC and @jlkittle about the significance of the 512th element when a block ends on the 511th element of a record.
  2. Modify the code if the above assumption proves to be inaccurate.
aufdenkampe commented 3 years ago

Looping in @bcous for notifications.

htaolimno commented 3 years ago

I've made some good progress based on Paul's previous work and VB code. I added an alternative implementation of process_groups() function (commit 0f5a2d1f4fef08aee5858a9f7ec5af47e55f8aa7). The two major changes I made from Paul's version are:

  1. used lists to replace numpy matrices; lists seem to run faster when reading the rpo772.wdm file;
  2. added a loop to iterate through each group and used the ending date as the ending condition. The VB code is implemented this way.

All three of these WDM files can be read successfully now:

  • rpo772.wdm
  • test.wdm
  • 2_RPO_SWMM48LINKS2017_CBOD.wdm
aufdenkampe commented 3 years ago

Moving excerpts of conversation from Teams on Mar. 1, 2021:

[Monday 10:10 AM] Paul Tomasula

Thanks Hua, excellent work on this! I reviewed the code and it looks solid to me. Have you done any comparisons between the converted time series and those from the BASINS code?

Ideally we automate this comparison as a test in the future, but for now spot checking a few time series by hand should do. I'd mostly be interested in making sure some of the time series in the RPO_SWMM48LINKS2017_wCBOD_June2019.wdm match since that file has the funky block ending on the 511th element of a record.

[Monday 10:24 AM] Hua Tao

Now I remember that the dates in the VB-generated CSV files are in Julian date format. You must convert them back to a normal date/time format. I can point you to those. I placed the two folders of CSVs here: S:\DOESBIR2\HSP2-Dev\WDM-TestFiles

[Yesterday 11:12 AM] Paul Tomasula

Hey Hua, I did some spot checking on the time series and the values are nearly identical. The run time is considerable though. I'm going to take a crack at some refactoring. I'll primarily focus on getting numba implemented and we'll see how it looks from there.

aufdenkampe commented 3 years ago

Moving excerpts of conversation from Teams on Mar. 3, 2021:

[9:03 AM] Paul Tomasula

Refactoring and Reader Performance

Hua and Anthony, I did some general refactoring of the WDM reader yesterday to tidy up the code (remove deprecated functions and irrelevant comments, clarify variables through better naming, raise exceptions in place of print statements for errors, etc.) (Commits b600ebf3689ab01391a1b4c7b1b9a747c13c2372, 256186994681af73ccfbd94e381d3309ae1d1cbb, 2e205f12e1da856f90830b17b0d6a69aecf48a7f). I also tried to implement numba, but was ultimately unsuccessful.

Numba implementation is best accomplished through the @njit decorator, which performs just-in-time compilation of a function without using Python structures. When done correctly, this allows Python code to approach the speed of C or Fortran, with a little extra runtime for compiling. However, numba also has rather restrictive data type allowances and is particularly restrictive with datetimes. The only supported structures are numpy's datetime64 and timedelta64 objects, and even then it's not a full implementation. For example, creation of a datetime64 object is not presently possible within a numba function. Numpy's datetime object is also extremely difficult to use if you have a variable unit for the timedelta (i.e. some in years, others in days, minutes, etc.) like the WDM files do. Years and months are nonlinear, so there's no easy way to convert those units to a timedelta in seconds. I did start to think about a workaround to manually convert years and months. The implementation would be pretty messy though and would reduce code readability and maintainability. Also, the complexity of the workaround doesn't guarantee a runtime boost because it needs to run a bunch of extra calculations to convert years and months to days.

I guess my question is, what do you all think about this? The runtime is still considerable. It took my machine over 20 minutes to convert the RPO_SWMM48LINKS2017_wCBOD_June2019.wdm to an HDF file. Should we leave it at this? Any other thoughts to bump up performance? Parallelization will help a little bit, but we'd still be looking at a considerable runtime.

[9:32 AM] Hua Tao

Paul, I never used Numba so I don't have a lot of insight about it. The code that we developed to read WDM files is pretty simple. The only part that might be time consuming is probably the date/time calculations in Pandas. The rest of the code is very primitive calculations and loops. I don't know whether and how much Numba is able to recompile and speed up those. I suspect that writing the HDF5 file (hundreds of MB) to disk is probably the most time-consuming part. If we only keep the WDM data in RAM, it is probably not too bad. We can strip the HDF5 writing out of the reader, make it a separate ExporttoHD5() method, and only call it when needed.

[9:35 AM] Anthony Aufdenkampe

Paul, I'm wondering if we should follow Hua's suggestion to skip use of Numpy until the very end of the ingestion process, or at the very least, not convert the datetime from Fortran (isn't it just a float?) until after all the looping is done. That way you can use Numba.

[9:51 AM] Paul Tomasula

A couple of quick responses though. I isolated the process_groups method for testing and, just running that, I average about 3 seconds per time series. The RPO...June2019.wdm file has 432 time series, so it's ~21 minutes for executing that step. This definitely appears to be at least one of the bottlenecks. The writing could also be pretty substantial, though I haven't tested that.

The issue with isolating the datetime conversions is that they're currently integral to the processing loop. We use the datetime to indicate when the loop needs to stop executing. Even putting that piece aside for now, the only datetime information stored is in the block control word, and it specifies the timedelta for each value. This means that each timestep is dependent on the previous timestep in its calculation. So we could maybe store the timedelta information to have a numba-compatible loop, but I'd imagine looping through a second time to convert the timedeltas to a timeseries is probably slower than our current approach.

[9:57 AM] Hua Tao

Paul, then I would think parallelization would help improve speed. There are no dependencies between time series. This approach is low-hanging fruit, I would think. What do you think?

[10:00 AM] Paul Tomasula

Yeah I agree. We definitely want parallelization added in as well regardless. Maybe as a first step we implement parallelization and see what performance looks like? Then revisit refining the processing method only if necessary.

[11:43 AM] Anthony Aufdenkampe

Paul, that's great that you were able to profile the typical speed for each time series. I agree that 3 s per series sounds fast, but perhaps you could speed it up to 0.3 s or faster by using Numba. Let's try to get the timestamp conversion outside the loop.

Hua, what is the total run time of the VB code to run that same RPO...June2019.wdm file? Knowing that would be very helpful.

Paul, when it's time to parallelize, I would really like to go with Dask for this and for the rest of the repo. It looks relatively simple according to this tutorial: http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html. Dask also provides Dask DataFrame, which is very similar to a Pandas DataFrame but has built-in parallelization and also allows for very large datasets that are bigger than RAM. We might want to use these instead of Pandas dataframes throughout the repo.

[12:02 PM] Hua Tao

VB.NET uses a just-in-time compiler, so performance-wise it will be worse than, but comparable to, FORTRAN. I haven't profiled performance for this particular reader. I would think it's 10 to 20 times faster than Python, based on my past experience. It's great that you suggested going with Dask. There are many options. This saves us time.

Anthony Aufdenkampe

Hua, great! Could you try timing the VB.net code today or tomorrow?

​[12:22 PM] Paul Tomasula

Thanks Hua, sounds good! I need to change gears for this afternoon, but knowing the runtime for VB.NET will be helpful to see the level of performance we're aiming for. In the meantime I'll start thinking about whether it is possible to pull the time conversion out into a separate loop or some other workaround. Our call with Jack on Friday might also offer up some clarity. If we can figure out how to decouple datetime from the loop control (using an indexing approach or something similar), then it might be possible to perform those conversions in a later step.

[12:42 PM] Hua Tao

VB.NET: half second on average for each time series. Yes, only 1/2 second each time series. This is still in debug mode. Release mode will be even faster.

​[1:56 PM] Anthony Aufdenkampe

Hua, thanks for timing it! So ~6x faster in debug mode, and faster once compiled. That's our goal! I suspect we can get close to that speed if we can use numba, and faster once we apply Dask.

Paul & Hua, let's chat sometime before Friday to figure this out. Remind me how datetime is stored in the WDM again? Maybe we emulate that.

ptomasula commented 3 years ago

@aufdenkampe, datetime is stored within the group structure. The first word of a group contains the start date of the group, followed by a series of blocks. Each block starts with a block control word followed by one or more data values. Through bit manipulation, the block control word is parsed into information about the block's structure, including the number of values in the block, the timestep, and the timestep unit. In this way you can increment the previous date (or the group's start date if you are processing the first block in the group) to calculate the date for the value you are currently reading.
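To make that incrementing step concrete, here is a small sketch using dateutil's relativedelta rather than the project's numba-oriented code; the time-unit code mapping is an assumption that should be checked against the Programmer's Guide:

from datetime import datetime
from dateutil.relativedelta import relativedelta

# Assumed WDM time-unit codes (verify against the Programmer's Guide)
TIME_UNITS = {1: 'seconds', 2: 'minutes', 3: 'hours', 4: 'days', 5: 'months', 6: 'years'}

def increment_date(current: datetime, tcode: int, tstep: int, nsteps: int = 1) -> datetime:
    """Advance the running date by nsteps timesteps of the block's time unit.

    relativedelta handles the calendar-aware month and year arithmetic that a
    plain seconds-based timedelta cannot express.
    """
    return current + relativedelta(**{TIME_UNITS[tcode]: tstep * nsteps})

# e.g. increment_date(datetime(1976, 1, 1), tcode=3, tstep=1) -> 1976-01-01 01:00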

aufdenkampe commented 3 years ago

@ptomasula, thanks! I was thinking it was something like that. I also was glancing over the WDMProgrammersGuide.pdf and saw a description on page 101 (Appendix B), below.

So, within every time series data block, the BLOCK START date time is stored as a series of integers (after decoding from bits), and the BLOCK TIME UNITS and BLOCK TIME STEP are also stored as integers.

Why don't we convert these times to decimal days (floats) after a certain reference time, similar to the VB code? We would then be able to do the translation into Pandas datetimes in one step after everything is read.

@htaolimno, what is the start day for the time in the CSV files you exported using VB? In other words, what date corresponds to a value of 37987? In Excel, January 1, 1900 has a date serial number of 1. If these were designed for use in Excel, then 37987.000 = January 1, 2004 at 0:00 hours (midnight).

[image: excerpt from page 101 (Appendix B) of WDMProgrammersGuide.pdf]

htaolimno commented 3 years ago

@aufdenkampe Datetime in WDM is encoded in a 32-bit binary word. See the image below: [image: WDM 32-bit date word encoding]. In the VB code, Julian dates are calculated. The value is the same as in Excel: January 1, 1900 has a date serial number of 1.
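For illustration, a hedged sketch of unpacking such a date word; the bit widths (year in the high bits, then 4-bit month, 5-bit day, 5-bit hour) are assumptions inferred from the figure and should be confirmed against the Programmer's Guide:

from datetime import datetime, timedelta

def split_date_word(date_word: int) -> datetime:
    """Unpack a WDM 32-bit date word into a datetime (assumed bit layout)."""
    year = date_word >> 14           # high bits
    month = (date_word >> 10) & 0xF
    day = (date_word >> 5) & 0x1F
    hour = date_word & 0x1F
    # WDM can store hour 24 for midnight at the end of a day, which Python's
    # datetime rejects, so build the date first and add the hours separately.
    return datetime(year, month, day) + timedelta(hours=hour)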

ptomasula commented 3 years ago

@aufdenkampe @htaolimno I gave this some more thought last night and came up with a conceptual design for a numba-compatible date converter. This approach hinges on numpy arrays, so we'll need to figure out if there's a way to determine the size of an array before processing a group/timeseries. We also need the block processing loop control to be independent of datetime. Perhaps we can go back to the frepos approach.

Assuming we can meet those conditions, this method involves tracking the time offset from the epoch (1970-01-01T00:00:00). While reading the blocks, separate arrays will be used to store the offset of a given value in each time unit. Later we can apply those offsets within a numba-compatible function to generate a date array. See the attached notebook, which tests the conversion of 600K datetimes (comparable to the length of the timeseries in the RPO testing files). The initial run plus compilation takes about 0.5 seconds. After compilation it runs in ~0.01 seconds on my machine.

I think there are still some questions to answer to see if we can meet the conditions required to implement this method, but I think this is the direction we'll want to head if possible.

NumbaDatetimeApproach.zip
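A plain-numpy illustration of the offset-from-epoch idea (the attached notebook wraps comparable arithmetic in a numba-compatible function; the array names here are hypothetical):

import numpy as np

def offsets_to_datetime64(years, months, days, hours, minutes, seconds):
    """Combine per-value integer offsets from 1970-01-01 into datetime64 values.

    Year and month offsets are applied first at calendar precision (numpy
    treats timedelta64[Y] and timedelta64[M] as calendar units), then the
    linear units are added at second precision.
    """
    dates = np.datetime64('1970-01-01', 'Y') + np.asarray(years, dtype='timedelta64[Y]')
    dates = dates.astype('datetime64[M]') + np.asarray(months, dtype='timedelta64[M]')
    dates = dates.astype('datetime64[D]') + np.asarray(days, dtype='timedelta64[D]')
    dates = dates.astype('datetime64[s]') + np.asarray(hours, dtype='timedelta64[h]')
    dates = dates + np.asarray(minutes, dtype='timedelta64[m]')
    dates = dates + np.asarray(seconds, dtype='timedelta64[s]')
    return dates

Wrapping the result in pd.DatetimeIndex(...) would then give the index for the converted series.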

aufdenkampe commented 3 years ago

@ptomasula, thanks for thinking through and testing that approach. It definitely looks promising.

Regarding my suggestion in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-790113950 above, I'm aware that the date time is encoded as a binary, which is decoded into sets of integers for year-month-day-hour and in some cases also minute-second.

My thinking is that we collect these datetime stamps in one of those formats (a byte object or a list/tuple of integers) and then convert them later, after the loops are completed.

Alternately, we do the initial conversion to decimal days after 1/1/1900 (a float) inside the loop, using the same algorithm as the VB code.

This allows us to avoid needing the special datetime64 data type and preallocating the size of the Numpy array until after all the looping is done.

aufdenkampe commented 3 years ago

I just created, using my wife's Acrobat Pro, a searchable version of the WDM guide that also has bookmarks for sidebar navigation. This should help a ton. See WDMProgrammersGuide+Search+Nav.pdf

ptomasula commented 3 years ago

@aufdenkampe Thanks for the searchable PDF. This is most wonderful!

That is an interesting thought about using decimal days. I tried a similar approach early on, but the hurdle was how to deal with timestep units of years and months. I don't know that I have a great method for determining the current month of a decimal day, which is critical in determining how many days to add when incrementing by month. Also, leap years are difficult to handle. The BASINS code uses a function TimAddJ which I assume handles these complications. I cannot seem to find the actual code for that function though. Any chance you can work your GitHub magic to locate the function definition?

Though I also like your suggestion about using some type of bit format to store date information. That solves the annoying issues around using the datetime64 object, like the implementation in my previous comment. This way we can perform the date incrementing inside of the main block processing loop and still use date as the loop control. The main challenge with this approach is balancing the runtime for the conversion. To my knowledge, Numba doesn't support the creation of any type of datetime objects, so we'd need to handle that conversion with some alternative method. On the upside, in this approach the conversions would be completely independent of the previous timestep, so we can vectorize it. That's probably fast enough on its own without the need to leverage Numba, so I think this idea has considerable merit.

ptomasula commented 3 years ago

@aufdenkampe @htaolimno, I just added commit e5d64a1 which applies the bit date handling approach we discussed above. This also supports numba for the _process_groups function.

The runtime is still considerable at about 4 seconds per timeseries, but the bottleneck has been better contained and moved to a single datetime conversion step. The actual _process_groups function runs in ~0.08 seconds on my machine. The conversion to datetime is ~3 seconds, and then ~1 second for writing to HDF. The current datetime conversion approach uses the pandas apply function with a lambda, which is probably not the most efficient approach. I'll work on vectorizing it.

htaolimno commented 3 years ago

@ptomasula I think all the Julian date related utility functions are in the BASINS modDate.vb. Also, you may find a Python implementation of Julian dates at this link.

ptomasula commented 3 years ago

Thanks @htaolimno. I think these are good references. We'll have to see if it's necessary to implement these. In theory we can just continue with the bit datetime approach I've implemented. Once we have the dates as Datetime objects, we can use this trick to get a Julian Date: Datetime.toordinal() + 1721424.5
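For example, a tiny helper using that trick (the 1721424.5 constant shifts Python's proleptic-Gregorian ordinal onto the Julian Date scale):

from datetime import datetime

def to_julian_date(dt: datetime) -> float:
    """Julian Date of a datetime via the toordinal() trick mentioned above."""
    day_fraction = (dt.hour + dt.minute / 60 + dt.second / 3600) / 24
    return dt.toordinal() + 1721424.5 + day_fraction

# to_julian_date(datetime(2000, 1, 1, 12)) == 2451545.0, the J2000 epoch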

ptomasula commented 3 years ago

@htaolimno @aufdenkampe. I've been trying to think of a good way to speed up the datetime conversion. Unfortunately I couldn't find a great pandas-based approach, because I cannot find a way where the first step is not to use bits_to_date on the series. Applying that function requires using the apply or map functions from pandas, both of which are faster than looping but not fast enough. So I went back to a numba approach. Numba doesn't support the creation of a datetime64 object, but it does support datetime arithmetic. Commit 1b52a17 implements an approach which calculates the offset from the unix epoch and then applies the timedelta64 object to calculate the date as a datetime64. This is a similar approach to the one I took in one of my previous comments.

This solution is not as elegant as a pandas-based one, but I think it'll get the job done. My testing showed the method averaged about 0.4 seconds for datetime conversion on my PC. This is fast enough that the reading process is now IO bound.

aufdenkampe commented 3 years ago

@ptomasula, something about these improvements to readWDM appears to have broken the running of the HSP2.main(hdfname) function. @bcous first discovered this after updating his branch from the develop-readWDM branch (see https://github.com/respec/HSPsquared/issues/40#issuecomment-792972702), and in the last few days, I've gotten the same error running Test10 from this branch using https://github.com/LimnoTech/HSPsquared/blob/develop_readWDM/tests/test10/HSP2results/TESThsp2.ipynb.

Running this:

%%prun  -l 60 -T NumbaProfile.txt -q
main(hdfname, saveall=True)

Produces this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-c797aeee59cd> in <module>
----> 1 get_ipython().run_cell_magic('prun', ' -l 60 -T NumbaProfile.txt -q', 'main(hdfname, saveall=True)\n')

~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2397             with self.builtin_trap:
   2398                 args = (magic_arg_s, cell)
-> 2399                 result = fn(*args, **kwargs)
   2400             return result
   2401 

<decorator-gen-48> in prun(self, parameter_s, cell)

~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magics/execution.py in prun(self, parameter_s, cell)
    315             arg_str += '\n' + cell
    316         arg_str = self.shell.transform_cell(arg_str)
--> 317         return self._run_with_profiler(arg_str, opts, self.shell.user_ns)
    318 
    319     def _run_with_profiler(self, code, opts, namespace):

~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magics/execution.py in _run_with_profiler(self, code, opts, namespace)
    337         prof = profile.Profile()
    338         try:
--> 339             prof = prof.runctx(code, namespace, namespace)
    340             sys_exit = ''
    341         except SystemExit:

~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/cProfile.py in runctx(self, cmd, globals, locals)
     98         self.enable()
     99         try:
--> 100             exec(cmd, globals, locals)
    101         finally:
    102             self.disable()

<string> in <module>

~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
     49 
     50             # now conditionally execute all activity modules for the op, segment
---> 51             ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
     52             flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
     53             if operation == 'RCHRES':

~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
    204         if row.MFACTOR != 1.0:
    205             temp1 *= row.MFACTOR
--> 206         t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
    207 
    208         tname = f'{row.TMEMN}{row.TMEMSB}'

~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
     78         pass
     79     elif tsfreq == None:     # Sparse time base, frequency not defined
---> 80         ts = ts.reindex(siminfo['tbase']).ffill().bfill()
     81     elif how == 'SAME':
     82         ts = ts.resample(freq).ffill()  # tsfreq >= freq assumed, or bad user choice

KeyError: 'tbase'

If I disable numba with os.environ['NUMBA_DISABLE_JIT'] = '1', and turn off profiling, I get this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-bdaf3722c4f1> in <module>
      1 # %%prun  -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)

~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
     49 
     50             # now conditionally execute all activity modules for the op, segment
---> 51             ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
     52             flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
     53             if operation == 'RCHRES':

~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
    204         if row.MFACTOR != 1.0:
    205             temp1 *= row.MFACTOR
--> 206         t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
    207 
    208         tname = f'{row.TMEMN}{row.TMEMSB}'

~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
     78         pass
     79     elif tsfreq == None:     # Sparse time base, frequency not defined
---> 80         ts = ts.reindex(siminfo['tbase']).ffill().bfill()
     81     elif how == 'SAME':
     82         ts = ts.resample(freq).ffill()  # tsfreq >= freq assumed, or bad user choice

KeyError: 'tbase'
ptomasula commented 3 years ago

@aufdenkampe, I just issued commit b0edc39 which will prevent the KeyError you encountered above. The KeyError results from the missing 'tbase' key, and I actually cannot figure out where that key-value pair should have been initialized from. The purpose of the code that is being executed is to reindex a very sparse timeseries. On line 69, there is a function which determines the frequency of the input timeseries index. I'm guessing that for sparse/irregular timeseries it reports back a None object. However, in the case of test 10 that function is currently reporting back a None object for regular timeseries as well, which is what is triggering the reindexing. I commented out the reindexing for now because of the missing key issue, and we can reevaluate the necessity of this functionality when we develop the IO abstraction class.

Also in my testing, I found the output of the updated WDM reader was shifting all timesteps by one timestep. I added commit 2014cd0 which fixes this issue. That was causing some indexing issues in some of the model modules, but hopefully that will no longer be the case.

aufdenkampe commented 3 years ago

@ptomasula, thanks for those fixes!

When I run tests/test10/HSP2results/TEST10_hsp2_compare.ipynb from the develop_readWDM branch and execute:

main(hdfname, saveall=True)

I get this output, with an error in the GQUAL module:

2021-04-08 11:11:20.36   Processing started for file test10_hsp2_dev2WDM.h5; saveall=True
2021-04-08 11:11:24.79   Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-08 11:11:24.79      PERLND P001 DELT(minutes): 60
2021-04-08 11:11:24.86         SNOW
2021-04-08 11:11:24.94         PWATER
2021-04-08 11:11:24.99         PSTEMP
2021-04-08 11:11:25.05         PWTGAS
2021-04-08 11:11:25.13      RCHRES R001 DELT(minutes): 60
2021-04-08 11:11:25.23         HYDR
2021-04-08 11:11:25.30         ADCALC
2021-04-08 11:11:25.43         CONS
2021-04-08 11:11:25.48         HTRCH
2021-04-08 11:11:25.55         SEDTRN
2021-04-08 11:11:25.75         GQUAL
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-bdaf3722c4f1> in <module>
      1 # %%prun  -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)

~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
    118 
    119                 ############ calls activity function like snow() ##############
--> 120                 errors, errmessages = function(store, siminfo, ui, ts)
    121                 ###############################################################
    122 

~/Documents/Python/limno.HSPsquared/HSP2/GQUAL.py in gqual(store, siminfo, uci, ts)
    978                                 if nexits > 1:
    979                                         for n in range(1, nexits):
--> 980                                                 toqal[n] = odqal[n] + tosqal[n]
    981 
    982                                 if avdepe > 0.17:     # simulate decay on suspended sediment

TypeError: 'float' object is not subscriptable

When I compared the HDF5 input files created by the two branches, I found a difference in the data types of the time series values.

In the develop_readWDM branch, the HDF5 TS input values are Float64:

/TIMESERIES/TS039/table (Table(8784,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Float64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (4096,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In the develop branch, the HDF5 TS input values are Float32:

/TIMESERIES/TS039/table (Table(8784,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Float32Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Note that Float64 values are used for TS values in /TIMESERIES/Saturated_Vapor_Pressure_Table without a problem on develop, so the issue is probably with the GQUAL module.

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, do you have a quick idea/fix for the error in GQUAL when the TIMESERIES values are Float64 rather than Float32? For details, see https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-815972090 above.

aufdenkampe commented 3 years ago

@ptomasula, I found another difference in the files. The Stop datetimes in the /TIMESERIES/SUMMARY table are one time step earlier with your code, even though the series length is the same. For example:

From the develop_readWDM branch:

TS Start Stop Freq Length
TS039 1976-01-01 00:00:00 1976-12-31 23:00:00 1H 8784

From the develop branch:

TS Start Stop Freq Length
TS039 1976-01-01 00:00:00 1977-01-01 00:00:00 1H 8784
aufdenkampe commented 3 years ago

@ptomasula, I've compared every other value in every table and they are otherwise completely identical.

There is only one minor exception: the row indices are in a slightly different order for the two versions of the '/RCHRES/RQUAL/FLAGS' table. I doubt that matters.

So I'm suspecting that the wrong stop time might be the cause of the errors.

ptomasula commented 3 years ago

@aufdenkampe Thanks for undertaking the comparison of the two files and identifying both of those differences.

I haven't looked at the code enough to know how the '/RCHRES/RQUAL/FLAGS' table is used in the main logic, but I'd agree that a difference in the row indices is probably not the likely culprit.

I'll track down the cause of the difference in the timeseries stop dates and we'll see if that resolves it.

aufdenkampe commented 3 years ago

@ptomasula, thanks for that. I've been doing a bit of code exploring today, and I found that the get_uci(store) function gets the run time for the model from the /CONTROL/GLOBAL table from the UCI, which for both HDF5 input files has this:

index Info
Comment Version 11 test run: PERLND and IMPLND w/ RCHR...
Start 1976-01-01 00:00
Stop 1977-01-01 00:00

So the Stop time mismatch might be the issue.

PaulDudaRESPEC commented 3 years ago

@aufdenkampe I made a small fix to GQUAL just now for a subscript out of range problem when nexits > 1. Not sure if that was the root of the problem being discussed here, but it sure wasn't helping!

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, thanks for your fix 98d44f0!

I pulled your fix into our develop_readWDM branch, but unfortunately, we still got the same error when running from our branch:

2021-04-09 11:50:24.66   Processing started for file test10_hsp2_dev2WDM.h5; saveall=True
2021-04-09 11:50:29.62   Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-09 11:50:29.63      PERLND P001 DELT(minutes): 60
2021-04-09 11:50:31.65         SNOW
2021-04-09 11:50:33.32         PWATER
2021-04-09 11:50:33.40         PSTEMP
2021-04-09 11:50:33.46         PWTGAS
2021-04-09 11:50:33.53      RCHRES R001 DELT(minutes): 60
2021-04-09 11:50:33.66         HYDR
2021-04-09 11:50:33.94         ADCALC
2021-04-09 11:50:34.12         CONS
2021-04-09 11:50:34.18         HTRCH
2021-04-09 11:50:34.25         SEDTRN
2021-04-09 11:50:34.44         GQUAL
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-bdaf3722c4f1> in <module>
      1 # %%prun  -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)

~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
    118 
    119                 ############ calls activity function like snow() ##############
--> 120                 errors, errmessages = function(store, siminfo, ui, ts)
    121                 ###############################################################
    122 

~/Documents/Python/limno.HSPsquared/HSP2/GQUAL.py in gqual(store, siminfo, uci, ts)
    978                                 if nexits > 1:
    979                                         for n in range(0, nexits-1):
--> 980                                                 toqal[n] = odqal[n] + tosqal[n]
    981 
    982                                 if avdepe > 0.17:     # simulate decay on suspended sediment

TypeError: 'float' object is not subscriptable

@ptomasula, can you update readWDM.py to add one extra time step to the Stop time in the /TIMESERIES/SUMMARY table, as shown in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-816081613? Thanks!

ptomasula commented 3 years ago

@aufdenkampe Thanks again for doing that testing. My latest commit has a fix for the stop time mismatch issues in the TIMESERIES/SUMMARY table. I added logic to have the group processing function return the group end datetime of the last group processed. The stop parameter now matches those generated from the reader in the develop branch.

Unfortunately, when I reran test10 I still get the same float indexing error in the GQUAL module. Notably, the resulting h5 file from my test run does not have the /CONTROL/GLOBAL table populated, so that could be part of the issue. Would you be able to try the latest commit and see if you have the same issue?

aufdenkampe commented 3 years ago

@ptomasula, I ran HSPsquared/tests/test10/HSP2results/TEST10_hsp2_compare.ipynb with your new commit, and I got this new error when running readWDM(wdmname, hdfname):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-02f71edae935> in <module>
      1 # %time
----> 2 readWDM(wdmname, hdfname)

~/Documents/Python/limno.HSPsquared/HSP2tools/readWDM.py in readWDM(wdmfile, hdffile, compress_output)
    129             records = np.asarray(records)
    130             offsets = np.asarray(offsets)
--> 131             dates, values = _process_groups(iarray, farray, records, offsets, tgroup)
    132 
    133             dates, values, stop_datetime = _process_groups(iarray, farray, records, offsets, tgroup)

TypeError: cannot unpack non-iterable NoneType object

I hadn't seen that error previously.

ptomasula commented 3 years ago

Whoops. Thanks @aufdenkampe for catching that my commit should have replaced line 131 in the above with line 133. Looks like a few deletions were left out of my previous commit. I added commit ae374f1f which should fix those lines.

aufdenkampe commented 3 years ago

@ptomasula, that commit fixed the readWDM error I had in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-818264729.

I have confirmed that the Stop datetimes in the /TIMESERIES/SUMMARY table are identical for the HDF5 files created on the develop and develop_readWDM branches. So this also fixed the issue described in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-816081613.

Unfortunately, I'm also still getting the TypeError: 'float' object is not subscriptable error at the GQUAL module when running main(hdfname, saveall=True) on the input HDF5 file created from the latest on the develop_readWDM branch (as shown in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-815972090).

So, maybe it is one of these remaining differences:

  1. The rows are in a different order in '/RCHRES/RQUAL/FLAGS', which provides BENRFG (benthic release flag) values for each reach.
  2. The TIMESERIES datatypes are float32 from develop vs float64 from develop_readWDM.

@PaulDudaRESPEC, any ideas?

PaulDudaRESPEC commented 3 years ago

I just made another change in our develop branch to ADCALC.py, which should resolve the issue with "'float' object is not subscriptable" in GQUAL. However, that error appears to be a bit of a red herring -- just a symptom of the problem, and I suspect it doesn't fix the underlying issue. Remember, this all works fine with Test10 in RESPEC's develop branch. So it seems there's still an issue in the LimnoTech changes to readWDM.py. Hopefully this fix gives you a more revealing error message.

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, your commit c01199ac781c5fb3f7d58fcd96cd6c69d759b4f4 worked to get HSP2 to run without errors with HDF5 files from new readWDM.

I still need to compare outputs!

PaulDudaRESPEC commented 3 years ago

@aufdenkampe Right, the fact that you hit this error in GQUAL tells me that you are in a different part of the conditional that I never get to in Test10, so that raises a concern.

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, we agree. There is still an issue here that we need to work out. @ptomasula and I just discussed a plan to figure this out. In that process, we may locally "turn off" your recent fix (https://github.com/LimnoTech/HSPsquared/commit/c01199ac781c5fb3f7d58fcd96cd6c69d759b4f4) to make sure we can get our input files to pass.

aufdenkampe commented 3 years ago

I'm back to testing the develop_readWDM branch.

@ptomasula and I recognized that it might be worth reverting his commit b0edc395b7a97d794e1dbbc742827f1e91184e4a to "Remove sparse timeseries fill" that he mentions in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-811136052 above. He commented out those lines of code because it didn't appear that Test 10b was encountering them.

I just un-commented those lines with commit 479e6d6, and I again get a similar KeyError: 'tbase', but now in the PERLND module:

main(hdfname, saveall=True)

2021-04-23 08:18:20.34   Processing started for file test10_hsp2_dev2WDM_5.h5; saveall=True
2021-04-23 08:18:25.75   Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-23 08:18:25.75      PERLND P001 DELT(minutes): 60
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-bdaf3722c4f1> in <module>
      1 # %%prun  -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)

~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
     49 
     50             # now conditionally execute all activity modules for the op, segment
---> 51             ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
     52             flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
     53             if operation == 'RCHRES':

~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
    204         if row.MFACTOR != 1.0:
    205             temp1 *= row.MFACTOR
--> 206         t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
    207 
    208         tname = f'{row.TMEMN}{row.TMEMSB}'

~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
     78         pass
     79     elif tsfreq == None:     # Sparse time base, frequency not defined
---> 80          ts = ts.reindex(siminfo['tbase']).ffill().bfill()
     81     elif how == 'SAME':
     82         ts = ts.resample(freq).ffill()  # tsfreq >= freq assumed, or bad user choice

@PaulDudaRESPEC, can you tell us a bit more about how tbase is filled out from the inputs?

PaulDudaRESPEC commented 3 years ago

A few observations: When I run test10 I never get to line 80 of utilities.py. Also, there is no 'tbase' in siminfo. Maybe instead of 'tbase' that should be 'tindex' in line 80? But the fact that you get to this line at all suggests that the timeseries you've written to the HDF5 file using readWDM.py are different than the ones written using the code in master. I would expect that you could compare the contents of your HDF5 file with the one written by master (using something like HDFView) and see how yours is different.

PaulDudaRESPEC commented 3 years ago

@aufdenkampe and @ptomasula , I think I've discovered the root of the problem here. In your version of readWDM, 'freq' never gets set. Try adding these lines of code at line 137 of readWDM:

    if tstep == 1:
        series.index.freq = freq[tcode]
    else:
        series.index.freq = str(tstep) + freq[tcode]

I don't know if that solves your issue with sparse timeseries, but it will get your test10 to run!

ptomasula commented 3 years ago

@aufdenkampe @PaulDudaRESPEC Thanks Paul!

Your observation and proposed fix above are spot on. Great work. The tip-off for us should have been line 69 of utilities.py, which is pulling ts.index.freq. When that comes up as a None object we end up on line 80. Thanks for finding that.

I implemented commit b9e635f which has an equivalent to your proposed fix and test10 now seems to execute without issue.

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, that worked! Thank you for finding that! @ptomasula is writing more details now. In the meanwhile, I'll start preparing a pull request with our better binary time series reading capabilities for readWDM.py. I see that @TongZhai is making lots of contributions to your develop branch and we want his work to build on our contributions.

aufdenkampe commented 3 years ago

I found a new issue when running the Calleg test on the develop_readWDM branch that doesn't occur on develop:

readWDM(wdmname, hdfname)
401 reading from wdm
402 reading from wdm
403 reading from wdm
404 reading from wdm
405 reading from wdm
406 reading from wdm
407 reading from wdm
408 reading from wdm
301 reading from wdm
322 reading from wdm
303 reading from wdm
304 reading from wdm
305 reading from wdm
306 reading from wdm
307 reading from wdm
312 reading from wdm
313 reading from wdm
1802 reading from wdm
1005 reading from wdm
1011 reading from wdm
776 reading from wdm
780 reading from wdm
781 reading from wdm
782 reading from wdm
800 reading from wdm
802 reading from wdm
803 reading from wdm
805 reading from wdm
806 reading from wdm
841 reading from wdm
778 reading from wdm
830 reading from wdm
831 reading from wdm
832 reading from wdm
833 reading from wdm
834 reading from wdm
836 reading from wdm
838 reading from wdm
839 reading from wdm
1169 reading from wdm
1171 reading from wdm
1239 reading from wdm
1000 reading from wdm
49 reading from wdm
141 reading from wdm
154 reading from wdm
175 reading from wdm
177 reading from wdm
187 reading from wdm
188 reading from wdm
234 reading from wdm
238 reading from wdm
249 reading from wdm
259 reading from wdm
168 reading from wdm
169 reading from wdm
190 reading from wdm
193 reading from wdm
194 reading from wdm
196 reading from wdm
227 reading from wdm
9306 reading from wdm
250 reading from wdm
321 reading from wdm
302 reading from wdm
323 reading from wdm
324 reading from wdm
325 reading from wdm
8700 reading from wdm
7021 reading from wdm
9202 reading from wdm
9101 reading from wdm
9809 reading from wdm
8300 reading from wdm
7029 reading from wdm
7020 reading from wdm
7053 reading from wdm
7065 reading from wdm
9501 reading from wdm
7057 reading from wdm
7011 reading from wdm
8100 reading from wdm
9402 reading from wdm
311 reading from wdm
9909 reading from wdm
6141 reading from wdm
6175 reading from wdm
6177 reading from wdm
6190 reading from wdm
6194 reading from wdm
6049 reading from wdm
6238 reading from wdm
6250 reading from wdm
6001 reading from wdm
6259 reading from wdm
9701 reading from wdm
7034 reading from wdm
9502 reading from wdm
242 reading from wdm
9916 reading from wdm
7001 reading from wdm
335 reading from wdm
9913 reading from wdm
7017 reading from wdm
9207 reading from wdm
9504 reading from wdm
9702 reading from wdm
9804 reading from wdm
9109 reading from wdm
9906 reading from wdm
---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-4-22674173fb80> in <module>
      1 get_ipython().run_line_magic('time', '')
----> 2 readWDM(wdmname, hdfname)

~/Documents/Python/limno.HSPsquared/HSP2tools/readWDM.py in readWDM(wdmfile, hdffile, compress_output)
    134             dates = np.array(dates)
    135             dates_converted = date_convert(dates, date_epoch, dt_year, dt_month, dt_day, dt_hour, dt_minute, dt_second)
--> 136             series = pd.Series(values, index=dates_converted)
    137             series.index.freq = str(tstep) + freq[tcode]
    138 

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    279 
    280             if index is not None:
--> 281                 index = ensure_index(index)
    282 
    283             if data is None:

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in ensure_index(index_like, copy)
   5915             index_like = copy_func(index_like)
   5916 
-> 5917     return Index(index_like)
   5918 
   5919 

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    293             from pandas import DatetimeIndex
    294 
--> 295             return _maybe_asobject(dtype, DatetimeIndex, data, copy, name, **kwargs)
    296 
    297         elif is_timedelta64_dtype(data_dtype) or is_timedelta64_dtype(dtype):

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in _maybe_asobject(dtype, klass, data, copy, name, **kwargs)
   6169         return index.astype(object)
   6170 
-> 6171     return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)
   6172 
   6173 

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
    305         name = maybe_extract_name(name, data, cls)
    306 
--> 307         dtarr = DatetimeArray._from_sequence_not_strict(
    308             data,
    309             dtype=dtype,

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in _from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    324         freq, freq_infer = dtl.maybe_infer_freq(freq)
    325 
--> 326         subarr, tz, inferred_freq = sequence_to_dt64ns(
    327             data,
    328             dtype=dtype,

~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
   1994         data = getattr(data, "_data", data)
   1995         if data.dtype != DT64NS_DTYPE:
-> 1996             data = conversion.ensure_datetime64ns(data)
   1997 
   1998         if tz is not None:

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.ensure_datetime64ns()

pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-05 00:00:00
PaulDudaRESPEC commented 3 years ago

@aufdenkampe , looks like a problem reading data set number 9906. Somewhere around the 3429th value, the dates read are incorrect, getting values far into the future.
Here's what that data set looks like when dumped to the seq format: 9906.txt

aufdenkampe commented 3 years ago

@PaulDudaRESPEC, thanks for that sequence file for the 9906 timeseries. @ptomasula and I figured some of this out on Friday, and Paul T. found the likely issue in his code. He should be working on the fix today. The sequence file will help confirm the fix.

ptomasula commented 3 years ago

@PaulDudaRESPEC @aufdenkampe

I added commit ee429de to resolve the issue Anthony reported when running the Calleg test file. Series 9906 has a block that ends on the 511th element of a record and the code was trying to process the 512th element (which was an invalid block control word). We actually had fixed this behavior previously, but it looks like the code might have been lost in refactoring.

After this commit, Test10 and Calleg can both be read by the WDMReader and the model executed. However, when I tested Brandan's RPO_SWMM48LINKS2017_wCBOD_June2019.wdm, the reader encountered a ValueError: Inferred frequency None from passed values does not conform to passed frequency 15T when trying to assign the series.index.freq of the datetime index. The freq attribute is intended to describe the spacing between timesteps in a datetime index, so I believe the ValueError is the result of trying to define a frequency for an irregular timeseries. We also know from previous testing that not having the freq attribute set (i.e. freq=None) causes issues with some of the model modules.

We could use the resample method of pandas to generate a regular timeseries, but I need a better understanding of how HSPF handles irregular timeseries before I'd recommend that solution. I think that HSPF has some different options for filling in gaps in timeseries. Maybe we can have a meeting to discuss how best to handle irregular timeseries?
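For illustration only, the resample route might look like the sketch below; how gaps should actually be filled (forward fill, interpolation, or NaN) depends on how HSPF treats missing intervals, which is the open question here:

import pandas as pd

# Hypothetical irregular series with uneven spacing
irregular = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(['2019-06-01 00:00', '2019-06-01 00:15', '2019-06-01 01:00']),
)

# Regularize onto a 15-minute grid; forward fill is just one possible gap policy
regular = irregular.resample('15T').ffill()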

PaulDudaRESPEC commented 3 years ago

@ptomasula , I have some availability this week if you'd like to schedule something. Maybe Thursday or Friday morning?

aufdenkampe commented 3 years ago

@ptomasula, @PaulDudaRESPEC, @TongZhai and I just decided to complete this issue thread with a Pull Request, even if we still don't have a fix for irregular time series.

I just created Handle irregular time series input #51 to track that work.

Closing this issue as we will merge PR #35 as soon as we resolve a merge conflict.