Closed: @aufdenkampe closed this issue 3 years ago.
Connected to #20 for info, #22 for testing, and #23 for reading into Pandas data frames and writing outputs to Parquet.
Update: LimnoTech & RESPEC met Feb. 10 and outlined a plan to rewrite the getfloats function in readWDM.py by mimicking the TimeseriesDataFromWdm function in the BASINS atcWDMfile.vb script, which does not rely on a Fortran DLL file.
@jlkittle pointed out that this WDM debugger tool would be very helpful: https://github.com/respec/FORTRAN/blob/master/lib3.0/BIN/WDMRX.EXE.
@PaulDudaRESPEC shared the scanned "WDMProgrammersGuide.pdf", which I'll post somewhere in this repo.
An initial rewrite of getfloats was implemented as the new function process_groups. The process_groups function is based on the logic in TimeseriesDataFromWdm from RESPEC's BASINS code, which processes timeseries as groups and blocks. Each group contains one or more blocks, which are structured data records. Each block begins with a control word (a 32-bit integer) that contains information about the block's structure, and then continues with one or more 32-bit float values, based on the number-of-values parameter specified in the block control word. The logic was implemented in commit c5b2a1cdd6a55eccc0db67d7840ec3eaf904dcec and was able to successfully read and convert two testing WDM files: rpo772.wdm and the test10b file TEST.WDM. However, additional testing with RPO_SWMM48LINKS2017_wCBOD_June2019.wdm revealed two issues, discussed further below.
The irregular timestep means that the exact dimensions required for a numpy array to hold the converted timeseries could not be precisely determined prior to processing the timeseries groups. Allocating a numpy array requires a fixed size, and misjudging that size results in an IndexError (which also isn't handled by the current code) and causes file processing to fail. The current solution, committed as bad3771d4d6c06e0b2aefbd322273c990a025021, allocates the numpy array in chunks and expands the array when processing the next block would store values that exceed the current boundary of the array. However, this solution is not ideal. Arrays are not really extendable; reallocating the array, even in chunks, means copying the previous array to a larger array in a new block of memory. This drastically impacts performance, and the impact of this approach will be felt even more with larger files.
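For reference, here is a minimal sketch of that chunked-growth idea; the helper name, chunk size, and dtype are illustrative stand-ins rather than the committed code:

import numpy as np

CHUNK = 10_000  # illustrative chunk size

def ensure_capacity(arr, filled, needed, chunk=CHUNK):
    # Return an array large enough to hold `filled + needed` values,
    # copying the existing data into a larger buffer only when required.
    while arr.size < filled + needed:
        bigger = np.empty(arr.size + chunk, dtype=arr.dtype)
        bigger[:filled] = arr[:filled]
        arr = bigger
    return arr

values = np.empty(CHUNK, dtype=np.float32)
filled = 0

# Before storing a block of nval values, grow the array if needed:
nval = 25_000
values = ensure_capacity(values, filled, nval)
values[filled:filled + nval] = 0.0  # stand-in for the block's float values
filled += nval

Every reallocation copies all previously stored values, which is exactly the performance cost described above.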
Next Steps:
Timeseries 134 of RPO_SWMM48LINKS2017_wCBOD_June2019.wdm contains a block which ends on the 511th element of a record. The previous logic tried to parse the 512th element of that record as a block control word; however, the 512th element was not a valid block control word, which caused issues in processing the remaining blocks in the timeseries. Working on the assumption that blocks are always at least two words long, I implemented logic in commit 129a228922539c524fcfe7c90ec65b9b19291091 which moves to the next record in the timeseries chain of records when a block ends on either the 511th or 512th element of the record.
Next Steps:
Looping in @bcous for notifications.
I've made some good progress based on Paul's previous work and the VB code. I added an alternative implementation of the process_groups() function (commit 0f5a2d1f4fef08aee5858a9f7ec5af47e55f8aa7). The two major changes I made from Paul's version are:
- Added a loop to iterate through each group, using the ending date as the stopping condition; the VB code was implemented this way.
All three of these WDM files can be read successfully now.
Moving excerpts of conversation from Teams on Mar. 1, 2021:
[Monday 10:10 AM] Paul Tomasula
Thanks Hua, excellent work on this! I reviewed the code and it looks solid to me. Have you done any comparisons between the converted time series with those from the BASINS code?
Ideally we automate this comparison as a test in the future, but for now spot checking a few time series by hand should do. I'd mostly be interested in making sure some of the time series in the RPO_SWMM48LINKS2017_wCBOD_June2019.wdm match since that file has the funky block ending on the 511th element of a record.
[Monday 10:24 AM] Hua Tao
Now I remember that the dates in the VB-generated CSV files are Julian dates. You must convert them back to a normal date/time format. I can point you to those. I placed the two folders of CSVs here: S:\DOESBIR2\HSP2-Dev\WDM-TestFiles
[Yesterday 11:12 AM] Paul Tomasula
Hey Hua, I did some spot checking on time series and the values are nearly identical. The run time is considerable though. I'm going to take a crack at some refactoring. I'll primarily focus on getting numba implemented and we'll see how it looks from there.
Moving excerpts of conversation from Teams on Mar. 3, 2021:
[9:03 AM] Paul Tomasula
Refactoring and Reader Performance
Hua and Anthony, I did some general refactoring of the WDM reader yesterday to tidy up the code (remove deprecated functions and irrelevant comments, clarify variables through better naming, raise exceptions in place of print statements for errors, etc.) (commits b600ebf3689ab01391a1b4c7b1b9a747c13c2372, 256186994681af73ccfbd94e381d3309ae1d1cbb, 2e205f12e1da856f90830b17b0d6a69aecf48a7f). I also tried to implement numba, but was ultimately unsuccessful.
Numba implementation is best accomplished through the @njit decorator, which performs just-in-time compilation of a function without using Python structures. When done correctly, this allows Python code to approach the speed of C or Fortran, with a little extra runtime for compiling. However, numba also has rather restrictive data type allowances and is particularly restrictive with datetimes. The only supported structures are numpy's datetime64 and timedelta64 objects, and even then it's not a full implementation. For example, creation of a datetime64 object is not presently possible within a numba function. Numpy's datetime object is also extremely difficult to use if you have a variable unit for the timedelta (i.e. some in years, others in days, minutes, etc.) like the WDM files do. Years and months are nonlinear, so there's no easy way to convert those units to a timedelta in seconds. I did start to think about a workaround to manually convert years and months. The implementation would be pretty messy though, and would reduce code readability and maintainability. Also, the complexity of the workaround doesn't guarantee a runtime boost, because it needs to run a bunch of extra calculations to convert years and months to days.
I guess my question is, what do you all think about this? The runtime is still considerable. It took my machine over 20 minutes to convert RPO_SWMM48LINKS2017_wCBOD_June2019.wdm to an HDF file. Should we leave it at this? Any other thoughts to bump up performance? Parallelization will help a little bit, but we'd still be looking at a considerable runtime.
[9:32 AM] Hua Tao
Paul, I never used Numba so I don't have a lot of insight about it. The code that we developed to read WDM files is pretty simple. The only part that might be time consuming is probably the date/time calculations in Pandas. The rest of the code is very primitive calculations and loops. I don't know whether and how much Numba is able to recompile and speed those up. I suspect that writing an HDF5 file (hundreds of MB) to disk is probably the most time consuming part. If we only keep the WDM data in RAM, it is probably not too bad. We can strip the HDF5 writing out of the reader, make it a separate ExportToHD5() method, and only call it when needed.
[9:35 AM] Anthony Aufdenkampe
Paul, I'm wondering if we should follow Hua's suggestion to skip use of Numpy until the very end of the ingestion process, or at the very least not convert the datetime from Fortran (isn't it just a float?) until after all the looping is done. That way you can use Numba.
[9:51 AM] Paul Tomasula
A couple of quick responses though. I isolated the process_groups method for testing, and just running that I average about 3 seconds per time series. The RPO...June2019.wdm file has 432 time series, so it's ~21 minutes for executing that step. This definitely appears to be at least one of the bottlenecks. The writing could also be pretty substantial, though I haven't tested that.
The issue with isolating the datetime conversions is that it's currently integral to the processing loop. We use the datetime to indicate when the loop needs to stop executing. Even putting that piece aside for now, the only datetime information stored is in the block control word, which specifies the timedelta for each value. This means that each timestep depends on the previous timestep in its calculation. So we could maybe store the timedelta information to have a numba-compatible loop, but I'd imagine looping through a second time to convert the timedeltas to a timeseries is probably slower than our current approach.
[9:57 AM] Hua Tao
Paul, then I would think parallelization would help improve speed. There are no dependencies between time series. This approach is low-hanging fruit, I would think. What do you think?
[10:00 AM] Paul Tomasula
Yeah I agree. We definitely want parallelization added in as well regardless. Maybe as a first step we implement parallelization and see what performance looks like? Then revisit refining the processing method only if necessary.
[11:43 AM] Anthony Aufdenkampe
Paul, that's great that you were able to profile the typical speed for each time series. I agree that 3 s per series sounds fast, but perhaps you could speed it up to 0.3 s or faster by using Numba. Let's try to get the time stamp conversion outside the loop.
Hua, what is the total run time of the VB code to run that same RPO...June2019.wdm file? Knowing that would be very helpful.
Paul, when it's time to parallelize, I would really like to go with Dask for this and for the rest of the repo. It looks relatively simple according to this tutorial: http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html. Also, Dask provides the Dask DataFrame, which is very similar to a Pandas DataFrame but has built-in parallelization and also allows for very large datasets that are bigger than RAM. We might want to use these instead of Pandas dataframes throughout the repo.
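To make the Dask suggestion concrete, here is a minimal sketch of farming out the per-timeseries conversion with dask.delayed; process_timeseries and dsn_list are hypothetical placeholders, not existing functions in the repo:

import dask
from dask import delayed

@delayed
def process_timeseries(dsn):
    # stand-in for reading and converting one WDM dataset number (DSN)
    return dsn

dsn_list = [39, 41, 46]  # hypothetical dataset numbers
results = dask.compute(*[process_timeseries(d) for d in dsn_list])

Because each WDM timeseries is independent, the tasks share no state and can be scheduled across threads or processes.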
[12:02 PM] Hua Tao
VB.NET uses a just-in-time compiler, so performance-wise it will be worse than, but comparable to, FORTRAN. I haven't profiled performance for this particular reader; I would think it's 10 to 20 times faster than Python, based on my past experience. It's great that you suggested going with Dask. There are many options, and this saves us time.
Anthony Aufdenkampe
Hua, great! Could you try timing the VB.net code today or tomorrow?
[12:22 PM] Paul Tomasula
Thanks Hua, Sounds good! I need to change gears for this afternoon but knowing the runtime for VB.NET will be helpful to see the level of performance we're aiming for. In the meantime I'll start thinking about if it is possible to pull the time conversion out into a separate loop or some other work around. Our call with Jack on Friday might also offer up some clarity. If we can figure out how to decouple datetime from the loop control (using an indexing approach or something similar) then it might be possible to perform those conversions in a later step.
[12:42 PM] Hua Tao
VB.NET: half second on average for each time series. Yes, only 1/2 second each time series. This is still in debug mode. Release mode will be even faster.
[1:56 PM] Anthony Aufdenkampe
Hua, thanks for timing it! So ~6x faster in debug mode, and faster once compiled. That's our goal! I suspect we can get close to that fast if we can use numba, and faster once we apply Dask.
Paul & Hua, let's chat sometime before Friday to figure this out. Remind me how datetime is stored in the WDM again? Maybe we emulate that.
@aufdenkampe, datetime is stored within the groups structure. The first line of a group contains the start date of the group, followed by a series of blocks. Each block starts with a block control word followed by one or more data values. Through bit manipulation, the block control word is parsed into information about the block's structure, including the number of values in the block, the timestep, and the timestep unit. In this way you can increment the previous date (or the group's start date if you are processing the first block in the group) to calculate the date for the value you are currently reading.
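A hypothetical sketch of that bit manipulation is below. The exact bit positions must be verified against the WDM Programmer's Guide (Appendix B) and atcWDMfile.vb; the layout shown here (quality code in the low bits, then a compression flag, the time-unit code, the time step, and the value count in the high 16 bits) is an assumption for illustration only:

def split_control_word(bcw: int):
    # Unpack a WDM block control word (assumed layout; verify against the guide).
    nval   = bcw >> 16           # number of values in the block
    ltstep = (bcw >> 10) & 0x3F  # time step, in units of ltcode
    ltcode = (bcw >> 7) & 0x07   # time-unit code (e.g. minutes, hours, days)
    comp   = (bcw >> 5) & 0x03   # compression flag
    qual   = bcw & 0x1F          # quality code
    return nval, ltstep, ltcode, comp, qual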
@ptomasula, thanks! I was thinking it was something like that. I also was glancing over the WDMProgrammersGuide.pdf and saw a description on page 101 (Appendix B), below.
So, within every time series data block, the BLOCK START date time is stored as a series of integers (after decoding from bits), and the BLOCK TIME UNITS and BLOCK TIME STEP are also stored as integers.
Why don't we convert these times to decimal days (floats) after a certain reference date, similar to the VB code? We would then be able to do the translation into Pandas datetimes in one step after everything is read.
@htaolimno, what is the start day for the time in the CSV files you exported using VB? In other words, what date corresponds to a value of 37987? In Excel, January 1, 1900 has a date serial number of 1. If these were designed for use in Excel, then 37987.000 = January 1, 2004 at 0:00 hours (midnight).
@aufdenkampe Datetime in WDM is encoded in a 32-bit binary word (see the attached image). In the VB code, Julian dates are calculated. The value is the same as in Excel: January 1, 1900 has a date serial number of 1.
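A quick side calculation (not project code) confirms the Excel-style serial date claim; 1899-12-30 is the usual conversion base because it absorbs Excel's fictitious 1900-02-29 for dates after February 1900:

from datetime import datetime, timedelta

print(datetime(1899, 12, 30) + timedelta(days=37987))  # 2004-01-01 00:00:00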
@aufdenkampe @htaolimno I gave this some more thought last night and came up with a conceptual design for a numba-compatible date converter. This approach hinges on numpy arrays, so we'll need to figure out if there's a way to determine the size of an array before processing a group/timeseries. We also need the block-processing loop control to be independent of datetime. Perhaps we can go back to the frepos approach.
Assuming we can meet those conditions, this method involves tracking the time offset from the epoch (1970-01-01T00:00:00). While reading the blocks, separate arrays are used to store the offset of a given value in each time unit. Later we can apply those offsets within a numba-compatible function to generate a date array. See the attached notebook, which tests the conversion of 600K datetimes (comparable to the length of the timeseries in the RPO testing files). The initial run plus compilation takes ~0.5 seconds; after compilation it runs in ~0.01 seconds on my machine.
I think there are still some questions to answer to see if we can meet the conditions required to implement this method, but I think this is the direction we'll want to head if possible.
@ptomasula, thanks for thinking through and testing that approach. It definitely looks promising.
Regarding my suggestion in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-790113950 above, I'm aware that the date time is encoded as a binary, which is decoded into sets of integers for year-month-day-hour and in some cases also minute-second.
My thinking is that we collect these datetime stamps in one of those formats (a byte object or a list/tuple of integers) and then convert them later, after the loops are completed.
Alternatively, we do the initial conversion to decimal days after 1/1/1900 (a float) inside the loop, using the same algorithm as the VB code.
This allows us to avoid needing the special datetime64 data type and to defer preallocating the size of the Numpy array until after all the looping is done.
I just created, using my wife's Acrobat Pro, a searchable version of the WDM guide that also has bookmarks for sidebar navigation. This should help a ton. See WDMProgrammersGuide+Search+Nav.pdf
@aufdenkampe Thanks for the searchable PDF. This is most wonderful!
That is an interesting thought about using decimal days. I tried a similar approach early on, but the hurdle was how to deal with timestep units of years and months. I don't have a great method for determining the current month of a decimal day, which is critical in determining how many days to add when incrementing by month. Also, leap years are difficult to handle. The BASINS code uses a function TimAddJ which I assume handles these complications. I cannot seem to find the actual code for that function though. Any chance you can work your GitHub magic to locate the function definition?
Though I also like your suggestion about using some type of bit format to store date information. That avoids the annoying issues around using the datetime64 object, like the implementation in my previous comment had. This way we can perform the date incrementing inside the main block-processing loop and still use the date as the loop control. The main challenge with this approach is balancing the runtime for the conversion. To my knowledge, Numba doesn't support the creation of any type of datetime object, so we'd need to handle that conversion with some alternative method. On the upside, in this approach the conversions would be completely independent of the previous timestep, so we can vectorize them. That's probably fast enough on its own without the need to leverage Numba, so I think this idea has considerable merit.
@aufdenkampe @htaolimno, I just added commit e5d64a1 which applies the bit date handling approach we discussed above. This also supports numba for the _process_groups function.
The runtime is still considerable at about 4 seconds per timeseries, but the bottleneck has been better contained and moved to a single conversion-to-datetime step. The actual _process_groups function runs in ~0.08 seconds on my machine; the conversion to datetime is ~3 seconds, and then ~1 second for writing to HDF. The current datetime conversion approach uses the pandas apply function with a lambda, which is probably not the most efficient approach. I'll work on vectorizing it.
@ptomasula I think all the Julian date related utility functions are in the BASINS modDate.vb. Also, you may find a Python implementation of Julian dates at this link.
Thanks @htaolimno. I think these are good references. We'll have to see if it's necessary to implement these. In theory we can just continue with the bit-datetime approach I've implemented. Once we have the dates as Datetime objects, we can use this trick to get a Julian Date:
Datetime.toordinal() + 1721424.5
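A quick check of that trick (the proleptic Gregorian ordinal plus 1721424.5 gives the astronomical Julian Date at midnight):

from datetime import datetime

print(datetime(2004, 1, 1).toordinal() + 1721424.5)  # 2453005.5, the JD for 2004-01-01 00:00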
@htaolimno @aufdenkampe. I've been trying to think of a good way to speed up the datetime conversion. Unfortunately, I couldn't find a great pandas-based approach, because I cannot find a way where the first step is not to apply bits_to_date to the series. Applying that function requires using the apply or map functions from pandas, both of which are faster than looping but not fast enough. So I went back to a numba approach. Numba doesn't support the creation of a datetime64 object, but it does support datetime arithmetic. Commit 1b52a17 implements an approach which calculates the offset from the unix epoch and then applies a timedelta64 object to calculate the date as a datetime64. This is similar to the approach I took in one of my previous comments.
This solution is not as elegant as a pandas-based one, but I think it'll get the job done. My testing showed this method averaged about 0.4 seconds for the datetime conversion on my PC. This is fast enough that the reading process is now IO bound.
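For readers following along, here is a vectorized numpy sketch of the offset-from-epoch idea. This is not the committed date_convert function (which runs under numba and takes the epoch and unit timedeltas as arguments); it only illustrates how the nonlinear year/month offsets can be applied at month resolution and the remaining offsets in seconds:

import numpy as np

def offsets_to_datetime64(years, months, days, hours, minutes, seconds):
    # Convert integer offsets from the Unix epoch (1970-01-01T00:00:00)
    # into a datetime64[s] array. Years and months are nonlinear, so they
    # are applied at month resolution first; the rest are applied in seconds.
    month_part = (years * 12 + months).astype('timedelta64[M]')
    dates = np.datetime64('1970-01', 'M') + month_part   # datetime64[M]
    dates = dates.astype('datetime64[s]')
    dates += days.astype('timedelta64[D]').astype('timedelta64[s]')
    dates += hours.astype('timedelta64[h]').astype('timedelta64[s]')
    dates += minutes.astype('timedelta64[m]').astype('timedelta64[s]')
    dates += seconds.astype('timedelta64[s]')
    return dates

# Three hourly values starting 1976-01-01 00:00 (offsets from 1970):
n = 3
zeros = np.zeros(n, dtype=np.int64)
print(offsets_to_datetime64(np.full(n, 6), zeros, zeros, np.arange(n), zeros, zeros))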
@ptomasula, something about these improvements to readWDM appears to have broken the running of the HSP2.main(hdfname) function. @bcous first discovered this after updating his branch from the develop-readWDM branch (see https://github.com/respec/HSPsquared/issues/40#issuecomment-792972702), and in the last few days, I've gotten the same error running Test10 from this branch using https://github.com/LimnoTech/HSPsquared/blob/develop_readWDM/tests/test10/HSP2results/TESThsp2.ipynb.
Running this:
%%prun -l 60 -T NumbaProfile.txt -q
main(hdfname, saveall=True)
Produces this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-17-c797aeee59cd> in <module>
----> 1 get_ipython().run_cell_magic('prun', ' -l 60 -T NumbaProfile.txt -q', 'main(hdfname, saveall=True)\n')
~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2397 with self.builtin_trap:
2398 args = (magic_arg_s, cell)
-> 2399 result = fn(*args, **kwargs)
2400 return result
2401
<decorator-gen-48> in prun(self, parameter_s, cell)
~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magics/execution.py in prun(self, parameter_s, cell)
315 arg_str += '\n' + cell
316 arg_str = self.shell.transform_cell(arg_str)
--> 317 return self._run_with_profiler(arg_str, opts, self.shell.user_ns)
318
319 def _run_with_profiler(self, code, opts, namespace):
~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/site-packages/IPython/core/magics/execution.py in _run_with_profiler(self, code, opts, namespace)
337 prof = profile.Profile()
338 try:
--> 339 prof = prof.runctx(code, namespace, namespace)
340 sys_exit = ''
341 except SystemExit:
~/opt/anaconda3/envs/hsp2_py38/lib/python3.8/cProfile.py in runctx(self, cmd, globals, locals)
98 self.enable()
99 try:
--> 100 exec(cmd, globals, locals)
101 finally:
102 self.disable()
<string> in <module>
~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
49
50 # now conditionally execute all activity modules for the op, segment
---> 51 ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
52 flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
53 if operation == 'RCHRES':
~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
204 if row.MFACTOR != 1.0:
205 temp1 *= row.MFACTOR
--> 206 t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
207
208 tname = f'{row.TMEMN}{row.TMEMSB}'
~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
78 pass
79 elif tsfreq == None: # Sparse time base, frequency not defined
---> 80 ts = ts.reindex(siminfo['tbase']).ffill().bfill()
81 elif how == 'SAME':
82 ts = ts.resample(freq).ffill() # tsfreq >= freq assumed, or bad user choice
KeyError: 'tbase'
If I disable numba with os.environ['NUMBA_DISABLE_JIT'] = '1' and turn off profiling, I get this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-bdaf3722c4f1> in <module>
1 # %%prun -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)
~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
49
50 # now conditionally execute all activity modules for the op, segment
---> 51 ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
52 flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
53 if operation == 'RCHRES':
~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
204 if row.MFACTOR != 1.0:
205 temp1 *= row.MFACTOR
--> 206 t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
207
208 tname = f'{row.TMEMN}{row.TMEMSB}'
~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
78 pass
79 elif tsfreq == None: # Sparse time base, frequency not defined
---> 80 ts = ts.reindex(siminfo['tbase']).ffill().bfill()
81 elif how == 'SAME':
82 ts = ts.resample(freq).ffill() # tsfreq >= freq assumed, or bad user choice
KeyError: 'tbase'
@aufdenkampe, I just issued commit b0edc39, which will prevent the KeyError you encountered above. The KeyError results from the missing 'tbase' key, and I actually cannot figure out where that key-value pair should have been initialized. The purpose of the code being executed is to reindex a very sparse timeseries. On line 69 there is a function that determines the frequency of the input timeseries index; I'm guessing that for sparse/irregular timeseries it reports back a None object. However, in the case of test 10 that function is currently reporting back a None object for regular timeseries as well, which is what is triggering the reindexing. I commented out the reindexing for now because of the missing key issue, and we can reevaluate the necessity of this functionality when we develop the IO abstraction class.
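For context on what that commented-out line was doing, a small illustration; tindex here is a stand-in for whatever siminfo['tbase'] was meant to hold (the full simulation index):

import pandas as pd

tindex = pd.date_range('1976-01-01', periods=24, freq='H')   # stand-in for siminfo['tbase']
sparse = pd.Series([1.0, 3.0], index=[tindex[0], tindex[12]])
dense = sparse.reindex(tindex).ffill().bfill()                # what line 80 of utilities.py does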
Also, in my testing I found the output of the updated WDMReader was shifting all timesteps by one timestep. I added commit 2014cd0, which fixes this issue. That shift was causing some indexing issues in some of the model modules, but hopefully that will no longer be the case.
@ptomasula, thanks for those fixes!
When I run tests/test10/HSP2results/TEST10_hsp2_compare.ipynb from the develop_readWDM branch and run:
main(hdfname, saveall=True)
I get this output, with an error in the GQUAL module:
2021-04-08 11:11:20.36 Processing started for file test10_hsp2_dev2WDM.h5; saveall=True
2021-04-08 11:11:24.79 Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-08 11:11:24.79 PERLND P001 DELT(minutes): 60
2021-04-08 11:11:24.86 SNOW
2021-04-08 11:11:24.94 PWATER
2021-04-08 11:11:24.99 PSTEMP
2021-04-08 11:11:25.05 PWTGAS
2021-04-08 11:11:25.13 RCHRES R001 DELT(minutes): 60
2021-04-08 11:11:25.23 HYDR
2021-04-08 11:11:25.30 ADCALC
2021-04-08 11:11:25.43 CONS
2021-04-08 11:11:25.48 HTRCH
2021-04-08 11:11:25.55 SEDTRN
2021-04-08 11:11:25.75 GQUAL
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-bdaf3722c4f1> in <module>
1 # %%prun -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)
~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
118
119 ############ calls activity function like snow() ##############
--> 120 errors, errmessages = function(store, siminfo, ui, ts)
121 ###############################################################
122
~/Documents/Python/limno.HSPsquared/HSP2/GQUAL.py in gqual(store, siminfo, uci, ts)
978 if nexits > 1:
979 for n in range(1, nexits):
--> 980 toqal[n] = odqal[n] + tosqal[n]
981
982 if avdepe > 0.17: # simulate decay on suspended sediment
TypeError: 'float' object is not subscriptable
When I compared the HDF5 input files created by the two branches, I found a difference in the data types of the Time Series values.
In the develop_readWDM branch, the HDF5 TS input values are Float64:
/TIMESERIES/TS039/table (Table(8784,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values": Float64Col(shape=(), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (4096,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
In the develop branch, the HDF5 TS input values are Float32:
/TIMESERIES/TS039/table (Table(8784,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values": Float32Col(shape=(), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (5461,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Note that Float64 values are used for the TS values in /TIMESERIES/Saturated_Vapor_Pressure_Table without a problem on develop, so the issue is probably with the GQUAL module.
@PaulDudaRESPEC, do you have a quick idea/fix for the error in GQUAL when the TIMESERIES values are Float64 rather than Float32? For details, see https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-815972090 above.
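Should the dtype difference turn out to matter, a minimal way to align the branches would be to cast before writing; this is illustrative only, not a committed change:

import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0])              # the new reader currently produces float64
series = pd.Series(values.astype(np.float32))   # would match develop's Float32Col on write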
@ptomasula, I found another difference in the files. The Stop datetimes in the /TIMESERIES/SUMMARY table are shorter by one time step with your code, even when the series length is the same. For example:
From the develop_readWDM branch:
TS | Start | Stop | Freq | Length |
---|---|---|---|---|
TS039 | 1976-01-01 00:00:00 | 1976-12-31 23:00:00 | 1H | 8784 |
From the develop branch:
TS | Start | Stop | Freq | Length |
---|---|---|---|---|
TS039 | 1976-01-01 00:00:00 | 1977-01-01 00:00:00 | 1H | 8784 |
@ptomasula, I've compared every other value in every table and they are otherwise completely identical.
There is only one minor exception: the row indices are in a slightly different order for the two versions of the '/RCHRES/RQUAL/FLAGS' table. I doubt that matters.
So I'm suspecting that the wrong stop time might be the cause of the errors.
@aufdenkampe Thanks for undertaking the comparison of the two files and identifying both of those differences.
I haven't looked at the code enough to know how the '/RCHRES/RQUAL/FLAGS' table is used in the main logic, but I'd agree that a difference in the row indices is probably not the likely culprit.
I'll track down the cause of the difference in the timeseries stop dates and we'll see if that resolves it.
@ptomasula, thanks for that. I've been doing a bit of code exploring today, and I found that the get_uci(store) function gets the run time for the model from the /CONTROL/GLOBAL table from the UCI, which for both HDF5 input files has this:
index | Info |
---|---|
Comment | Version 11 test run: PERLND and IMPLND w/ RCHR... |
Start | 1976-01-01 00:00 |
Stop | 1977-01-01 00:00 |
So the Stop time mismatch might be the issue.
@aufdenkampe I made a small fix to GQUAL just now for a subscript out of range problem when nexits > 1. Not sure if that was the root of the problem being discussed here, but it sure wasn't helping!
@PaulDudaRESPEC, thanks for your fix 98d44f0!
I pulled your fix into our develop_readWDM branch, but unfortunately we still got the same error when running from our branch:
2021-04-09 11:50:24.66 Processing started for file test10_hsp2_dev2WDM.h5; saveall=True
2021-04-09 11:50:29.62 Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-09 11:50:29.63 PERLND P001 DELT(minutes): 60
2021-04-09 11:50:31.65 SNOW
2021-04-09 11:50:33.32 PWATER
2021-04-09 11:50:33.40 PSTEMP
2021-04-09 11:50:33.46 PWTGAS
2021-04-09 11:50:33.53 RCHRES R001 DELT(minutes): 60
2021-04-09 11:50:33.66 HYDR
2021-04-09 11:50:33.94 ADCALC
2021-04-09 11:50:34.12 CONS
2021-04-09 11:50:34.18 HTRCH
2021-04-09 11:50:34.25 SEDTRN
2021-04-09 11:50:34.44 GQUAL
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-bdaf3722c4f1> in <module>
1 # %%prun -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)
~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
118
119 ############ calls activity function like snow() ##############
--> 120 errors, errmessages = function(store, siminfo, ui, ts)
121 ###############################################################
122
~/Documents/Python/limno.HSPsquared/HSP2/GQUAL.py in gqual(store, siminfo, uci, ts)
978 if nexits > 1:
979 for n in range(0, nexits-1):
--> 980 toqal[n] = odqal[n] + tosqal[n]
981
982 if avdepe > 0.17: # simulate decay on suspended sediment
TypeError: 'float' object is not subscriptable
@ptomasula, can you update readWDM.py to add one extra time step to the Stop time in the /TIMESERIES/SUMMARY table, as shown in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-816081613? Thanks!
@aufdenkampe Thanks again for doing that testing. My latest commit has a fix for the stop time mismatch issue in the TIMESERIES/SUMMARY table. I added logic to have the group-processing function return the group end datetime of the last group processed. The Stop parameter now matches those generated by the reader in the develop branch.
Unfortunately, when I reran test10 I still got the same float indexing error in the GQUAL module. Notably, the resultant h5 file from my test run does not have the /CONTROL/GLOBAL table populated, so that could be part of the issue. Would you be able to try the latest commit and see if you have the same issue?
@ptomasula, I ran HSPsquared/tests/test10/HSP2results/TEST10_hsp2_compare.ipynb with your new commit, and I got this new error when running readWDM(wdmname, hdfname):
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-02f71edae935> in <module>
1 # %time
----> 2 readWDM(wdmname, hdfname)
~/Documents/Python/limno.HSPsquared/HSP2tools/readWDM.py in readWDM(wdmfile, hdffile, compress_output)
129 records = np.asarray(records)
130 offsets = np.asarray(offsets)
--> 131 dates, values = _process_groups(iarray, farray, records, offsets, tgroup)
132
133 dates, values, stop_datetime = _process_groups(iarray, farray, records, offsets, tgroup)
TypeError: cannot unpack non-iterable NoneType object
I hadn't seen that error previously.
Whoops. Thanks @aufdenkampe for catching that; my commit should have replaced line 131 in the above with line 133. Looks like a few deletions were left out of my previous commit. I added commit ae374f1f, which should fix those lines.
@ptomasula, that commit fixed the readWDM error I had in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-818264729.
I have confirmed that the Stop datetimes in the /TIMESERIES/SUMMARY table are identical for the HDF5 files created on the develop and develop_readWDM branches. So this also fixed the issue described in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-816081613.
Unfortunately, I'm also still getting the TypeError: 'float' object is not subscriptable error in the GQUAL module when running main(hdfname, saveall=True) on the input HDF5 file created from the latest on the develop_readWDM branch (as shown in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-815972090).
So, maybe it is one of these remaining differences:
- BENRFG (benthic release flag) values for each reach
- float32 from develop vs float64 from develop_readWDM

@PaulDudaRESPEC, any ideas?
I just made another change in our develop branch, to ADCALC.py, which should resolve the issue with "'float' object is not subscriptable" in GQUAL. However, that error appears to be a bit of a red herring -- just a symptom of the problem, and I suspect it doesn't fix the underlying issue. Remember, this all works fine with Test10 in RESPEC's develop branch, so it seems there's still an issue in the LimnoTech changes to readWDM.py. Hopefully this fix gives you a more revealing error message.
@PaulDudaRESPEC, your commit c01199ac781c5fb3f7d58fcd96cd6c69d759b4f4 worked to get HSP2 to run without errors on HDF5 files from the new readWDM. I still need to compare outputs!
@aufdenkampe Right, the fact that you hit this error in GQUAL tells me that you are in a different part of the conditional that I never get to in Test10, so that raises a concern.
@PaulDudaRESPEC, we agree. There is still an issue here that we need to work out. @ptomasula and I just discussed a plan to figure this out. In that process, we may locally "turn off" your recent fix (https://github.com/LimnoTech/HSPsquared/commit/c01199ac781c5fb3f7d58fcd96cd6c69d759b4f4) to make sure we can get our input files to pass.
I'm back to testing the develop_readWDM branch.
@ptomasula and I recognized that it might be worth reversing his commit b0edc395b7a97d794e1dbbc742827f1e91184e4a, "Remove sparse timeseries fill", which he mentions in https://github.com/LimnoTech/HSPsquared/issues/21#issuecomment-811136052 above. He commented out those lines of code because it didn't appear that Test 10b was encountering them.
I just un-commented those lines with commit 479e6d6, and I again get a similar KeyError: 'tbase', but now in the PERLND module:
main(hdfname, saveall=True)
2021-04-23 08:18:20.34 Processing started for file test10_hsp2_dev2WDM_5.h5; saveall=True
2021-04-23 08:18:25.75 Simulation Start: 1976-01-01 00:00:00, Stop: 1977-01-01 00:00:00
2021-04-23 08:18:25.75 PERLND P001 DELT(minutes): 60
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-bdaf3722c4f1> in <module>
1 # %%prun -l 60 -T NumbaProfile.txt -q
----> 2 main(hdfname, saveall=True)
~/Documents/Python/limno.HSPsquared/HSP2/main.py in main(hdfname, saveall, jupyterlab)
49
50 # now conditionally execute all activity modules for the op, segment
---> 51 ts = get_timeseries(store,ddext_sources[(operation,segment)],siminfo)
52 flags = uci[(operation, 'GENERAL', segment)]['ACTIVITY']
53 if operation == 'RCHRES':
~/Documents/Python/limno.HSPsquared/HSP2/main.py in get_timeseries(store, ext_sourcesdd, siminfo)
204 if row.MFACTOR != 1.0:
205 temp1 *= row.MFACTOR
--> 206 t = transform(temp1, row.TMEMN, row.TRAN, siminfo)
207
208 tname = f'{row.TMEMN}{row.TMEMSB}'
~/Documents/Python/limno.HSPsquared/HSP2/utilities.py in transform(ts, name, how, siminfo)
78 pass
79 elif tsfreq == None: # Sparse time base, frequency not defined
---> 80 ts = ts.reindex(siminfo['tbase']).ffill().bfill()
81 elif how == 'SAME':
82 ts = ts.resample(freq).ffill() # tsfreq >= freq assumed, or bad user choice
@PaulDudaRESPEC, can you tell us a bit more about how tbase is filled out from the inputs?
A few observations: When I run test10 I never get to line 80 of utilities.py. Also, there is no 'tbase' in siminfo. Maybe instead of 'tbase' that should be 'tindex' in line 80? But the fact that you get to this line at all suggests that the timeseries you've written to the HDF5 file using readWDM.py are different from the ones written using the code in master. I would expect that you could compare the contents of your HDF5 file with the one written by master (using something like HDFView) and see how yours is different.
@aufdenkampe and @ptomasula , I think I've discovered the root of the problem here. In your version of readWDM, 'freq' never gets set. Try adding these lines of code at line 137 of readWDM:
if tstep == 1:
    series.index.freq = freq[tcode]
else:
    series.index.freq = str(tstep) + freq[tcode]
I don't know if that solves your issue with sparse timeseries, but it will get your test10 to run!
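To make the proposed fix concrete, this is how a tstep/tcode pair would turn into a pandas frequency alias, assuming a hypothetical unit-code mapping (the actual freq dict lives in readWDM.py and should be checked there); note the '15T' result matches the ValueError reported later in this thread:

freq = {1: 'S', 2: 'T', 3: 'H', 4: 'D', 5: 'MS', 6: 'YS'}   # hypothetical mapping

tstep, tcode = 15, 2                                         # e.g. 15-minute data
alias = freq[tcode] if tstep == 1 else str(tstep) + freq[tcode]
print(alias)                                                 # '15T'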
@aufdenkampe @PaulDudaRESPEC Thanks Paul!
Your observation and proposed fix above are spot on. Great work. The tip-off for us should have been line 69 of utilities.py, which pulls ts.index.freq; when that comes up as a None object we end up on line 80. Thanks for finding that.
I implemented commit b9e635f which has an equivalent to your proposed fix and test10 now seems to execute without issue.
@PaulDudaRESPEC, that worked! Thank you for finding that! @ptomasula is writing more details now. In the meanwhile, I'll start preparing a pull request with our better binary time series reading capabilities for readWDM.py. I see that @TongZhai is making lots of contributions to your develop branch and we want his work to build on our contributions.
I found a new issue when running the Calleg test on the develop_readWDM branch that doesn't occur on develop:
readWDM(wdmname, hdfname)
401 reading from wdm
402 reading from wdm
403 reading from wdm
404 reading from wdm
405 reading from wdm
406 reading from wdm
407 reading from wdm
408 reading from wdm
301 reading from wdm
322 reading from wdm
303 reading from wdm
304 reading from wdm
305 reading from wdm
306 reading from wdm
307 reading from wdm
312 reading from wdm
313 reading from wdm
1802 reading from wdm
1005 reading from wdm
1011 reading from wdm
776 reading from wdm
780 reading from wdm
781 reading from wdm
782 reading from wdm
800 reading from wdm
802 reading from wdm
803 reading from wdm
805 reading from wdm
806 reading from wdm
841 reading from wdm
778 reading from wdm
830 reading from wdm
831 reading from wdm
832 reading from wdm
833 reading from wdm
834 reading from wdm
836 reading from wdm
838 reading from wdm
839 reading from wdm
1169 reading from wdm
1171 reading from wdm
1239 reading from wdm
1000 reading from wdm
49 reading from wdm
141 reading from wdm
154 reading from wdm
175 reading from wdm
177 reading from wdm
187 reading from wdm
188 reading from wdm
234 reading from wdm
238 reading from wdm
249 reading from wdm
259 reading from wdm
168 reading from wdm
169 reading from wdm
190 reading from wdm
193 reading from wdm
194 reading from wdm
196 reading from wdm
227 reading from wdm
9306 reading from wdm
250 reading from wdm
321 reading from wdm
302 reading from wdm
323 reading from wdm
324 reading from wdm
325 reading from wdm
8700 reading from wdm
7021 reading from wdm
9202 reading from wdm
9101 reading from wdm
9809 reading from wdm
8300 reading from wdm
7029 reading from wdm
7020 reading from wdm
7053 reading from wdm
7065 reading from wdm
9501 reading from wdm
7057 reading from wdm
7011 reading from wdm
8100 reading from wdm
9402 reading from wdm
311 reading from wdm
9909 reading from wdm
6141 reading from wdm
6175 reading from wdm
6177 reading from wdm
6190 reading from wdm
6194 reading from wdm
6049 reading from wdm
6238 reading from wdm
6250 reading from wdm
6001 reading from wdm
6259 reading from wdm
9701 reading from wdm
7034 reading from wdm
9502 reading from wdm
242 reading from wdm
9916 reading from wdm
7001 reading from wdm
335 reading from wdm
9913 reading from wdm
7017 reading from wdm
9207 reading from wdm
9504 reading from wdm
9702 reading from wdm
9804 reading from wdm
9109 reading from wdm
9906 reading from wdm
---------------------------------------------------------------------------
OutOfBoundsDatetime Traceback (most recent call last)
<ipython-input-4-22674173fb80> in <module>
1 get_ipython().run_line_magic('time', '')
----> 2 readWDM(wdmname, hdfname)
~/Documents/Python/limno.HSPsquared/HSP2tools/readWDM.py in readWDM(wdmfile, hdffile, compress_output)
134 dates = np.array(dates)
135 dates_converted = date_convert(dates, date_epoch, dt_year, dt_month, dt_day, dt_hour, dt_minute, dt_second)
--> 136 series = pd.Series(values, index=dates_converted)
137 series.index.freq = str(tstep) + freq[tcode]
138
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
279
280 if index is not None:
--> 281 index = ensure_index(index)
282
283 if data is None:
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in ensure_index(index_like, copy)
5915 index_like = copy_func(index_like)
5916
-> 5917 return Index(index_like)
5918
5919
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
293 from pandas import DatetimeIndex
294
--> 295 return _maybe_asobject(dtype, DatetimeIndex, data, copy, name, **kwargs)
296
297 elif is_timedelta64_dtype(data_dtype) or is_timedelta64_dtype(dtype):
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in _maybe_asobject(dtype, klass, data, copy, name, **kwargs)
6169 return index.astype(object)
6170
-> 6171 return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)
6172
6173
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
305 name = maybe_extract_name(name, data, cls)
306
--> 307 dtarr = DatetimeArray._from_sequence_not_strict(
308 data,
309 dtype=dtype,
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in _from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
324 freq, freq_infer = dtl.maybe_infer_freq(freq)
325
--> 326 subarr, tz, inferred_freq = sequence_to_dt64ns(
327 data,
328 dtype=dtype,
~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
1994 data = getattr(data, "_data", data)
1995 if data.dtype != DT64NS_DTYPE:
-> 1996 data = conversion.ensure_datetime64ns(data)
1997
1998 if tz is not None:
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.ensure_datetime64ns()
pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-05 00:00:00
@aufdenkampe , looks like a problem reading data set number 9906. Somewhere around the 3429th value, the dates read are incorrect, getting values far into the future.
9906.txt
Here's what that data set number looks like when dumped to the seq format.
@PaulDudaRESPEC, thanks for that sequence file for the 9906 timeseries. @ptomasula and I figured some of this out on Friday, and Paul T. found the likely issue in his code. He should be working on the fix today. The sequence file will help confirm the fix.
@PaulDudaRESPEC @aufdenkampe
I added commit ee429de to resolve the issue Anthony reported when running the Calleg test file. Series 9906 has a block that ends on the 511th element of a record and the code was trying to process the 512th element (which was an invalid block control word). We actually had fixed this behavior previously, but it looks like the code might have been lost in refactoring.
After this commit, Test10 and Calleg can both be read by the WDMReader and the model executed. However, when I tested Brandan's RPO_SWMM48LINKS2017_wCBOD_June2019.wdm, the reader encountered a ValueError: Inferred frequency None from passed values does not conform to passed frequency 15T when trying to assign the series.index.freq of the datetime index. The freq attribute is intended to describe the spacing between timesteps in a datetime index, so I believe the ValueError is the result of trying to define a frequency for an irregular timeseries. We also know from previous testing that not having the freq attribute set (i.e. freq=None) causes issues with some of the model modules.
We could use the resample method of pandas to generate a regular timeseries, but I need a better understanding of how HSPF handles irregular timeseries before I'd recommend that solution. I think that HSPF has some different options for filling in gaps in timeseries. Maybe we can have a meeting to discuss how best to handle irregular timeseries?
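A hedged sketch of the resample idea, assuming forward-fill is an acceptable gap-filling rule (whether it is depends on how HSPF expects these series to be treated):

import pandas as pd

irregular = pd.Series(
    [1.0, 2.0, 5.0],
    index=pd.to_datetime(['1976-01-01 00:00', '1976-01-01 00:15', '1976-01-01 01:30']),
)
regular = irregular.resample('15T').ffill()   # regular 15-minute grid, gaps forward-filled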
@ptomasula , I have some availability this week if you'd like to schedule something. Maybe Thursday or Friday morning?
@ptomasula, @PaulDudaRESPEC, @TongZhai and I just decided to complete this issue thread with a Pull Request, even if we still don't have a fix for irregular time series.
I just created Handle irregular time series input #51 to track that work.
Closing this issue as we will merge PR #35 as soon as we resolve a merge conflict.
From Jack in Feb 5 email:
Paul D. thinks: "This vb code seems to handle your complex WDM data correctly, so if we could just translate this vb to python we should be golden."
Lastly, @ptomasula created this EFDC Fortran binary reader in Python, which could be worth following as an example designed to efficiently create Pandas data frames: https://github.com/LimnoTech/LTPy/blob/master/ltpy/ltpy/model_tools/efdc/binary_reader.py