Closed by HenryGeorgist 2 months ago.
@HenryGeorgist @slawler just want to make sure I'm on the same page as you all.
Deliverable: a Docker-ized Python CLI application that, given a hydrograph, computes some basic statistics (avg, max, min, duration max, etc.)
Example output:
```json
{
  "max": 123.4,
  "min": 12.3,
  "avg": 23.4,
  "duration_max": 21.0,
  "duration": "3hr"
}
```
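A minimal sketch of how those statistics might be computed from a list of equally spaced flow values. The function name, parameters, and the rolling-mean definition of `duration_max` are assumptions for illustration, not taken from the actual plugin:

```python
from statistics import mean

def hydrograph_stats(flows, interval_hours=1.0, duration_hours=3.0):
    """Summarize a hydrograph given as equally spaced flow values.

    Here duration_max is taken as the largest average flow over any
    contiguous window of length duration_hours -- one common definition;
    the real plugin may define it differently.
    """
    window = max(1, round(duration_hours / interval_hours))
    rolling = [mean(flows[i:i + window])
               for i in range(len(flows) - window + 1)]
    return {
        "max": max(flows),
        "min": min(flows),
        "avg": mean(flows),
        "duration_max": max(rolling),
        "duration": f"{duration_hours:g}hr",
    }
```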
Is there anything WAT-specific that needs to be addressed?
@thwllms yeah, that's about right. I started a Java example here: https://github.com/HenryGeorgist/JavaPlugin
It is not super helpful at this point; it leaves a lot of details out.
we need to give you a good model payload to start with.
@thwllms do you want to chat today to go over the needs for this plugin? I would happily help out if you have time.
@HenryGeorgist sure, I can chat today. I'm working up something here: https://github.com/water-tech-repos/wat-hydrograph-stats-py. The `hydrograph_stats.py` script does just about everything we discussed above. Need to add a Dockerfile and tests. What do you think?
cool - lets talk. i am free except 11 am est
https://github.com/HenryGeorgist/HydrographScaler/blob/main/docker-compose.yml
this compose file may help in mocking S3 so you can test retrieval of the CSV file from an S3 bucket.
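For reference, a minimal compose service along those lines might look like the following. The service name, ports, and credentials here are placeholders, not taken from the linked repo:

```yaml
version: "3.8"
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin       # placeholder credentials for local testing only
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"   # S3-compatible API endpoint
      - "9001:9001"   # web console
```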
Did you get my Teams meeting invite for 3pm Eastern today?
yes, i declined it. i have a train to catch at 3:30 - can we do a bit earlier?
Oh, I didn't get the declined message. 1:30pm?
modelPayload.yml

```yaml
target_plugin: hydrograph_stats
model_configuration:
  model_name: stats
  model_configuration_paths:
    - /data/hydrographstats/stats.json
model_links:
  linked_inputs:
    - input:
        - name: hydrograph
        - parameter: flow
        - format: .csv
      source:
        - name: /data/hydrographscaler/output/hsm1.csv
        - parameter: flow
        - format: .csv
  required_outputs:
    - name: summaryStatOutput
      parameter: flow
      format: .csv
event_config:
  output_destination: /data/hydrographstats/output/
  realization:
    index: 1
    seed: 1234
  event:
    index: 1
    seed: 5678
  time_window:
    starttime: 2018-01-01T01:01:01.000000001-05:00
    endtime: 2020-12-31T01:01:01.000000001-05:00
```
hsm1.csv
Time,Flow
2018-01-01 01:01:01.000000001 -0500 -0500,6.074918237944704
2018-01-01 02:01:01.000000001 -0500 -0500,6.301664797370323
2018-01-01 03:01:01.000000001 -0500 -0500,6.51896358348654
2018-01-01 04:01:01.000000001 -0500 -0500,6.745710142912159
2018-01-01 05:01:01.000000001 -0500 -0500,6.972456702337778
2018-01-01 06:01:01.000000001 -0500 -0500,7.199203261763397
2018-01-01 07:01:01.000000001 -0500 -0500,7.4259498211890165
2018-01-01 08:01:01.000000001 -0500 -0500,7.6526963806146355
2018-01-01 09:01:01.000000001 -0500 -0500,7.8699951667308525
2018-01-01 10:01:01.000000001 -0500 -0500,8.096741726156472
2018-01-01 11:01:01.000000001 -0500 -0500,8.32348828558209
2018-01-01 12:01:01.000000001 -0500 -0500,8.55023484500771
2018-01-01 13:01:01.000000001 -0500 -0500,8.776981404433329
2018-01-01 14:01:01.000000001 -0500 -0500,8.994280190549546
2018-01-01 15:01:01.000000001 -0500 -0500,9.221026749975165
2018-01-01 16:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 17:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 18:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 19:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 20:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 21:01:01.000000001 -0500 -0500,9.447773309400784
2018-01-01 22:01:01.000000001 -0500 -0500,9.221026749975165
2018-01-01 23:01:01.000000001 -0500 -0500,8.994280190549546
2018-01-02 00:01:01.000000001 -0500 -0500,8.776981404433329
2018-01-02 01:01:01.000000001 -0500 -0500,8.55023484500771
2018-01-02 02:01:01.000000001 -0500 -0500,8.32348828558209
2018-01-02 03:01:01.000000001 -0500 -0500,8.096741726156472
2018-01-02 04:01:01.000000001 -0500 -0500,7.8699951667308525
2018-01-02 05:01:01.000000001 -0500 -0500,7.6526963806146355
2018-01-02 06:01:01.000000001 -0500 -0500,7.4259498211890165
2018-01-02 07:01:01.000000001 -0500 -0500,7.199203261763397
2018-01-02 08:01:01.000000001 -0500 -0500,6.972456702337778
2018-01-02 09:01:01.000000001 -0500 -0500,6.745710142912159
2018-01-02 10:01:01.000000001 -0500 -0500,6.556754676724143
2018-01-02 11:01:01.000000001 -0500 -0500,6.37724698384553
2018-01-02 12:01:01.000000001 -0500 -0500,6.188291517657514
2018-01-02 13:01:01.000000001 -0500 -0500,5.9993360514694976
2018-01-02 14:01:01.000000001 -0500 -0500,5.810380585281482
2018-01-02 15:01:01.000000001 -0500 -0500,5.621425119093466
2018-01-02 16:01:01.000000001 -0500 -0500,5.43246965290545
2018-01-02 17:01:01.000000001 -0500 -0500,5.252961960026837
2018-01-02 18:01:01.000000001 -0500 -0500,5.0640064938388205
2018-01-02 19:01:01.000000001 -0500 -0500,4.875051027650804
2018-01-02 20:01:01.000000001 -0500 -0500,4.686095561462789
2018-01-02 21:01:01.000000001 -0500 -0500,4.497140095274773
2018-01-02 22:01:01.000000001 -0500 -0500,4.3459757223243605
2018-01-02 23:01:01.000000001 -0500 -0500,4.194811349373948
2018-01-03 00:01:01.000000001 -0500 -0500,4.053094749732936
2018-01-03 01:01:01.000000001 -0500 -0500,3.9019303767825235
2018-01-03 02:01:01.000000001 -0500 -0500,3.7507660038321116
2018-01-03 03:01:01.000000001 -0500 -0500,3.5996016308816987
2018-01-03 04:01:01.000000001 -0500 -0500,3.448437257931286
2018-01-03 05:01:01.000000001 -0500 -0500,3.2972728849808735
2018-01-03 06:01:01.000000001 -0500 -0500,3.146108512030461
2018-01-03 07:01:01.000000001 -0500 -0500,2.9949441390800486
2018-01-03 08:01:01.000000001 -0500 -0500,2.8532275394390365
2018-01-03 09:01:01.000000001 -0500 -0500,2.702063166488624
2018-01-03 10:01:01.000000001 -0500 -0500,2.64537652663222
2018-01-03 11:01:01.000000001 -0500 -0500,2.588689886775815
2018-01-03 12:01:01.000000001 -0500 -0500,2.5320032469194103
2018-01-03 13:01:01.000000001 -0500 -0500,2.4753166070630055
2018-01-03 14:01:01.000000001 -0500 -0500,2.4186299672066007
2018-01-03 15:01:01.000000001 -0500 -0500,2.361943327350196
2018-01-03 16:01:01.000000001 -0500 -0500,2.305256687493791
2018-01-03 17:01:01.000000001 -0500 -0500,2.2485700476373864
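As an aside, timestamps like `2018-01-01 01:01:01.000000001 -0500 -0500` carry nanosecond precision and a duplicated UTC offset, neither of which the standard library's `datetime` handles directly (`%f` accepts at most six fractional digits). A hedged sketch of one way to normalize them before parsing; the helper name is made up:

```python
from datetime import datetime

def parse_hsm_timestamp(raw: str) -> datetime:
    """Parse an hsm1.csv-style timestamp, ignoring the duplicated offset
    and truncating nanoseconds to microseconds."""
    # e.g. "2018-01-01 01:01:01.000000001 -0500 -0500"
    date_part, time_part, offset = raw.split()[:3]  # drop the repeated offset
    seconds, frac = time_part.split(".")
    frac = frac[:6]  # datetime supports microsecond precision at most
    return datetime.strptime(f"{date_part} {seconds}.{frac} {offset}",
                             "%Y-%m-%d %H:%M:%S.%f %z")
```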
@thwllms let me know if that helps with the file format and the model payload example
@HenryGeorgist thanks. A few questions:

- In `hsm1.csv`, are the timestamps meant to have two separate offsets (`-0500 -0500`)? Pandas parses the timestamps fine regardless, with offsets of `-05:00`.
- `/data/hydrographstats/stats.json` is meant to contain configuration inputs for the plugin, e.g. duration? Correct?
- `summaryStatOutput` has `format: .csv`. Is that the intended output format?
- Does `time_window` have any effect on this plugin? Should the input hydrographs be filtered by the time window?

@HenryGeorgist I've added integration tests, using Docker to test S3 and Azure Blob Storage connections with `minio` and `azurite`. Made some small tweaks to the WAT payload format above. To run the tests: `./integration-tests.sh` -- I don't think any setup should be required. Let me know what you think.
https://github.com/water-tech-repos/wat-hydrograph-stats-py
PS: please forgive the monolithic `.py` file. Proof-of-concept... right? 😬
i wonder if we should start thinking more like uri instead of url
@HenryGeorgist by that do you mean some service between S3/redis/etc. and the plugin container which provides e.g. hydrographs to the plugin in a standardized way? That is, so plugins wouldn't worry about precisely where input resources are coming from.
well i was thinking that you prepend `s3://` to the path in the AWS config, and `abfs://` to the Azure one - if we separate the storage type from the location, it might benefit us
not that your interpretation of my comment isn't good - just not what I intended to say.
No worries. Is this the sort of thing you mean?
```yaml
target_plugin: hydrograph_stats
model_configuration:
  model_name: stats
  model_configuration_paths:
    - type: s3
      bucket: mybucket
      key: config_aws.yml
model_links:
  linked_inputs:
    - name: hydrograph
      source:
        type: azblob
        container: mycontainer
        blob: hsm1.csv
      parameter: flow
      format: .csv
  required_outputs:
    - name: summaryStatOutput
      parameter: flow
      format: .json
event_config:
  output_destination:
    type: redis
    host: some.redis.host
    key: task12345
  realization:
    index: 1
    seed: 1234
  event:
    index: 1
    seed: 5678
  time_window:
    starttime: 2018-01-01T01:01:01.000000001-05:00
    endtime: 2020-12-31T01:01:01.000000001-05:00
```
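Under a typed-source scheme like that, the plugin could dispatch on the `type` field when resolving an input. A rough sketch, where the field names follow the example payload above but the function itself is hypothetical:

```python
def resolve_source(source: dict) -> str:
    """Build a URI-like locator from a typed source entry in the payload."""
    kind = source["type"]
    if kind == "s3":
        return f"s3://{source['bucket']}/{source['key']}"
    if kind == "azblob":
        return f"abfs://{source['container']}/{source['blob']}"
    if kind == "redis":
        # carry the key in the URI fragment
        return f"redis://{source['host']}#{source['key']}"
    raise ValueError(f"unknown source type: {kind}")
```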
yeah, actually, maybe something like that - as you point out, it would fit for model configurations and for links
https://pkg.go.dev/go.lsp.dev/uri
im thinking about "scheme", "authority", "path", "query", and "fragment" or something like that
```
         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose
```
Gotcha. I know that type of URI is pretty standard for S3/etc. (`s3://bucket/thing.txt`) but I don't know if there's a similar established way of doing that for Redis or SQS? Not that it would be too hard to create something.
Edit: to be clear, there's a URI scheme for Redis databases, but I don't believe it permits referring to a specific key within the database in the same way. https://www.iana.org/assignments/uri-schemes/prov/redis
Edit 2: never mind, I see this can be done for Redis pretty simply with the fragment portion.
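For example, a key could ride along in the fragment of an otherwise standard Redis URI; the host and key below are made up:

```python
from urllib.parse import urlparse

# A standard redis:// URI locating the database, with the specific
# key carried in the fragment portion.
uri = "redis://some.redis.host:6379/0#task12345"
parsed = urlparse(uri)

scheme = parsed.scheme          # "redis"
host = parsed.hostname          # "some.redis.host"
db = parsed.path.lstrip("/")    # "0"
key = parsed.fragment           # "task12345"
```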
Added Redis as a hydrograph source / results sink. https://github.com/water-tech-repos/wat-hydrograph-stats-py/pull/11
Looking into SQS next.
In your switch case you are hardcoding the cases - we may need to push that to env variables ultimately because our mock system schemes may be different (using elastic instead of sqs for instance)
Take a look at the wat-api docker compose - i think i have set up services in a single yml for our testing. sorry i didn't share this earlier: https://github.com/USACE/wat-api/blob/main/docker-compose.yml
@HenryGeorgist meaning something like this for the env variables?
```python
import os
from urllib.parse import urlparse

import fsspec
import requests
from redis import Redis


def get_text(uri: str, fsspec_kwargs: dict = None) -> str:
    """Fetch the text content behind a URI, dispatching on its scheme."""
    uri_parsed = urlparse(uri)
    scheme = uri_parsed.scheme
    if scheme in (os.getenv('URI_SCHEME_REDIS', 'redis'),
                  os.getenv('URI_SCHEME_REDISS', 'rediss')):
        # Redis: the database is in the URI; the key rides in the fragment
        r = Redis.from_url(uri, decode_responses=True)
        key = uri_parsed.fragment
        text = r.get(key)
    elif scheme in (os.getenv('URI_SCHEME_HTTP', 'http'),
                    os.getenv('URI_SCHEME_HTTPS', 'https')):
        text = requests.get(uri).text
    else:
        # Fall back to fsspec for s3://, abfs://, local paths, etc.
        with fsspec.open(uri, 'r', **(fsspec_kwargs or {})) as f:
            text = f.read()
    return str(text)
```
it isn't robust yet, but i was able to do this as a message today:

```yaml
target_plugin: hydrograph_stats
plugin_image_and_tag: tbd/hydrographstats:v0.0.2
model_configuration:
  model_name: hydrograph_stats
  model_configuration_paths:
    - /data/config_aws.yml
model_links:
  linked_inputs:
    - name: hsm.csv
      parameter: flow
      format: csv
      resource_info:
        scheme: s3?
        authority: /data/realization_0/event_1
        fragment: hsm.csv
  required_outputs:
    - name: results-wat.json
      parameter: scalar
      format: json
event_config:
  output_destination: /data/realization_0/event_8
  realization:
    index: 0
    seed: 4494286321627776427
  event:
    index: 8
    seed: 3276075611334443242
  time_window:
    starttime: 2018-01-01T01:01:01.000000001Z
    endtime: 2020-12-31T01:01:01.000000001Z
```
shoot - it looks like my output destination and my input authority are not in sync yet. i will get that fixed.
fixed it...

```yaml
target_plugin: hydrograph_stats
plugin_image_and_tag: tbd/hydrographstats:v0.0.2
model_configuration:
  model_name: hydrograph_stats
  model_configuration_paths:
    - /data/config_aws.yml
model_links:
  linked_inputs:
    - name: hsm.csv
      parameter: flow
      format: csv
      resource_info:
        scheme: how do i figure this out
        authority: /data/realization_0/event_5
        fragment: hsm1.csv
  required_outputs:
    - name: results-wat.json
      parameter: scalar
      format: json
event_config:
  output_destination: /data/realization_0/event_5
  realization:
    index: 0
    seed: 4494286321627776427
  event:
    index: 5
    seed: 2830258753914485572
  time_window:
    starttime: 2018-01-01T01:01:01.000000001Z
    endtime: 2020-12-31T01:01:01.000000001Z
```
i figured out a way to pass the fs config all the way down... not feeling awesome about it... but it works

```yaml
target_plugin: hydrograph_stats
plugin_image_and_tag: tbd/hydrographstats:v0.0.2
model_configuration:
  model_name: hydrograph_stats
  model_configuration_paths:
    - /data/config_aws.yml
model_links:
  linked_inputs:
    - name: hsm.csv
      parameter: flow
      format: csv
      resource_info:
        scheme: minio:9000/configs
        authority: /data/realization_0/event_7
        fragment: hsm1.csv
  required_outputs:
    - name: results-wat.json
      parameter: scalar
      format: json
event_config:
  output_destination: /data/realization_0/event_7
  realization:
    index: 0
    seed: 4494286321627776427
  event:
    index: 7
    seed: 5559254042425429666
  time_window:
    starttime: 2018-01-01T01:01:01.000000001Z
    endtime: 2020-12-31T01:01:01.000000001Z
```
@HenryGeorgist are `/data/config_aws.yml` and `/data/realization_0/...` meant to be in an S3 bucket called `data`?
And per your email last week, output should be stored in a Redis key named like this? `tbd/hydrographstats:v0.0.2_wat-payload.yml_R0_E7` (?)
`data/` is actually more like a postfix. the bucket should come in on the environment variables (i think my examples actually use `/configs` as the bucket name).
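If the Redis status key does follow the pattern in the example above, building it from payload fields might look like this. The pattern itself is inferred from the single example key, so treat both the function and the format as assumptions:

```python
def status_key(image_and_tag: str, payload_name: str,
               realization_index: int, event_index: int) -> str:
    """Build a status key like 'tbd/hydrographstats:v0.0.2_wat-payload.yml_R0_E7'.

    The underscore-joined pattern is inferred from one example and may
    not match the actual convention.
    """
    return f"{image_and_tag}_{payload_name}_R{realization_index}_E{event_index}"
```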
We are iterating on how to manage task execution with a plugin container, and it seems we are now migrating away from Lambda and towards Batch. With Batch we get some status reporting natively on the batch job, which may make the Redis status cache less valuable (i am not certain though).
@HenryGeorgist I've made some updates to handle the YAML spec you posted above and to write status to a Redis key. Check out this integration test: https://github.com/water-tech-repos/wat-hydrograph-stats-py/blob/main/tests/new_aws_integration_test.py#L119
A little messier than I'd like, but hopefully this can integrate with what you've written so far. Let me know what you think.
Just took a look at this test and noticed the floating-point comparison, which has bitten me in the past. @thwllms you might consider the `pytest.approx` function just to be extra safe.
@slawler thanks for pointing that out. Updated the tests to use `pytest.approx`.
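For anyone following along, the underlying issue is ordinary binary floating-point rounding. The standard library's `math.isclose` demonstrates it; in pytest tests, `pytest.approx` plays the same role:

```python
import math

# Classic floating-point surprise: 0.1 + 0.2 is not exactly 0.3.
total = 0.1 + 0.2
exact_equal = (total == 0.3)             # False: total is 0.30000000000000004
close_enough = math.isclose(total, 0.3)  # True: compares within a relative tolerance
```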
The Python plugin should be a simple calculation that summarizes a hydrograph into a series of optional statistics. The optional statistics are:
This link gives some context on how we currently compute the duration maximum from a DSS file, for reference only: Duration Maximum
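As a hedged sketch, one common reading of a "duration maximum" is the largest average value over any contiguous window of a given duration. The source doesn't spell out the exact definition used for DSS records, so the function below is illustrative only:

```python
def duration_max(flows, window):
    """Largest mean flow over any contiguous run of `window` samples.

    `flows` is a list of equally spaced values; `window` is the duration
    expressed as a number of samples. This is one common definition of a
    duration maximum, not necessarily the one used for DSS files.
    """
    if window > len(flows):
        raise ValueError("window longer than the record")
    # Sliding-window sum to avoid recomputing each window mean from scratch.
    current = sum(flows[:window])
    best = current
    for i in range(window, len(flows)):
        current += flows[i] - flows[i - window]
        best = max(best, current)
    return best / window
```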