NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want the example scripts (bash and python) to demonstrate use of the data-direct/streaming capability #274

Open epag opened 4 weeks ago

epag commented 4 weeks ago

Author Name: Jesse (Jesse) Original Redmine Issue: 95993, https://vlab.noaa.gov/redmine/issues/95993 Original Date: 2021-09-08 Original Assignee: Hank


None

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-09T11:36:35Z


I'm going to try to work up an example this morning, since I've already completed the meeting agenda.

I'm going to work from @/home/ISED/wres/wresTestData/issue95993@ to develop and test the script, starting from the example that is in the repository. I'll then put my changes back into the repo once I'm done, but will leave the directory in place as an area to test it.

I'm also going to modify the script to use the proxy URL, since that is where we are going to direct our users, eventually.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-09T11:55:36Z


I note that I am going for this from #94510-148:

To be more specific, maybe the two example scripts wres_http_example.sh and wres_http_example.py should have the capability of both project-from-memory and project-from-file as well as data-generated-in-memory and data-from-file instead of having two or three separate bash scripts that aren't represented in python scripts.

If implemented in a single script, this will complicate it significantly. Specifically, the .sh will need to include an if-clause that either loads the declaration from an internal statement or from a file. But more, the declaration will necessarily be different between the data-generated-in-memory (which I take to mean data posted directly) and data-from-file. So there are four total possibilities for specifying the declaration:

  1. Declaration in script, data from file (current example)
  2. Declaration in script, data posted directly (declaration changes to remove sources)
  3. Declaration in file, data from file (read the appropriate .xml file co-located with the script)
  4. Declaration in file, data posted directly (read a different .xml co-located with the script)

I'm going to start with 1 and 2 and see how convoluted it becomes to put two declarations in the XML.
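The four possibilities listed above could be selected with two toggles, roughly as in the following sketch. All names here are assumptions made for illustration (`DECLARATION_FROM_FILE`, the file names, and the placeholder declaration text); only `POST_DATA_DIRECTLY` echoes a variable mentioned later in this ticket, and the real WRES declarations are not reproduced.

```bash
#!/bin/bash
# Hypothetical toggles; DECLARATION_FROM_FILE is an assumed name.
DECLARATION_FROM_FILE="false"
POST_DATA_DIRECTLY="true"

# Placeholder text only. The real project declarations are XML documents
# that do not appear in this ticket.
declaration_with_sources='<!-- placeholder: declaration that lists data sources -->'
declaration_without_sources='<!-- placeholder: declaration with sources removed -->'

if [ "$DECLARATION_FROM_FILE" = "true" ]
then
    # Options 3 and 4: read a file co-located with the script
    # (both file names are hypothetical).
    if [ "$POST_DATA_DIRECTLY" = "true" ]
    then
        declaration_file="declaration_direct_data.xml"
    else
        declaration_file="declaration_file_data.xml"
    fi
    project_declaration=$(cat "$declaration_file")
else
    # Options 1 and 2: the declaration is embedded in the script.
    if [ "$POST_DATA_DIRECTLY" = "true" ]
    then
        project_declaration="$declaration_without_sources"
    else
        project_declaration="$declaration_with_sources"
    fi
fi

echo "$project_declaration"
```

With the defaults shown, the script takes the option 2 branch and never touches a file.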

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-09T19:13:45Z


Jesse,

When you have a few minutes, can you take a look at this script and give me your thoughts? I'm attempting to accomplish these two in one script:

  1. Declaration in script, data from file (current example)
  2. Declaration in script, data posted directly (declaration changes to remove sources)

The script is long, but that's primarily because of the verbose descriptions of each step. The script works when tested and it uses the proxy URL. Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-09T19:14:54Z


Oops... failed to provide the location of the script:

@/home/ISED/wres/wresTestData/issue95993/wres_http_example.sh@

To be clear, this is not a high priority, so no rush.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-09-09T19:25:49Z


Taking a look

epag commented 4 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-09-09T19:41:09Z


For the multiple "right" dataset posts, I think it could be a list and iterated over.

I would pick the same order for the conditionals regarding "posting data", e.g. always have @if [ $POST_DATA_DIRECTLY = "true" ]@ first and the @else@ second.

What follows is a bit of a bigger departure, but I would eventually like to see the data in a variable or generated by the calling script on the fly to emphasize that files are not needed. This would be easier in python than bash. Maybe heredocs would be fine for that too.

Also a bigger departure would be to have the conditional "use files" versus "use direct data", such that you could see the example of using the heredoc directly versus having it read the data, for both the project declaration and data files.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-09T19:56:14Z


Jesse wrote:

For the multiple "right" dataset posts, I think it could be a list and iterated over.

I was wondering about that and whether it's clearer to post them explicitly one at a time. Small change, regardless.

I would pick the same order for the conditionals regarding "posting data", e.g. always have @if [ $POST_DATA_DIRECTLY = "true" ]@ first and the @else@ second.

Good point. Thanks for catching.

What follows is a bit of a bigger departure, but I would eventually like to see the data in a variable or generated by the calling script on the fly to emphasize that files are not needed. This would be easier in python than bash. Maybe heredocs would be fine for that too.

Again, good point. When I move on to Python (which will be an adventure given my limited experience), I'll see if I can work up that example. I can also add a comment in the script mentioning that the data need not be posted from a file.

Also a bigger departure would be to have the conditional "use files" versus "use direct data", such that you could see the example of using the heredoc directly versus having it read the data, for both the project declaration and data files.

To make sure I'm understanding what you mean by heredoc in this context, is the script treatment of the declaration, where it is embedded in the script, considered a heredoc? From what I'm seeing when Googling the terminology, it is, but I'm checking to make sure we have the same understanding. Anyway, including the data itself in the script would make it even longer, which might be fine given it's already pretty long.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T11:54:44Z


Hank wrote:

Jesse wrote:

For the multiple "right" dataset posts, I think it could be a list and iterated over.

I was wondering about that and whether it's clearer to post them explicitly one at a time. Small change, regardless.

I would pick the same order for the conditionals regarding "posting data", e.g. always have @if [ $POST_DATA_DIRECTLY = "true" ]@ first and the @else@ second.

Good point. Thanks for catching.

Changes have been made to the script to address the above.

As for the other comments, I'm waiting to make sure I understand what is meant by heredocs in this context. I believe you are saying you want the content of the data files included directly in the script and referenced via a variable, following how the declaration is currently handled. This will make the script longer and uglier, but it will also make it completely self-contained and not tied to external files (hopefully making it clear that data can be posted directly to the service without first creating it as a local file).

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-09-10T12:05:39Z


I am pretty sure that is the intention, yes. I would just make the example use a tiny amount of data, say 2 pairs, or break the data generation into a separate data generating function - it could be fake data.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T12:35:23Z


It currently is fake data. The files are ones we use in system testing:

1985043012_DRRC2FAKE1_forecast.xml 1985043013_DRRC2FAKE1_forecast.xml 1985043014_DRRC2FAKE1_forecast.xml DRRC2QINE_FAKE_19850430.xml

I'd like to continue the theme of using data that is vetted, but that's 277 lines of data. I could shrink it, I guess.

Another possibility would be to include the file contents at the bottom of the script, in an appendix so to speak, so it's not in the way. But I'm not sure if @bash@ allows for that. I'll look it up.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-09-10T13:03:00Z


I see, then I would probably just create a data generating function that creates that data, else choose a different example (agree that it is nice to re-use data, though). Anyway, I probably wouldn't inline all that crap into the function that does the api interaction. I don't think there's any way to return a string from a bash function, only an integer, so that creates some ugliness with setting a globally-scoped variable or something. Will be much cleaner in python.
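On the point about returning a string: the `return` status of a bash function is indeed limited to an integer, but a function can print text that the caller captures with command substitution, which avoids a globally-scoped variable. A minimal sketch, assuming a hypothetical `generate_observations` function and stand-in data rather than the real test files:

```bash
#!/bin/bash
# Hypothetical generator: prints a tiny stand-in document on standard output.
# The element names are placeholders, not the real time-series format.
generate_observations() {
    cat <<'EOF'
<placeholder>
    <value>1.0</value>
    <value>2.0</value>
</placeholder>
EOF
}

# Command substitution captures everything the function printed.
# Trailing newlines are stripped by the shell.
observation_data=$(generate_observations)

echo "${observation_data}"
```

The function body runs in a subshell here, so any variables it sets are not visible to the caller; only its printed output comes back.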

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T13:26:32Z


James wrote:

I don't think there's any way to return a string from a bash function, only an integer, so that creates some ugliness with setting a globally-scoped variable or something.

I would just call a function to set a variable, just as is done with the current declaration, and then refer to that variable in the @curl@ calls. I have no problem with that approach if I can get the data out of the way, pushed to the bottom of the script. However, I think scripts are processed sequentially, top-to-bottom, so that isn't possible. Again, I'll do some internet searching to confirm, just haven't had time yet.

Will be much cleaner in python.

Agreed.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-09-10T15:15:15Z


Hmmm, perhaps I am not following, but don't you just want a composition of functions? I don't see where ordering comes in, providing you call the composition after it is defined.

@script.sh@

#!/bin/bash

main() {
    echo "main called"
    bar
    foo
}

foo() {
    echo "foo called"
}

bar() {
    echo "bar called"
}

main

I mean, you couldn't put the last @main@ upfront.

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-09-10T15:15:50Z


I would expect the above to produce:

main called
bar called
foo called

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T15:34:20Z


providing you call the composition after it is defined.

That's the trick. If my goal is to move the variables holding the data out of the way, then I need those variables defined in a function at the bottom of the script. I would then refer to that function at the top of the script. So, indeed, I would need to put the main upfront or my goal (of pushing the clutter to the end) is not achieved.

I'm just worried a viewer of the example will open it, see the clutter of tons of data, be annoyed, not want to scroll down to find the start of the actual example, and then close it. Perhaps that worry is unfounded.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T15:36:11Z


Oh, I see what you are saying. Put the interesting stuff in the main, then define the data, then call main at the bottom.

Got it,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T15:36:55Z


Let me edit the script to follow that design and include the data as a heredoc.
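A minimal sketch of that layout, assuming a hypothetical `load_observation_data` function and stand-in data (this is not the script that was eventually committed): the interesting steps come first in `main`, the bulky heredoc sits below it, and the call to `main` is the last line.

```bash
#!/bin/bash
main() {
    # The steps a reader cares about live here, at the top of the file.
    load_observation_data
    echo "Would post ${#observation_data} characters of data"
}

# Bulky data sits below main. This sets the global variable observation_data.
load_observation_data() {
    # read -d '' consumes the whole heredoc and returns non-zero at
    # end of input, hence the "|| true".
    IFS= read -r -d '' observation_data <<'EOF' || true
<placeholder>
    <value>1.0</value>
</placeholder>
EOF
}

# Kick off the sequence only after every function has been defined.
main
```

Quoting the heredoc delimiter (`'EOF'`) keeps the shell from expanding `$` or backticks inside the data.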

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-09-10T15:51:32Z


Right, the last call that kicks off the sequence is just a detail, not really important for understanding, although you could make a note upfront, if that helps. The body of work is inside each function. With suitably named functions that describe what they are doing, I think it would work.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T17:20:20Z


Having a hard time structuring the @curl@ command.

Before, this command would post the contents of the referenced file:

post_result=$(curl -i --cacert $wres_ca_file -F data=@DRRC2QINE_FAKE_19850430.xml $job_location/input/left | tr -d '\r')

Great. Now I'm trying to post the content of a variable. I defined the variable @observation_data@ and tried this:

post_result=$(curl -i --cacert $wres_ca_file -F data="${observation_data}" $job_location/input/left | tr -d '\r')

I know the contents of the file are within the variable, because I see the following message (note the XML snippet, which is the beginning of the content of the file):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (26) couldn't open file "?xml version="1.0" encoding="UTF-8"?>
<TimeSeries xmlns="http://www.wldelft.nl/fews/PI" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.wldelft.nl/fews/PI http://fews.wldelft.nl/schemas/version1.0/p

I'm not sure why it's trying to open it, since I don't use '@'.

I need to figure out how to post the contents of the variable to the COWRES. I've already done quite a bit of internet searching, but will continue. If anyone spots the problem, let me know.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T17:35:38Z


Might be a single-quote/double-quote thing.

Using double quotes, the @curl@ command tried to process the content of ${observation_data}, which starts with the opening XML '<'. With @-F@, a leading '<' tells @curl@ to load the field content from a file; hence the reported file-not-found error.

When I switch to single-quotes, a different error occurs that I am now investigating.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T17:38:36Z


Ah, with single-quotes, nothing is evaluated, including the content of the variable. The file, production side, looks like this:

::::::::::::::
2009077045053624378_1361704173823595342
::::::::::::::
${observation_data}

So how to get it to process ${observation_data} without @curl@ then attempting to interpret the contents of the variable? Hmmm...

Gotta love bash!

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T17:47:20Z


If I include a space before the content of the variable,

post_result=$(curl -i --cacert $wres_ca_file -F data=" ${observation_data}" $job_location/input/left | tr -d '\r')

the data is posted correctly, but that space causes WRES to not recognize it:

2021-09-10T17:48:11.345+0000 WARN DataSource Found text/plain document but it did not appear to be NWS datacard: 'file:///mnt/wres_share/input_data/346543667936158976_11648478576821920987'

So I need the first character in the file to be '<' but in such a way that @curl@ doesn't attempt to interpret the '<'.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-10T17:52:11Z


Found it. Sheesh.

Had to change @-F@ to @--form-string@:

@post_result=$(curl -i --cacert $wres_ca_file --form-string data="${observation_data}" $job_location/input/left | tr -d '\r')@

Use of --form-string prevents curl from parsing the contents.
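To spell out the difference: with `-F`, curl gives special meaning to a value that begins with `@` (attach the named file) or `<` (read the field content from the named file), whereas `--form-string` uses the value literally. The sketch below only assembles and prints the command rather than running it, since no service is assumed to be reachable; the host, job path, and certificate file name are hypothetical stand-ins.

```bash
#!/bin/bash
# Stand-in values; the host, job path, and certificate name are hypothetical.
job_location="https://localhost/job/12345"
wres_ca_file="cacerts/example_ca.pem"
observation_data='<?xml version="1.0" encoding="UTF-8"?><placeholder/>'

# Build the arguments in an array so the data stays a single argument,
# even though it contains spaces and quotes.
curl_args=( -i --cacert "$wres_ca_file"
            --form-string "data=${observation_data}"
            "${job_location}/input/left" )

# Print the command instead of running it (dry run).
printf 'curl'
printf ' %q' "${curl_args[@]}"
printf '\n'

# To run it for real:
# post_result=$(curl "${curl_args[@]}" | tr -d '\r')
```

Quoting `$wres_ca_file` and the URL also guards against paths that contain spaces, which the unquoted form in the commands above would split.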

I'm going to clean it up a bit and then wait for Jesse to review. I'm not sure I should post it to this ticket given some host information embedded in it. Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-13T15:06:15Z


Jesse:

Can you please review the example, again?

/home/ISED/wres/wresTestData/issue95993/wres_http_example.sh

I believe it satisfies what you recommended here:

Jesse wrote:

For the multiple "right" dataset posts, I think it could be a list and iterated over.

I would pick the same order for the conditionals regarding "posting data", e.g. always have @if [ $POST_DATA_DIRECTLY = "true" ]@ first and the @else@ second.

What follows is a bit of a bigger departure, but I would eventually like to see the data in a variable or generated by the calling script on the fly to emphasize that files are not needed. This would be easier in python than bash. Maybe heredocs would be fine for that too.

However, it does not satisfy this and I don't know that I want it to:

Also a bigger departure would be to have the conditional "use files" versus "use direct data", such that you could see the example of using the heredoc directly versus having it read the data, for both the project declaration and data files.

I like having the example script be completely self-contained. I know I originally went with files, but that was because I wasn't comfortable including the data in the script. Now that I've included it at the end of the script, not the beginning, I'm more comfortable. I mention that files can be referred to, instead, but I don't think having it run from files is necessary.

Thoughts?

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-09-13T16:00:50Z


Thanks for helping with this example, Hank. I am taking another look.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-09-13T16:24:57Z


I think the current leading comment could be removed or re-worked, and the original leading comment promoted to the top. The contents of the main function are important to the example so to say otherwise is confusing.

Indentation.

The @for@ loop doesn't need a counter, it can be @for timeseries in ${array}@ or whatever, then refer to @${timeseries}@ in the body.
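One detail when writing that loop: a bare `${array}` expands to the first element only, so iterating over every element needs `"${array[@]}"`. A minimal sketch with a hypothetical array name and stand-in payloads:

```bash
#!/bin/bash
# Hypothetical list of stand-in payloads for the "right" dataset.
right_timeseries=( '<forecast id="1"/>' '<forecast id="2"/>' '<forecast id="3"/>' )

# No index is needed: each element is bound to timeseries in turn.
# The quotes keep an element with spaces in it as one item.
for timeseries in "${right_timeseries[@]}"
do
    # A real script would post ${timeseries} with curl here.
    echo "Would post: ${timeseries}"
done
```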

I agree that it's become pretty long. And I agree when things become long then it is nicer to split them up into functions. Taking that to the logical conclusion, there are probably other blocks that can be split out into functions, such that the steps have names, and then the function calls at the bottom are an outline of the steps in order. I don't know if that's necessary though.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-13T17:05:18Z


I'll make the comment change and check out options for structuring the for loop, though I don't think avoiding indexing is critical, as long as it works, is documented, and is somewhat understandable given it's in @bash@.

I assume the indentation comment is a reference to the stuff in main not being indented an additional level. I still view main() as just noise, to be frank, and something I would rather not have done to avoid confusing the reader. That's why I didn't indent: I was hoping to deemphasize it. I'll go ahead and add it; just not sure it adds value for most readers.

I'd rather not break everything into functions. Just enough to move the data to the end is good enough, imho.

Thanks for looking it over! Getting closer,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-13T17:58:17Z


Changes made. Can't test it due to #96161.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-14T19:20:39Z


Almost forgot about this...

Proposed final version of the example script can be found here:

@/home/ISED/wres/wresTestData/issue95993/wres_http_example.sh@

If there are no objections, I'll push it to the repo tomorrow. I'll also use this as the example script that the RFCs use to confirm access. Note that it's written to use the proxy and a .pem file located in a cacerts directory (as is the case in the repo, I believe). I'll remove the cacerts directory when I share it with the field, and just give them instructions to have the .pem file located in the same directory as the script. Something like that.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-15T11:45:26Z


Pushed in commit:28cc52f7165131d9fcb94ef7fbf6a201afd8e1ac. Checking the box. Leaving the ticket to the Backlog since there is still a Python script to update. I might take this as an opportunity to learn some Python, but not until after the training, I'm guessing, when I'll have some time to play around with it.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-15T12:02:52Z


The wiki was updated to include a link to the bash example:

https://vlab.noaa.gov/redmine/projects/wres-user-support/wiki/Posting_timeseries_data_directly_to_COWRES_as_inputs_for_a_WRES_job

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-09-16T16:50:38Z


Hank:

Change the default hostname in the example to localhost.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-09-16T16:52:50Z


Another thing I remembered is maybe it should have only the csv2 output selected. Then the curl command can be piped to gunzip or something to display results.