There are two possible sources for the data. We can grab the data from the production DWR machine, mrsbLAPP20662.pr.water.ca.gov, or we can grab it from the UCD CIMIS server using its existing rsync service. Either way, we do not have direct access to any of these machines. We can, however, set up secure tunnels through the old receiver machine. The UCD CIMIS choice is slightly safer, since we move the processing burden to the Davis machines, but the data itself will not be exactly what the DWR customers received originally.
FileZilla
DWR suggested using intermediate storage on the U: drive and something like FileZilla to copy the data as an intermediate step. That is too long and arduous a task, so I won't do that.
DWR ssh rsync
An alternative is to create an SSH tunnel from the test machine to the production machine. We can do that in two ways: through the Windows server using PuTTY, or via the old receiver. The mechanism is basically the same in both cases, apart from how the tunnel is set up.
Windows PuTTY tunnel.
For this, you set up a remote tunnel initiated from PuTTY on the Windows machine. In addition to the connection itself, you create a remote tunnel where port 2222 (on the test/dev machine) is forwarded to the production SSH port (22). The PuTTY setup looks something like this:
Where you connect to the development server,
And the tunnel is set up so that the remote port (a local port on the development server) forwards to the production SSH port:
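If the screenshots are unavailable, a rough command-line equivalent using plink (PuTTY's command-line client) is sketched below. The dev hostname dwrnprhapp0075 is borrowed from the prompt in the CSTARS example further down, and the qhart account is an assumption:
# From the Windows machine: connect to the development server and open a
# remote forward so dev port 2222 reaches the production SSH port (22).
plink -ssh -R 2222:mrsbLAPP20662.pr.water.ca.gov:22 qhart@dwrnprhapp0075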
It's probably safe to use this every time you connect to the development machine; that is, you can keep these tunnels as part of your standard development login.
Then, when you want to use this tunnel to rsync from the production server back to the development server, you use something like:
rsync -a -v --exclude=.tmp --exclude=.bash_history --rsh='ssh -p 2222' qhart@localhost:/apps/cimis/gdb/cimis/2018-??-?? ~cimis/gdb15/cimis/
Where --rsh lets you point ssh at the special local port. This works fine.
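Before kicking off a large transfer, it may be worth verifying that the tunnel is actually up; a minimal sketch, assuming the same qhart account:
# Should print the production hostname if the remote forward is active.
ssh -p 2222 qhart@localhost hostname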
Tunnel through receiver
The methodology is basically the same as with the Windows PuTTY tunnel; the big difference is that we open the connection to the production SSH server from the dev/test server. So in this case we run something like the following.
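The exact command is not recorded here; this is a sketch, assuming the old receiver is the same mbrylapp20664.water.ca.gov host used as the hop in the CSTARS tunnel below:
# Local forward: dev port 2222 -> production SSH, via the old receiver.
ssh -L 2222:mrsbLAPP20662.pr.water.ca.gov:22 qhart@mbrylapp20664.water.ca.gov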
Now we have SSH access to the production machine on localhost:2222. We can use this connection for the same trick and download the data from the production server, for example with the same setup as above:
rsync -a -v --rsh='ssh -p 2222' --exclude=.bash_history qhart@localhost:/home/cimis/gdb/cimis/PERMANENT .
CSTARS rsync
The process here is to open an SSH tunnel as root to the CSTARS processor's rsync port; from there we can (as the CIMIS user) make requests to the CIMIS server for the required data. Some initial tests (see below) indicate that a month of data takes at most about 9 minutes to transfer, so the full archive works out to roughly 1728 minutes of transfer, or 1728/60 ≈ 28.8 hrs, about a day of running.
Methodology
To create the tunnel:
# As root
[root@dwrnprhapp0075 ~]# ssh -L 873:cimis.cstars.ucdavis.edu:873 qhart@mbrylapp20664.water.ca.gov
Now you have access from the development machine to the CSTARS rsync service.
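As a quick sanity check, the rsync daemon can be asked to list its modules through the tunnel (assuming listing is not disabled on the CSTARS side); the pro module used below should appear:
# Lists the modules exposed by the CSTARS rsync daemon.
rsync rsync://localhost/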
# This is as the CIMIS user
>[[ -d ~/gdb15-cstars ]] || mkdir ~/gdb15-cstars
>cd ~/gdb15-cstars
>time rsync -a -v rsync://localhost/pro/2019-01-?? .
sent 130,690 bytes received 9,929,625,447 bytes 17,497,367.64 bytes/sec
total size is 10,616,592,951 speedup is 1.07
real 9m26.987s
user 0m45.058s
sys 0m43.287s
Now, the data on the CSTARS server is bigger, but there are fewer files. So we will run the same process with months from the year before, in both January and July:
# This is as the CIMIS user
>time rsync -a -v rsync://localhost/pro/2018-01-?? .
sent 466,721 bytes received 3,445,977,406 bytes 7,248,042.33 bytes/sec
total size is 3,444,105,912 speedup is 1.00
real 7m55.431s
user 0m12.471s
sys 0m20.158s
>time rsync -a -v rsync://localhost/pro/2018-07-?? .
sent 585,604 bytes received 4,550,807,321 bytes 7,867,576.36 bytes/sec
total size is 4,548,437,522 speedup is 1.00
real 9m37.303s
user 0m16.674s
sys 0m26.549s
Expected Archive size
I didn't do a super complete test, but these numbers should be pretty accurate. Looking at the raw images from the GOES-15 setup, we use about 115MB/day across the GOES-15 data days. That means about 3.5GB/mo, 42GB/yr, or 840GB for the entire historical archive. Much of this data, however, is not required for the WMS.CGI, and it can be regenerated if needed. For the GOES-15 data we can remove the redundant rasters:
# Run inside GRASS with the current directory set to the location,
# so the shell glob iterates over the date-named mapsets.
for ms in 20[01]?-??-??; do
  g.mapset mapset=$ms
  # Quote the patterns so the shell doesn't expand them before g.remove sees them.
  g.remove -f type=raster pattern='*_9'
  g.remove -f type=raster pattern='[pnk][01]???'
  g.remove -f type=raster pattern='day_*'
  g.remove -f type=raster pattern='z_day_*'
  g.remove -f type=raster name=ea_dewp_ns,Trb,Trd,B,Bc,Bk,D,Dc,Dk
done
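To verify the cleanup, a quick per-mapset size check from the same location directory; the ~35MB/day figure in the next paragraph is the target:
# Disk use per date mapset after the cleanup.
du -sh 20[01]?-??-??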
When we do this step, we drop the size of the GOES-15 data down to about 35MB/day, or 1.1GB/mo, 13GB/yr, and 205GB for the entire historical archive. The data will be less recoverable from this reduced dataset, but nothing should be lost that can't be recomputed.
Dev / Test / Prod archive sizes.
The Dev machine has about 270GB of space, so if we can use 20% of it for testing, we have about 50GB for GOES-15 data, or about 4 yrs' worth. That should be plenty.
The test machine has about the same amount of space, so we cannot use it to recover all the data for the production machine. Too bad.