geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

check_services.py for run_operations #425

Open falkamelung opened 3 years ago

falkamelung commented 3 years ago

One reason processing fails is that one of the data servers or $WORK is offline or slow. We need a script that checks the different services we use. The workflow should run this script before a particular service is used and, if the service is down, try again after 1 or 5 hours. It could also be run daily from cron and send an email if a service is down.

We could use the timeout command. For each service we could run the check three times, e.g. timeout 0.1 cmd, timeout 1 cmd and timeout 5 cmd; echo $?, and report ONLINE, SLOW or OFFLINE, respectively, depending on which of them succeeds. Is the timeout command appropriate for everything, or is there something better?
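
A minimal sketch of that idea, assuming a shell function that takes the check command as a string (the function name and thresholds are illustrative, not existing code; the 1-second tier could be added the same way):

    # classify_service: hedged sketch of the three-tier timeout idea; not tested against the real services
    classify_service() {
        local cmd="$1"
        if timeout 0.1 bash -c "$cmd" > /dev/null 2>&1; then
            echo ONLINE;  return 0
        elif timeout 5 bash -c "$cmd" > /dev/null 2>&1; then
            echo SLOW;    return 1
        else
            echo OFFLINE; return 2
        fi
    }

    # example:
    classify_service 'ls $WORK'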

check_services.py --all    [default]   (for the download checks, the default should be --downloadASF)

--demServer:  https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11
     example:      dem.py -a stitch --filling --filling_value 0 -b 27 34 99 111 -c -u https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/
curl -n  -L -c $HOME/.earthdatacookie -b $HOME/.earthdatacookie -k -f -O https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/S01W092.SRTMGL1.hgt.zip
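
A lighter-weight check (an assumption, not verified against the USGS server) would be to request only the HTTP headers of the directory instead of pulling a full .hgt.zip:

    # HEAD request with a short timeout; prints 0 if the server answered, 124 if timeout killed curl,
    # and a curl error code otherwise. The cookie/netrc options mirror the download command above.
    timeout 5 curl -n -I -L -s -f -k -b $HOME/.earthdatacookie -c $HOME/.earthdatacookie \
        https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/ > /dev/null ; echo $?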

--downloadASF
      (data download from the ASF)
      example:  ssara_federated_query.py --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=128 --intersectsWith='Polygon((-91.40 -1.00, -91.40 -0.60, -90.86 -0.60, -90.86 -1.00, -91.40 -1.00))' --start=2016-06-01 --end=2016-08-31 --print
It uses https://web-services.unavco.org for getting the listing, and
https://datapool.asf.alaska.edu for the download (using wget, I believe).
Sometimes the authorization service is down:
https://urs.earthdata.nasa.gov/oauth/authorize?client_id=BO_n7nTIlMljdvU6kRRB3g&response_type=code&redirect_uri=https://dy4owt9f80bz7.cloudfront.net/login
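
As a sketch, the three URLs above could all be probed the same way (here any HTTP response, even a 404, counts as "reachable"; checking for a specific '200 OK' would require a more specific URL):

    # loop over the ASF-related servers and report whether they answer within 5 s (illustrative threshold)
    for url in https://web-services.unavco.org \
               https://datapool.asf.alaska.edu \
               https://urs.earthdata.nasa.gov ; do
        if timeout 5 curl -s -I -L -o /dev/null "$url"; then
            echo "$url ONLINE"
        else
            echo "$url OFFLINE or SLOW"
        fi
    done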

--jetstreamServer
        (data upload to centos@129.114.104.223, use environment variable $REMOTE_SERVER).  
        (the upload is done using scp.  We could test whether a remote shell command is successful, e.g.
      ssh $REMOTE_SERVER ls .bashrc >> /dev/null; echo $?           )

--workDir
     --> timeout 2 ls  $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
returns 0, i.e. it is fine. In contrast:
timeout 0.001 ls  $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
returns 124

--queue
        check whether the queue $QUEUENAME is online and ready for processing
       I think sinfo -p skx-normal works, but I am not sure.
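
A hedged sketch of the queue check (whether matching on the 'up' state is sufficient should be verified on the system):

    # -h drops the header, -p selects the partition, -o "%a" prints only its availability (up/down)
    state=$(timeout 10 sinfo -h -p "$QUEUENAME" -o "%a")
    if [ "$state" = "up" ]; then
        echo "queue $QUEUENAME ONLINE"
    else
        echo "queue $QUEUENAME OFFLINE (state: ${state:-not found})"
    fi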

Maybe:

--insarmaps
        check whether insarmaps.miami.edu is available for upload. The upload is done using
https://github.com/geodesymiami/insarmaps_scripts/blob/master/json_mbtiles2insarmaps.py, which calls the ogr2ogr command.
I'll send you json_mbtiles2insarmaps.py separately.
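
As a first pass (a sketch; whether a plain HTTPS response is enough to guarantee the upload will work is an assumption):

    # check whether the insarmaps host answers at all before attempting the upload
    if timeout 5 curl -s -I -L -o /dev/null https://insarmaps.miami.edu; then
        echo "insarmaps ONLINE"
    else
        echo "insarmaps OFFLINE or SLOW"
    fi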

Later:

check_services.py --downloadGEP
        (data download from ESA's GEP. @mirzaees , can you add the server?  )

check_services.py --ECMWF
     (service providing WEATHER MODELS)
     (I don't know at the moment; let's do this last)

I will add this to every step in minsarApp.bash, i.e. run check_services.py --downloadASF, capture the exit code, and start ssara_*.bash only if the exit code is 0.
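
In minsarApp.bash that could look roughly like this (check_services.py and its exit-code convention are the proposal above, not existing code; the ssara invocation is just a placeholder):

    # sketch: gate the download step on the service check
    check_services.py --downloadASF
    exit_code=$?
    if [ $exit_code -eq 0 ]; then
        bash ssara_*.bash        # placeholder for the actual download step
    else
        echo "downloadASF check failed (exit code $exit_code), not starting the download"
        exit 1
    fi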

If one service is down (e.g. downloadASF, demServer), it should go into a waiting loop, try again after 5 hours and give up after 2 days. Can you think about how to implement this?
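
One way to implement that (a sketch using the 5-hour / 2-day numbers proposed above):

    # retry the check every 5 hours; 48 h / 5 h ~ 10 attempts before giving up
    max_attempts=10
    attempt=1
    until check_services.py --downloadASF; do
        if [ $attempt -ge $max_attempts ]; then
            echo "service still down after ~2 days, giving up"
            exit 1
        fi
        echo "service down, retrying in 5 hours (attempt $attempt of $max_attempts)"
        sleep 5h
        attempt=$((attempt + 1))
    done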

I think it should create a check_services.log and add an entry for each check.

It should display the commands it runs on the screen. If that interferes with proper interpretation of the exit code (I don't think it does), we can add a --verbose option.

Here are some ideas on how to check whether servers are online: https://www.2daygeek.com/linux-command-check-website-is-up-down-alive/

Another idea: can we display the results on a website? Sort of a traffic light with five lights; if they are all green, everything is online. We could upload a mini status file to jetstream for that.
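
A minimal version of that (a sketch; the file name and the destination path on jetstream are made up for illustration) could just write one line per service and push it to the web server:

    # write a tiny status file and upload it to jetstream; a static page could then color the lights
    {
        echo "downloadASF: $(check_services.py --downloadASF > /dev/null 2>&1 && echo green || echo red)"
        echo "demServer:   $(check_services.py --demServer   > /dev/null 2>&1 && echo green || echo red)"
    } > service_status.txt
    scp service_status.txt $REMOTE_SERVER:/var/www/html/service_status.txt   # destination path is an assumption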

Ovec8hkin commented 3 years ago

You need to be significantly more specific here. What services are being checked? How do you check them? What constitutes them being "down"? etc.

falkamelung commented 3 years ago

The basic idea is to check whether the server (service) from which we download is online and whether downloading works. In many cases processing failures are the result of a server being down. The timeout command is one option to do this.

It is a bit difficult to test whether outages will be caught because the services are most of the time online. We will only know once an outage occurs.

Actually, we have control over jetstream and insarmaps server. On those we can temporarily close ports to check whether an outage is caught.

Ovec8hkin commented 3 years ago

This provided me no new information. Namely, you have failed to answer the major question here: how do you check whether a given service is down or not? I want to know EXACTLY how you are checking whether EACH service is online. Provide server URLs, command-line commands, etc. for EVERY service you want checked. Until you have provided that, I cannot even begin to consider how to write this type of script.

mirzaees commented 3 years ago

@falkamelung I don't think this is practically doable. If the service is not working, even a simple 'ls' command would not work, let alone running bash scripts to check it. It has happened to me several times that $WORK or $SCRATCH were under maintenance, and I could do almost nothing.

Ovec8hkin commented 3 years ago

@mirzaees That is a valid concern. I presume that $WORK or $SCRATCH being down is the least common of the possible outages, so, to some degree, we could check most of the other servers. If $WORK or $SCRATCH are down, as you mentioned, you can't run anything anyway, so I don't think it really matters. Running code that needs $WORK/$SCRATCH will obviously fail if those services go down mid-run, but there isn't anything we can do about that.

falkamelung commented 3 years ago

Hi @mirzaees , for --workDir I verified that it works as suggested above:

timeout 0.01 ls  $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
0
//login3/work/05861/tg851601/stampede2/insarlab/WEATHER/ERA5[1060] timeout 0.001 ls  $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
124

For $SCRATCH it also works:

timeout 0.001 ls $SCRATCH >> /dev/null ; echo $?
124
//login3/scratch/05861/tg851601[1064] timeout 0.1 ls $SCRATCH >> /dev/null ; echo $?
0

Hi @Ovec8hkin, I added as many details about the servers as I had handy. Is that enough information? We will get the two missing ones (ECMWF and downloadGEP) as soon as possible.

Ovec8hkin commented 3 years ago

@falkamelung I want you to write a bash script (or just a text file) that runs one check after the other (however you verify each system is online) and prints out the status, so I can see how you're checking each service (and I want you to verify that whatever checks you're using ACTUALLY do what you want). You still have not provided nearly enough detail for me to run any of these checks (for instance, what qualifies as ONLINE for each of these, what timeout should I use for each one, etc.), and I'm not going to ask you manually for each one. When you've done that, I will take it and add the necessary error checking and rerunning.

falkamelung commented 3 years ago

I don't know what an efficient way is to check whether a download service is online. Here are some commands that I would suggest for the first services (but with a longer timeout for $WORK and $SCRATCH, specified as a variable). What is an efficient way for USGS and ASF? There ought to be a better way than downloading an entire data file.

timeout 0.1 ls $WORK >> /dev/null ; echo $?
timeout 0.1 ls $SCRATCH >> /dev/null ; echo $?
sinfo -p $QUEUENAME

curl -n  -L -c $HOME/.earthdatacookie -b $HOME/.earthdatacookie -k -f -O https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/S01W092.SRTMGL1.hgt.zip
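
For the last one, one possibility (an assumption, not yet tested against the Earthdata login) is to use wget --spider, which sends the request but does not download the file:

    # exit code 0 if the file exists and the server answers; credentials may still be needed (e.g. via ~/.netrc)
    timeout 10 wget --spider -q https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/S01W092.SRTMGL1.hgt.zip ; echo $?
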
Ovec8hkin commented 3 years ago

I would advise that you determine how to make these checks first, and then we can revisit this. Since you haven't determined how to perform some of these checks yet, it suggests to me that a lot of these services are fairly stable, which will make actually testing whether they're down very difficult in the first place, and maybe unnecessary altogether.

You may want to check with your contacts at some of these data download facilities to see if there is a public server address you can ping and get back a response code for. That would be the easiest way to check if their servers are active.

Ovec8hkin commented 3 years ago

https://datapool.asf.alaska.edu is a dead link. Please update.

--2021-03-16 11:21:07--  https://datapool.asf.alaska.edu/
Resolving datapool.asf.alaska.edu (datapool.asf.alaska.edu)... 137.229.86.206
Connecting to datapool.asf.alaska.edu (datapool.asf.alaska.edu)|137.229.86.206|:443... connected.
HTTP request sent, awaiting response... 404 NOT FOUND
Remote file does not exist -- broken link!!!

falkamelung commented 3 years ago

I am getting this one:

[screenshot of the datapool.asf.alaska.edu page]

You don't get this? You may have to use a more specific address, as in ssara_federated_query.bash. It's also possible that I have some credentials stored.

Ovec8hkin commented 3 years ago

I get that screen, but I get the 404 error through wget. I use the presence of a '200 OK' response to determine online status, so I can't determine the status of this service while the URL gives a 404.

Additionally, I need a wget URL to check insarmaps, and I need more info on how to determine the status of the queue.

falkamelung commented 3 years ago

Not sure I understand. So checking the ASF server does not work well? If there is no proper wget command, we can download a file (and interrupt after 1 second) to see whether anything comes over.

If you are not sure, this is a good case for putting multiple options into the script (commented out); when the server is indeed down, we can try them in detail.

Ovec8hkin commented 3 years ago

I can wget the server, but I am doing so in a way that doesn't initiate a download (it just asks the server whether it is available for a download request). The URL https://datapool.asf.alaska.edu returns a 404 Not Found error code because that page doesn't exist. I probably need a more specific URL to wget.
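
For reference, the kind of non-downloading probe meant here (a sketch; the more specific ASF URL still has to be chosen):

    # --spider sends the request without downloading; -S prints the server response headers
    wget --spider -S https://datapool.asf.alaska.edu 2>&1 | grep "HTTP/"    # currently reports 404 NOT FOUND
    # with a suitable granule URL the same command should report '200 OK' instead:
    # wget --spider -S https://datapool.asf.alaska.edu/<more-specific-path> 2>&1 | grep "HTTP/"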