Open falkamelung opened 3 years ago
You need to be significantly more specific here. What services are being checked? How do you check them? What constitutes them being "down"? etc.
The basic idea is to check whether the server (service) from which we download is online and whether downloading works. In many cases processing failures are the result of a server being down. The timeout command is one option to do this.
It is a bit difficult to test whether outages will be caught because the services are most of the time online. We will only know once an outage occurs.
Actually, we do have control over the jetstream and insarmaps servers. On those we can temporarily close ports to check whether an outage is caught.
This provided me no new information. Namely, you have failed to answer the major question here: how do you check whether a given service is down or not? I want EXACTLY how you are checking whether or not EACH service is online. Provide server URLs, command-line commands, etc. for EVERY service you want checked. Until you have provided that, I cannot even begin to consider how to write this type of script.
@falkamelung I think this is not easily doable in practice. If the service is not working, even a simple 'ls' command would not work, let alone running bash scripts for checking. It has happened to me several times that $WORK or $SCRATCH were in maintenance, and I could do almost nothing.
@mirzaees That is a valid concern. I presume that $WORK or $SCRATCH being down is the least common of the possible outages, so, to some degree, we could check most of the other servers. If $WORK or $SCRATCH are down, as you mentioned, you can't run anything anyway, so I don't think it really matters. Running code that needs $WORK/$SCRATCH will obviously fail if those services go down mid-run, but there isn't anything we can do about that.
Hi @mirzaees, for --workDir I verified that it works as suggested above:
timeout 0.01 ls $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
0
//login3/work/05861/tg851601/stampede2/insarlab/WEATHER/ERA5[1060] timeout 0.001 ls $WEATHER_DIR/ERA5/ERA5_N20_N40_E60_E80_20150608_01.grb >> /dev/null ; echo $?
124
For $SCRATCH it also works:
timeout 0.001 ls $SCRATCH >> /dev/null ; echo $?
124
//login3/scratch/05861/tg851601[1064] timeout 0.1 ls $SCRATCH >> /dev/null ; echo $?
0
Hi @Ovec8hkin, I added details about the servers, as much as I had handy. Is that enough information? We will get the two missing ones asap (ECMWF and downloadGEP).
@falkamelung I want you to write a bash script (or just a text file) that just runs one check after the other (however you verify each system is online) and prints out the status, so I can see how you're checking each service (and I want you to verify that whatever checks you're using ACTUALLY do what you want). You still have not provided nearly enough detail for me to run any of these checks (for instance, what qualifies as ONLINE for each of these, what timeout should I use for each one, etc.), and I'm not going to manually ask you for each one. When you've done that, I will take it and add the necessary error checking and rerunning.
I don't know what an efficient way is to check whether a download service is online. Here are some commands that I would suggest for the first services (with a longer timeout for WORK and SCRATCH, specified as a variable). What is an efficient way for USGS and ASF? There ought to be a better way than downloading an entire data file.
timeout 0.1 ls $WORK >> /dev/null ; echo $?
timeout 0.1 ls $SCRATCH >> /dev/null ; echo $?
sinfo -p $QUEUENAME
curl -n -L -c $HOME/.earthdatacookie -b $HOME/.earthdatacookie -k -f -O https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/S01W092.SRTMGL1.hgt.zip
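To make this concrete, here is a minimal sketch of the requested one-check-after-the-other script, built around the commands above. The timeout values and the curl HEAD check against the USGS host are assumptions, not verified choices:

```shell
#!/usr/bin/env bash
# Sketch only: run each check under a timeout and print ONLINE/OFFLINE.
# Timeout values and the HEAD-request check are assumptions.

run_check() {
    # run_check NAME TIMEOUT CMD...  ->  "NAME: ONLINE" or "NAME: OFFLINE (exit N)"
    name=$1; limit=$2; shift 2
    if timeout "$limit" "$@" > /dev/null 2>&1; then
        echo "$name: ONLINE"
    else
        echo "$name: OFFLINE (exit $?)"
    fi
}

run_check WORK     5 ls "$WORK"
run_check SCRATCH  5 ls "$SCRATCH"
run_check QUEUE    5 sinfo -p "$QUEUENAME"
run_check USGS_DEM 10 curl -s -f -I https://e4ftl01.cr.usgs.gov/
```

A useful detail: exit code 124 from timeout means the command was still running when the limit expired, which distinguishes "slow or hung" from an immediate failure such as a bad path or a refused connection.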
I would advise that you determine how to make these checks first, and then we can revisit this. Since you haven't determined how to perform some of these checks yet, it suggests to me that a lot of these services are fairly stable, which will make actually testing whether they're down very difficult in the first place, and maybe unnecessary at all.
You may want to check with your contacts at some of these data download facilities to see if there is a public server address you can ping and get back a response code from. That would be the easiest way to check if their servers are active.
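One cheap way to do that, assuming a facility exposes a plain HTTPS endpoint: ask curl for the status code only, without fetching a body. This is a sketch; the URL in the usage line is just the datapool host from this thread, not a confirmed status endpoint:

```shell
# check_http URL -> succeeds only if the server answers HTTP 200.
# -s silent, -o discards the body, -w prints only the status code, 10 s cap.
check_http() {
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")
    [ "$code" = "200" ]
}

# Placeholder usage; substitute each facility's real landing/status URL:
check_http "https://datapool.asf.alaska.edu" && echo ONLINE || echo OFFLINE
```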
https://datapool.asf.alaska.edu is a dead link. Please update.
--2021-03-16 11:21:07-- https://datapool.asf.alaska.edu/
Resolving datapool.asf.alaska.edu (datapool.asf.alaska.edu)... 137.229.86.206
Connecting to datapool.asf.alaska.edu (datapool.asf.alaska.edu)|137.229.86.206|:443... connected.
HTTP request sent, awaiting response... 404 NOT FOUND
Remote file does not exist -- broken link!!!
I am getting this one:
You don't get this? You may have to use a more specific address, as in ssara_federated_query.bash. It's also possible that I have some credentials stored.
I get that screen, but I get the 404 error through wget. I use the presence of a '200 OK' response to determine online status, so I can't determine the status of this service while the URL gives a 404.
Additionally, I need a wget URL to check insarmaps, and more info on how to determine the status of the queue.
Not sure I understand. So checking for the ASF server does not work well? If there is no proper wget command, we can download a file (and interrupt after 1 second) to see whether anything comes over.
If you are not sure, this is a good case to put multiple options into the script (commented out), and when the server is indeed down we can try them in detail.
I can wget the server, but I am doing so in a way that doesn't initiate a download (it just asks the server whether it is available for a download request). The URL https://datapool.asf.alaska.edu returns a 404 Not Found error code, because that page doesn't exist. I probably need a more specific URL to wget.
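For "asking the server without initiating a download", wget has a built-in mode: --spider sends the request but saves nothing, and its exit status reflects whether the server answered OK. The URL below is still the bare datapool host; per this discussion it needs a more specific path to return something other than 404:

```shell
# --spider: check availability without downloading; exits 0 only on a 2xx answer
# (a 404 makes wget exit nonzero, matching the problem described above).
url="https://datapool.asf.alaska.edu"   # placeholder: needs a more specific path
if wget --spider -q --timeout=10 --tries=1 "$url"; then
    echo "ASF: ONLINE"
else
    echo "ASF: OFFLINE (or URL invalid, e.g. 404)"
fi
```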
One reason why processing fails is that one of the data servers or $WORK is offline or slow. We need a script to check the different services used. This script should be run in the workflow before a particular service is used; if the service is down, the check should be rerun after 1 or 5 hours. It can also be run daily from cron and send an email if a service is down.
We could use the timeout command. For each service we could run the same check with three timeouts, e.g.
timeout 0.1 cmd; echo $?
timeout 1 cmd; echo $?
timeout 5 cmd; echo $?
returning ONLINE, SLOW, or OFFLINE, respectively, depending on which timeout the command completes within. Is the timeout command appropriate for everything, or is there something better?
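The tiered-timeout idea can be wrapped in a single function. This is a sketch: it collapses the 1 s and 5 s tiers into one SLOW tier, and the thresholds are the guesses from above, not measured values:

```shell
# classify CMD... -> ONLINE if the command finishes within 0.1 s,
# SLOW if it finishes within 5 s, OFFLINE otherwise.
# Thresholds are placeholders; tune per service.
classify() {
    if timeout 0.1 "$@" > /dev/null 2>&1; then
        echo ONLINE
    elif timeout 5 "$@" > /dev/null 2>&1; then
        echo SLOW
    else
        echo OFFLINE
    fi
}
```

Usage would be e.g. `classify ls "$SCRATCH"`. One caveat: a command that fails instantly (bad path, refused connection) also reports OFFLINE, since it never succeeds at any tier.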
Later:
I will add this to every step in minsarApp.bash, i.e. run check_services.py --downloadASF, capture the exit_code, and start ssara_*.bash only if exit_code=0. If one service is down (e.g. downloadASF, demServer), it should go into a waiting loop, try again after 5 hours, and exit after 2 days. Can you think about how to implement this?
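The waiting loop could be sketched as below; wait_for_service is a hypothetical wrapper around any check command that exits 0 when the service is up (such as the proposed check_services.py), and the intervals are the 5-hour/2-day values from above:

```shell
# wait_for_service CMD... -> retry CMD every $retry_secs seconds until it
# succeeds, giving up once $max_secs have elapsed. Values are the ones
# discussed above (5 hours / 2 days), overridable for testing.
wait_for_service() {
    retry_secs=${RETRY_SECS:-18000}    # 5 hours
    max_secs=${MAX_SECS:-172800}       # 2 days
    waited=0
    while ! "$@"; do
        waited=$((waited + retry_secs))
        if [ "$waited" -ge "$max_secs" ]; then
            echo "giving up after ${waited}s" >&2
            return 1
        fi
        sleep "$retry_secs"
    done
}

# Hypothetical usage: wait_for_service check_services.py --downloadASF \
#   && start the ssara_*.bash step
```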
I think it should create a check_services.log and add an entry for each check. It should display the commands that it runs to the screen. If that interferes with proper interpretation of the exit code (I don't think it does), we can have a --verbose option. Here are some ideas on how to check whether servers are online: https://www.2daygeek.com/linux-command-check-website-is-up-down-alive/
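A minimal sketch of the per-check log entry plus on-screen display; the line format (UTC timestamp, service, status) is a suggestion, not a spec:

```shell
# log_check SERVICE STATUS: echo the result to the screen and append a
# timestamped line to check_services.log (format is an assumption).
log_check() {
    line="$(date -u '+%Y-%m-%dT%H:%M:%SZ') $1 $2"
    echo "$line"
    echo "$line" >> check_services.log
}
```

Usage would be e.g. `log_check downloadASF ONLINE` after each check runs.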
Another idea: can we display the results on a website? Sort of a traffic light with 5 lights: if all are green, everything is online. We could have a mini-file uploaded to jetstream for that.