GeoNet / help

An issues repo for technical help questions.

help with retrieval of data from the GeoNet HTTPS repository #96

Closed · elidana closed this 2 years ago

elidana commented 2 years ago

We have received some similar questions from end users on how to optimize data retrieval from the GeoNet HTTPS data repository.

I am copying those requests here so that we can point other users to this reply if needed.

User 1:

I want to work with a few days of GeoNet data. In the past I'd have copied a day's worth using an FTP tool (FileZilla, etc.) to scratch or similar and viewed it there. How can I do something similar from https://data.geonet.org.nz/ ?

User 2:

I find it much more convenient to mirror the directory that has the site logs, as opposed to manually downloading log files one by one through endless clicking. The command below used to work, but now, after the connection message, I never get a response to the HTTP request, so the server appears to be ignoring it.

wget --mirror --no-parent --no-host-directories --cut-dirs=3 https://data.geonet.org.nz/gnss/sitelogs/logs/

--2022-02-17 08:25:02--  https://data.geonet.org.nz/gnss/sitelogs/logs/
Resolving data.geonet.org.nz (data.geonet.org.nz)... 161.65.59.67
Connecting to data.geonet.org.nz (data.geonet.org.nz)|161.65.59.67|:443... connected.
HTTP request sent, awaiting response... ^C

The same URL works fine in a web browser, and gives me a page with a list of links for each log file.

A wget command to download an individual file works just fine:

wget https://data.geonet.org.nz/gnss/sitelogs/logs/2406_20150309.log
--2022-02-17 08:30:00--  https://data.geonet.org.nz/gnss/sitelogs/logs/2406_20150309.log
Resolving data.geonet.org.nz (data.geonet.org.nz)... 161.65.59.67
Connecting to data.geonet.org.nz (data.geonet.org.nz)|161.65.59.67|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 10385 (10K) [application/octet-stream]

Saving to: ‘2406_20150309.log.1’

2406_20150309.log.1       100%[====================================>]  10.14K  --.-KB/s    in 0s      
2022-02-17 08:30:01 (93.4 MB/s) - ‘2406_20150309.log.1’ saved [10385/10385]

I managed a workaround to fetch all the log files changed since 2020: saving the HTML page from my browser, extracting the URLs from the HTML file with grep and awk, and then running wget in a loop to get each file. But there has to be a better way…
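For reference, the loop part of that workaround boils down to something like this (untested sketch; I'm using grep/sed here for the link extraction, and assuming the index page uses plain relative href links, with the since-2020 date filtering left out):

# rough sketch of the workaround: pull the .log links out of the index page
# and fetch each one (filtering for files changed since 2020 omitted here)
base=https://data.geonet.org.nz/gnss/sitelogs/logs
curl -s "${base}/" \
  | grep -o 'href="[^"]*\.log"' \
  | sed 's/href="//;s/"$//' \
  | while read -r f; do
      wget "${base}/${f}"
    done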

rumachan commented 2 years ago

I don't have any obvious answers to these issues. My go-to method no longer works either.

elidana commented 2 years ago

For GNSS sitelogs, another possible option (still not ideal, but less painful) is to first list all the files available in a specific location, and then loop over the file list.

The example below is for someone who only wants to download the most recent sitelog available for each GeoNet continuous GNSS station whose code starts with "A".

#!/bin/bash

http=https://data.geonet.org.nz/gnss/sitelogs/logs

# list the file names linked from the index page
curl -s ${http}/ | grep -o 'href=".*">' | sed 's/href="//;s/\/">//;s/">//' > tmp.$$

# loop over station codes starting with "a" (file names use lowercase codes)
# and fetch the most recent sitelog for each
for sta in $(grep "^a" tmp.$$ | cut -c 1-4 | sort -u); do
   file=$(grep "^${sta}" tmp.$$ | tail -1)
   curl -O ${http}/${file}
done

rm tmp.$$
tobiscode commented 2 years ago

Hi,

I'm also in the process of changing our workflow from FTP to HTTPS. We previously used Python to parse wget's FTP .listing files to see which files are available on the server, compare their size and last-modified date to what we had locally, and re-download any file that didn't match.

For the first part, seeing which files are available, we now download the HTML index page in Python and use regex patterns to parse out all the available filenames. This is not very elegant, since the index page layout can change with HTTP server updates, so I'm interested to see whether more robust approaches pop up here.

For the second part, checking whether remote files differ from local files (also based on the output of the HTTP index page): looking at some random days (e.g., https://data.geonet.org.nz/gnss/rinex/2022/001/), it seems that all files have apparently been modified in the last two days, including files from 1999. Is there currently some reprocessing going on, meaning I should re-download all the files, or is something happening on the server side that changes the last-modified timestamps for all the files? In the latter case, we will have to rely on comparing file sizes only, which of course also works, but is not 100% robust if the changes are minor.
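For what it's worth, the same per-file metadata can also be read with a HEAD request instead of scraping the index page, e.g. (untested sketch, reusing the sitelog URL from the earlier post):

# print the remote size and last-modified date reported by the server
curl -sI https://data.geonet.org.nz/gnss/sitelogs/logs/2406_20150309.log \
  | grep -iE '^(content-length|last-modified):'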

Thanks!

elidana commented 2 years ago

Hello @tobiscode,

thanks a lot for your input and for providing this useful background on your workflow!

Hopefully what you find below will answer both of your questions.

Alternative access mechanism

Very recently, we have started providing our GeoNet GNSS data via a second (and hopefully more efficient) access mechanism.

Data are now available via an Amazon Web Services storage solution: an AWS S3 bucket. Details are available from https://registry.opendata.aws/geonet/. This is not fully advertised yet on our website, as we are still in the process of populating the S3 bucket with additional datasets, but all GNSS data (except https://data.geonet.org.nz/gnss/rinex1Hz/) are now available via HTTPS (https://data.geonet.org.nz/gnss) and via AWS (s3://geonet-open-data/gnss).

With an AWS client, you can easily sync the GNSS data to your local storage and only download the updated files. Python also has AWS client wrappers you might want to check out, if you are interested (we use boto3).
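For example, pulling everything under the sitelogs prefix down to a local directory could look something like this (a minimal sketch; it assumes the AWS CLI is installed, that the bucket layout mirrors the HTTPS paths, and that the bucket is in the ap-southeast-2 region):

# sync the sitelogs prefix to a local folder, anonymously, downloading
# only files that are new or have changed since the last run
aws s3 sync s3://geonet-open-data/gnss/sitelogs/ ./sitelogs/ \
    --no-sign-request --region ap-southeast-2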

If you are interested and have further questions on that, please feel free to open a new "issue" here, and I'll be more than happy to follow up.

Back to HTTPS:

In terms of robust approaches to downloading data from an HTTPS server, I'd be keen to know if other users have better ideas; we haven't found any, but I'm very pleased to see this conversation and to receive your comment on that. As for the file timestamps, I suspect that the last-modified timestamp is not reliable, and only reflects the fact that the host serving the HTTPS repository has a cache on its backend that gets refreshed every so often (we had some unrelated issues a couple of weeks ago and all the cache was refreshed, which might be why you see files from 1999 being updated).

Real file updates will happen when:

tobiscode commented 2 years ago

Hi,

thanks for your reply!

With regard to the HTTPS side, it does seem like the 1999 data was updated again yesterday, so yes, the timestamp does seem to be unreliable, since I'm guessing none of your bullet points applied to that data since my last post. (It's still puzzling, though; I interface with other servers in the same way and they don't seem to have that issue... But I totally understand that this is not a high-priority issue.)

I just tried to install boto3 on the system that we use to download data, and it seems that our system is too old; even using virtual environments, it didn't work out of the box. That's a problem on our side though, and I'll see if/when we can upgrade our system to be able to use the S3 option, but until then I guess we're stuck with HTTPS.

Either way, thanks for clearing things up!

Cheers

elidana commented 2 years ago

Thanks for checking that out @tobiscode,

sorry to hear that your system cannot handle boto3. Another option is to use the AWS CLI in a bash script; the documentation is available at https://docs.aws.amazon.com/cli/latest/reference/s3/, but I suspect that you might bump into a similar issue there too.
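For example, just listing what is available under a prefix (same assumptions as the sync sketch above: AWS CLI installed, bucket layout mirroring the HTTPS paths) would be something along these lines:

# list the available sitelog files without AWS credentials
aws s3 ls s3://geonet-open-data/gnss/sitelogs/logs/ --no-sign-request --region ap-southeast-2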

And yes, I confirm that we are not updating the 1999 files. Our own setup might be different from that of other data centres, which might be why you see differences.

Thanks to you for your interest and comment!

Elisabetta

tobiscode commented 2 years ago

Hi,

just as a follow-up, conda came to the rescue and I was able to set up Python with boto3 on our old machine. It's working flawlessly so far, and on top of being able to compare last-modified timestamps, this new way is much faster than both FTP and HTTPS. Thanks for the help, and for setting up the S3 storage in the first place!

Cheers

elidana commented 2 years ago

Oh, that is fantastic news. Thanks a lot @tobiscode for this feedback!