INTERMAGNET / wg-www-gins-data-formats

Repository to track working group discussions for WWW/Gins/Data Formats

Data transfer upgrade from RSYNC #6

Closed CharlesBlais closed 1 year ago

CharlesBlais commented 5 years ago

Open discussion on the future of data transfer protocol between GINs and the archive and, potentially, the institutes to the GINs.

RSYNC is not considered a secure protocol and even securing it using SSH tunnels can pose a security threat to the institute. Allowing port 22 (SSH), even if secured in a DMZ or any other mechanism, is a risk to the infrastructure.

Is using a messaging protocol an option? Possible options are:

SEEDlink is used in Canada, USGS, and GFZ Potsdam and is standard in the seismic community with commercial tools supporting it.

MQTT is used in Belgium and Vienna with customized tools.

In all cases, metadata is not transferred.

Should we use one approach for real-time data and another for archive transfers, which have no real-time requirement?

CharlesBlais commented 3 years ago

For DD download by NRCan, you could change to HTTP(S) with directory listing if you want. We could change, with minimal effort I think, to a simple wget with "--mirror" but directory listing must be enabled.
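A minimal sketch of what a mirror client needs from an enabled directory listing: it parses the index page for links and separates files from subdirectories to recurse into. This is roughly what `wget --mirror` does internally, not a replacement for it. The host `gin.example.org`, the path, and the file name are hypothetical, and the function name `list_directory` is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_directory(index_html, base_url):
    """Return (files, subdirs) as absolute URLs found in a directory listing."""
    parser = LinkCollector()
    parser.feed(index_html)
    files, subdirs = [], []
    for href in parser.hrefs:
        # Skip column-sort links, parent/absolute links and external URLs.
        if href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        (subdirs if href.endswith("/") else files).append(target)
    return files, subdirs


# Hypothetical listing, as a web server with directory listing enabled might return it.
sample = (
    '<a href="?C=N;O=D">Name</a><a href="/">Parent Directory</a>'
    '<a href="2021/">2021/</a><a href="ott20210325vmin.min">ott20210325vmin.min</a>'
)
files, subdirs = list_directory(sample, "https://gin.example.org/dd/")
```

Without directory listing, the client has no index page to parse, which is why the listing must be enabled for this approach.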

leonro commented 3 years ago

Regarding https://github.com/INTERMAGNET/wg-www-gins-data-formats/issues/6#issuecomment-806684266: IMBOT can easily be changed to use SSL. Currently IMBOT temporarily mounts the Paris GIN DD directory using curlftpfs. This can be changed to sshfs, which is already implemented and just needs to be activated.

vmaury-ipgp commented 3 years ago

@CharlesBlais ok I can change to HTTP(S) with login/password (I hope I can do that rapidly).

@leonro ok, since you already have an account, we can configure SSH key authentication and limit access to specific IP addresses.

CharlesBlais commented 3 years ago

Example docker ringserver/slarchive: https://github.com/CharlesBlais/docker-intermagnet-example. Example tool for converting IAGA2002 to miniseed: https://github.com/CharlesBlais/pyiaga2002.

bgeels-USGS commented 1 year ago

I've been thinking about how the proposed seedlink layout would handle non-sequential data. For realtime data I assume most GINs are transferring data sequentially, but I know that for QD data our processing folks will sometimes finish processing on a month of data before the previous month is completed so sometimes these datasets are uploaded non-sequentially. My current understanding of seedlink is that the client program typically streams data sequentially and maintains a state file that keeps track of the sequence number that corresponds with the latest data block that was received for each channel. I should point out that I'm basing this on what I've seen from Slinktool and from the seedlink library that our EdgeCWB system uses. I'm not familiar with the Slarchive tool that @CharlesBlais has proposed, perhaps that has a routine that regularly tries to fill past gaps? Apologies if this was already addressed earlier in this thread, I haven't quite read through all of the responses yet.

CharlesBlais commented 1 year ago

Hi @bgeels-USGS, quite right that data is tagged by sequence number, but the data inside doesn't have to be sequential. You can send data for minute 2 and later send data for minute 1, for example. The sequence number is just used by the state file to determine what to ask for. A while back I made a docker example for the group (as in the earlier comment), and you could send an entire daily file if needed.

Slarchive is a "dumb" client that just takes those packets and appends them into a predefined file structure. In the example I gave, it's SDS format. You could send duplicate data and it would just get appended. The one thing we do in Canada, however, is take that data and compress it further (since miniseed packets are 512 bytes and we send incomplete packets). In the end, the files are extra small. BGS would have to do something similar but writing into their own structure.

However, in Canada, we actually use SeisComP CAPS (somewhat similar) for exchange between our data centres, but it's a paid solution https://www.gempa.de/products/caps/ (I think it's cheap, however). It works great by syncing data from multiple sources, and we have quick support from Gempa. For example, Canada operates two data centres. If BGS used slarchive and connected to both, they would have to handle duplication (same data from both locations). CAPS does everything for you behind the scenes. The other beauty of CAPS is that it carries your own data format wrapped in a unique CAPS header. You could send real-time stream pictures, for example.

FYI, on the push vs pull topic that came up during the meetings, Canada tends to favor pull rather than push, especially from remote unmanned stations. We operate multiple data centres, so it's much easier to manage connections from the data centre, and we have better control over the streams (since we pull). It's the same for data from institutes: it's much easier to stop the pull. It's all because of cyber-security.

SimonFlower commented 1 year ago

Thanks for the useful discussion so far. On the issue of out of sequence data, I don't think this will cause us any issues in BGS. Provided we have, or can calculate, the time stamp for each data sample (which must be the case) we can work out where to put the sample in the data structures we use.
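As a rough illustration of why out-of-sequence and duplicated packets need not be a problem once every sample has a time stamp, here is a minimal sketch that files samples by time rather than by arrival order. It is not BGS's actual data structure. The function names, the channel name "H", the one-minute interval, and the values are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def place_samples(store, channel, start, interval, values):
    """File each sample under its own time stamp, regardless of arrival order.

    A duplicate packet simply overwrites the same keys with the same values.
    """
    series = store.setdefault(channel, {})
    for offset, value in enumerate(values):
        series[start + offset * interval] = value


def ordered(store, channel):
    """Return the channel's samples as (time, value) pairs sorted by time."""
    return sorted(store.get(channel, {}).items())


store = {}
minute = timedelta(minutes=1)
t0 = datetime(2023, 1, 1, tzinfo=timezone.utc)

place_samples(store, "H", t0 + minute, minute, [20001.2])  # minute 2 arrives first
place_samples(store, "H", t0, minute, [20001.0])           # minute 1 arrives later
place_samples(store, "H", t0 + minute, minute, [20001.2])  # duplicate of minute 2

series = ordered(store, "H")
```

Keying on the time stamp also covers the two-data-centre case mentioned earlier in the thread, since the same sample received from both locations lands on the same key.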

Having made the decision to work on both MQTT and Seedlink, I'd like to start separate issues on GitHub for both (and close this issue). The motivation is that the areas I think we need to discuss are a little different for the two technologies. I'd put a link to this issue in each of the new issues. Can you let me know whether you are happy with this?

bgeels-USGS commented 1 year ago

@SimonFlower that sounds good to me. @CharlesBlais after doing some testing this morning I was able to verify that a ringserver will relay data that is fed to it non-sequentially as long as the client requesting the data provides a start date when sending the "DATA" command (this is usually optional). So I agree, this is indeed a non-issue.
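For readers unfamiliar with the command, a sketch of how a client might build a "DATA" request with a start time. The layout used here (a six-digit hexadecimal sequence number followed by a comma-separated `year,month,day,hour,minute,second` time) is an assumption based on the SeedLink v3 convention as commonly documented; it is not taken from this thread and should be checked against the ringserver and SeedLink documentation. The function name is hypothetical.

```python
from datetime import datetime, timezone


def seedlink_data_command(sequence, begin):
    """Build the text of a DATA request with an explicit start time.

    Assumed layout: "DATA <6 hex digits> <Y,M,D,h,m,s>" in UTC.
    `begin` must be a timezone-aware datetime.
    """
    stamp = begin.astimezone(timezone.utc).strftime("%Y,%m,%d,%H,%M,%S")
    return f"DATA {sequence & 0xFFFFFF:06X} {stamp}"


command = seedlink_data_command(0, datetime(2023, 1, 1, tzinfo=timezone.utc))
```

The point of the start time is that the server selects data from that time onward, instead of resuming only from the last sequence number the client recorded.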

SimonFlower commented 1 year ago

Closing this issue, as it has been superseded by #12 and #13