Closed CharlesBlais closed 1 year ago
I would like to encourage discussion on this item once more. We are finding RSYNC ever more difficult to manage. From Canada's POV, incoming connections are difficult to justify but outgoing connections are not.
The RSYNC port does not fall within the standard protocols permitted by many organizations and is, quite frankly, archaic for what INTERMAGNET is using it for. Are there any alternatives? We briefly talked in committee about Kafka and MQTT, which could be an option if the receiving institute permits these non-standard ports.
In many cases, HTTPS is the most widely permitted communication port, so are there any transfer protocols that can run over it?
We use rsync over ssh, but I know that has its own issues.
WebDAV can be enabled to allow file uploads over HTTPS, possibly even with an SVN server if you want versioning 🙂 Alternatively, could an S3 bucket with ACLs be set up to allow uploads there, and you could download from that external location?
Thanks,
Jeremy Fee Computer Scientist USGS Geologic Hazards Science Center, Golden, CO
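The WebDAV suggestion above comes down to an HTTP PUT sent over HTTPS, which is an outgoing connection on port 443 from the sender's side. A minimal sketch using only the Python standard library is below. The host name, file name and bearer token are hypothetical placeholders, and the authentication scheme a real WebDAV server expects may well differ.

```python
import urllib.request


def build_webdav_put(url: str, payload: bytes, token: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS PUT request for a WebDAV-style upload."""
    request = urllib.request.Request(url, data=payload, method="PUT")
    request.add_header("Content-Type", "text/plain")
    # Assumption: the server accepts a bearer token; basic auth or client
    # certificates are equally plausible for a real deployment.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical URL and token, for illustration only.
req = build_webdav_put(
    "https://gin.example.org/dav/ott20200707vmin.min",
    b" Format                 IAGA-2002\n",
    "hypothetical-token",
)
# urllib.request.urlopen(req) would perform the actual upload.
```

As noted later in the thread, this transfers whole files rather than differences, so it suits periodic uploads better than real-time streaming.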
Ya, SSH is again one of those very scary ports.
Yes, we have one project using it but my knowledge of it is very limited. It's like a file structure shared over HTTPS with directory and file creation. In other words, it is transferring whole files and not differences, correct? It might be a bit heavy on bandwidth (and not really meant for real-time) but definitely something worth considering.
As for S3, who pays? That is a big problem. INTERMAGNET isn't that much data so the tab isn't too big on an S3 bucket, but for most governments paying for that kind of service is difficult. Also, you pay when getting data out of the S3 bucket.
Public dataset could be an option (Amazon pays) except some of the data is not public...
USGS is mid conversion to streaming 1Hz seedlink to users, following conventions we previously discussed. Definitely open to using that in favor of file transfers. The geomag-algorithms python library could also be used to convert IAGA-2002 inputs into MiniSEED for users that do not produce it natively.
Ya, same with Canada. We use SeedLink internally but not yet with external partners (hopefully soon). Canada and the US have lots in common from that POV since both our organizations share seismic infrastructure, but not everyone does (which is where the challenge stands). It's COTS (commercial off the shelf) so I am always in favor of that versus custom stuff. So maybe a better question to ask is what can GINs support between themselves? As you say, libraries are available for institutes to convert IAGA2002 to miniSEED.
Boulder, Ottawa = SeedLink
Paris = ?
BGS = ?
Kyoto = ?
Does geomag-algorithms support writing to a ringserver? I know ObsPy supports reading but not writing. We wrote one in Python that converts to DataLink for the ringserver, so we could share. If so, then the underlying infrastructure is ready to go and it's a matter of giving simple install instructions.
We use an intermediate EDGE server that accepts miniseed, archives data for later query, and forwards to ringserver for distribution, but I'm also interested in your ringserver code if that is open source.
We should discuss with other GINs because there are several options.
I can't share the whole program since certain parts have more Canada-specific components, but I extracted the parts relevant to this topic, with an example console_script (untested).
Added all GINs (except Kyoto, since it has no GitHub representation) for their input/capabilities on improving data distribution. Having some improvement over rsync amongst GINs may be a simpler first step.
We've spent some time discussing this (I can see a discussion document from 10 years back) without coming to any conclusions. To make progress I think the first thing we need to do is to agree what it is we are trying to achieve. I don't think replacement of rsync is a sufficient description. Here are my suggestions as a starting point for a discussion on our requirements:
Charles is correct, we could take advantage of the SeedLink experience available within the GA seismic network and would certainly be interested. For a while now we have been thinking about ways the geomag network could be better integrated into the bigger pool of the GA seismic system. I think it will still take us quite a while to get things moving and work out how best to implement it though.
BGS has a seismic section that uses Seedlink, but I think it's very unlikely we'd integrate our Geomag transmission into their systems, partly because ours are more reliable and already give us what we need. However we could use their expertise to set up Seedlink receiving systems for the other GINs to transmit to.
In answer to point 3, the metadata system that we have at BGS (and which is available to the community) will make it possible to add much of the metadata needed to construct an IAGA-2002 file from a Seedlink stream I think?
We need some input from the other two GINs (France and Japan) on whether Seedlink would be something they can integrate. Are Hiro and Virginie following this issue?
I googled and found that the IPGP seismic section uses Seedlink as well. I am trying to contact my colleagues to find out more (I'm not sure we will have an answer before the end of the meeting, tomorrow is National Day). Personally I have no experience in the matter and have no idea how to integrate this in our system.
Answer to point 3: Our problem is not the cost but the workforce, because I'm the only IT staff. So if we can use the expertise of the seismic community it can only be beneficial.
Ya, I feel you Virginie! That's good to know about IPGP seismic stations using SeedLink. The challenge won't necessarily be getting SeedLink set up, since I think it can be made transparent to the installer. The challenge will be more related to having to convert IAGA2002 for real-time streaming. Your GIN, or even Kyoto, gets data from other stations that aren't SeedLink. As before, it's not a one-to-one relationship. Also, how do we deal with backfill, corrections, etc.? Any solution will require development (or infrastructure modification), so I don't think there is an easy way out of this.
Bonne Bastille!
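To make the IAGA2002 conversion step mentioned above more concrete, here is a rough sketch of the first half of it: reading IAGA-2002 text into per-channel time series that could then be handed to a miniSEED writer. The sample file is abbreviated (a real header has more lines) and its values are invented. Treating anything at or above 88888 as absent is a simplification of the format's 88888/99999 sentinel values. Packing into miniSEED is not shown and would be left to one of the libraries mentioned in this thread.

```python
from datetime import datetime, timezone

# Simplification: IAGA-2002 uses 99999 (missing) and 88888 (not recorded) style
# sentinels, so any value at or above this threshold is treated as absent.
ABSENT_THRESHOLD = 88888.0

# Abbreviated, invented sample; a real file carries a longer header.
SAMPLE = """\
 Format                 IAGA-2002                                    |
 IAGA Code              OTT                                          |
DATE       TIME         DOY     OTTX      OTTY      OTTZ      OTTF   |
2020-07-07 00:00:00.000 189     17000.10   -300.20  51000.30  99999.00
2020-07-07 00:01:00.000 189     17000.40   -300.50  51000.60  53800.70
"""


def parse_iaga2002(text):
    """Return {channel: [(utc_datetime, value_or_None), ...]} from IAGA-2002 text."""
    channels = []
    series = {}
    for line in text.splitlines():
        if line.startswith("DATE"):
            # Column header: DATE TIME DOY followed by the channel names.
            channels = line.rstrip().rstrip("|").split()[3:]
            series = {name: [] for name in channels}
        elif channels and line.strip():
            fields = line.split()
            when = datetime.strptime(
                f"{fields[0]} {fields[1]}", "%Y-%m-%d %H:%M:%S.%f"
            ).replace(tzinfo=timezone.utc)
            for name, raw in zip(channels, fields[3:]):
                value = float(raw)
                series[name].append((when, None if value >= ABSENT_THRESHOLD else value))
    return series


series = parse_iaga2002(SAMPLE)
```

Keeping absent samples as `None` rather than dropping them leaves the decision about gaps and backfill to whatever consumes the series.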
The situation at Kyoto GIN is basically the same as at Paris, I have to say. I'm making a search over the Internet regarding Seedlink, while I'm making an enquiry to IIMC of my university that is in charge of information/internet security of any software with data exchange.
Personally I have no experience with Seedlink, but if things in GWD are going for it, Kyoto GIN can switch to rely on it.
Charles, you are right; as far as I know, in seismology they just transfer raw data. The BCMT is developing a new acquisition system, so I asked my colleagues how they deal with real-time transfer.
Thanks Hiro and Virginie for your updates. Can I suggest we have two actions:
1.) On USGS, Kyoto, IPGP, NRCan to investigate the possibility of using Seedlink to transfer GIN data to BGS
2.) On BGS to investigate the possibility of setting up a Seedlink receiver to receive GIN data from the other GINs
I'm pretty sure that this will be possible at BGS, but I need to check. If it is possible it will probably take some time to implement because we need to prioritize getting the existing rsync arrangements transferred from NRCan to BGS.
I agree. Yes, the rsync is a priority because there are still plenty of challenges in getting from file-based transfer to SeedLink, and we recognize the lack of resources (people) at other institutes. Canada and USGS have it easy since we don't handle as many observatories as the other three, but we are two resources that may be able to contribute knowledge, at least, since we both use SeedLink currently. I'll add it as an action item amongst all GINs but also open it up to other institutes, like @ALewis-GA, who can possibly offer their own POV and challenges. If any other institute would like to contribute, please get involved (the more help the better). For example, @sputnik-a I believe GFZ uses SeedLink for some parts.
Thanks Charles. Understanding the restrictions on resources, I'd still like us to have an action that leads us forward to a potential way ahead. If we could at least get to the point where we agree Seedlink will work for us in INTERMAGNET (and thus eliminate consideration of other technologies such as message brokers) I feel this focus on a single technology will help us make progress.
Simon, I realize that it would be good to quickly converge on a technology, but I would like to have my colleague's feedback because if I agree with seedlink, this will inevitably have an impact on my team.
For example, @sputnik-a I believe GFZ uses SeedLink for some parts.
Yes, that's right. We use SeedLink for a few observatories, mainly where cooperation partners already had SeedLink implemented (BFO, VNA). We use SeisComp and bash scripts with mseed2ascii in order to integrate the data in our standard workflow.
Before making a final decision, what is the experience of @leonro and @stephanbracke on using MQTT?
Thanks Virginie, yes I can see that we need to make the right decision, one based on a solid understanding of what is possible at our institutes. I don't want to hurry that process, just be sure that we are starting to move away from a review of what technologies are available and converging towards a solution.
Achim, The code you use sounds quite simple? Would it be possible to describe it further?
MQTT is working fine, very stable and with minimal requirements. We use it internally to stream realtime data from all sensors (usually packets covering one second) onto our main server. Besides, we use it for one external station for realtime streaming (same conditions) accessing one of our IoT brokers. MQTT is the standard protocol of basically all Internet-of-Things (IoT) applications. Currently we are setting up a secure broker (MQTT over SSL) as well. It is very simple to use, it is secure, well supported and tested and very stable. Depending on the data block you are streaming you can also send non-periodic data, updates to previous data etc. And there is a big community (no geoscience however) behind it.
Anyway, I am aware that seismologically related institutes focus on internally well established techniques. And a seed infrastructure is usually available already.
Achim, The code you use sounds quite simple? Would it be possible to describe it further?
This has been mainly implemented by Oliver Bronkalla at GFZ, and I don't know all of the details:
1) SeisComp is used to listen to the incoming data streams and mseed-files are written to a dedicated directory.
2) A bash script is used to check for new files and writes the data into an SQL database. As an intermediate step, temporary ASCII files are produced using mseed2ascii provided by GIPPTools.
3) Daily files in ASCII-format are created and updated from the SQL database. These files are used for further processing.
4) Data gaps are logged and can be filled explicitly. Here, I do not know the exact details.
If you are interested, I could get more details, and we are happy to share code.
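For readers trying to picture steps 2 to 4 of the workflow above, the following is a rough sketch only and not GFZ's code. It uses an in-memory SQLite database as a stand-in for the SQL database. The table layout, function names and sample values are all hypothetical, and the SeisComp and mseed2ascii stages are assumed to have already produced timestamped samples.

```python
import sqlite3
from datetime import datetime, timedelta


def store(conn, samples):
    """Step 2: write samples to the database; a re-sent timestamp replaces the old row."""
    conn.executemany(
        "INSERT OR REPLACE INTO samples (ts, value) VALUES (?, ?)", samples
    )
    conn.commit()


def daily_lines(conn, day):
    """Step 3: produce the ASCII lines for one day (YYYY-MM-DD) from the database."""
    query = "SELECT ts, value FROM samples WHERE ts LIKE ? ORDER BY ts"
    return [f"{ts} {value:.2f}" for ts, value in conn.execute(query, (day + "%",))]


def find_gaps(conn, step_seconds):
    """Step 4: return (first_missing, last_missing) timestamps for each gap."""
    rows = conn.execute("SELECT ts FROM samples ORDER BY ts")
    times = [datetime.fromisoformat(row[0]) for row in rows]
    step = timedelta(seconds=step_seconds)
    return [(a + step, b - step) for a, b in zip(times, times[1:]) if b - a > step]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per timestamp, so duplicates cannot accumulate.
conn.execute("CREATE TABLE samples (ts TEXT PRIMARY KEY, value REAL)")
store(conn, [
    ("2020-07-07T00:00:00", 1.0),
    ("2020-07-07T00:01:00", 2.0),
    ("2020-07-07T00:04:00", 5.0),
])
store(conn, [("2020-07-07T00:01:00", 2.5)])  # corrected value for an existing minute

lines = daily_lines(conn, "2020-07-07")
gaps = find_gaps(conn, 60)
```

Using the timestamp as primary key is one way a database-backed receiver can absorb re-sent or corrected data without duplicating it.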
This sounds like more or less what we'd need to do at BGS to receive incoming miniseed data from other GINs, so yes certainly be interested in what Oliver has done. However maybe it's sensible to wait for updates from Kyoto and Paris first on the suitability of miniseed at their institutes, before starting to look at implementation.
SeisComP3 is licensed software, so is BGS ready to purchase it? That is what we do as well, and we use its FDSNWS capabilities instead of reading miniSEED files so that we are more mobile. We then use this code, posted earlier, https://github.com/CharlesBlais/geomag-fdsnws-query/tree/master/pygeomag to convert from FDSNWS to IAGA2002 and IMFV1.22 format.
fdsnws2directory --url $FDSNWS --directory "/somedirectory/\%Y/\%m" --format iaga2002
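The directory argument in the command above contains `%Y` and `%m`, which look like strftime-style year and month codes. Assuming the tool expands them per day of data (an assumption, the thread does not say), and assuming the backslashes are only shell or crontab escaping (also a guess), the resulting layout would follow the pattern sketched below. The helper name is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def target_directory(template: str, day: date) -> PurePosixPath:
    """Expand strftime-style codes in a directory template for a given day."""
    return PurePosixPath(day.strftime(template))


# Template taken from the command above, without the escaping backslashes.
path = target_directory("/somedirectory/%Y/%m", date(2020, 7, 7))
```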
Do you know what the cost is? This is not the only software capable of handling miniseed streams is it?
As far as I know, SeisComp should be free for non-commercial usage: https://www.seiscomp3.org/license.html
SeisComP3 is free for non-commercial use but I wouldn't know what the free version encompasses exactly. That is something Canada can ask Gempa (or maybe GFZ can, since they have a closer relationship; it's their sister company). https://www.gempa.de/products/seiscomp3/
No, it's not the only software; there is also https://ds.iris.edu/ds/nodes/dmc/software/downloads/slarchive/
For BGS we'd just need to implement a miniseed client wouldn't we - to pull data from miniseed servers at the other GINs? Looks like there are several ways to do that, including ObsPy: https://docs.obspy.org/tutorial/code_snippets/easyseedlink.html#seedlink-tutorial
I imagine it's possible with Earthworm too, which I've had experience of in the past.
I would personally look at slarchive, in the link earlier. It's packaged with SeisComP3 and it's probably the software that GFZ uses. We use CAPS, part of SeisComP3, but you don't need that.
slarchive connects to a SeedLink server, requests data streams and writes received packets into directory/file structures. The precise layout of the directories and files is defined in a format string
This sounds like more or less what we'd need to do at BGS to receive incoming miniseed data from other GINs, so yes certainly be interested in what Oliver has done. However maybe it's sensible to wait for updates from Kyoto and Paris first on the suitability of miniseed at their institutes, before starting to look at implementation.
Undecided since I'm not quite sure what is necessary to set up a seedlink server at Kyoto.
What I learnt is that SeedLink is a software developed/implemented in seismic communities.
IRIS may have intensive experience over SeedLink such as operating ringservers as SeedLink streaming servers:
To configure a server, we need to download source codes from:
https://seiscode.iris.washington.edu/projects/ringserver/files,
which redirects us to a GitHub site (again).
I'm puzzled but is it possible to ask Oliver what is going on in GFZ regarding SeedLink? > Achim-san
Hiro
To answer @sputnik-a: MQTT is still working fine for me. As Roman already explained, it is easy to use and well established. The community is large and help is available online. On the other hand, in deciding what exactly to use, I understand that when the GINs already have a seed infrastructure in place, this is the way to go for them. I don't know what the impact on the client side is, but you would probably need to provide a client software library to enable observatories to integrate. If you have nothing in place, MQTT is probably easier to learn by yourself and clients can use whatever they prefer to integrate. Besides technology, there is still the fact that messages need to be standardized. We started a discussion on that a while ago but it didn't evolve a lot.
Both SeedLink and MQTT are message-passing protocols, and with either you don't have to worry about the underlying mechanism. SeedLink content is defined (but limited) to miniSeed, while MQTT is open.
@hiroakitoh hopefully installing ringserver or ActiveMQ isn't a challenge in your institute (both are software). Any solution would probably require you to install some kind of software. That being said, you're not alone. Whatever is decided needs to be simple and easy for institutes.
My big argument against MQTT is that it will require dev work and implementation by all institutes, and we've recognized that we don't have many resources to help. For SeedLink, IPGP, NRCan, BGS, USGS, GA and GFZ have already shown that they have it already, or have (or potentially have) resources within their own institute for reference and help. The archive mechanism exists, therefore BGS's job is done and the only challenge is providing software (or a methodology) for taking a file to SeedLink. I did an initial test simulating "IAGA2002 to miniSeed": sending it through SeedLink results in a file on the archive with appended content. What I mean is that if you have a stream with data from 0-10 min and you send another one afterwards from 0-20, you get an archive file with 0-10,0-20. SeisComP3 CAPS, however, drops the second packet since it considers it a duplicate.
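The 0-10 / 0-20 behaviour described above can be illustrated with a toy model. It is not how slarchive or CAPS are implemented. It only contrasts an archive that appends whatever arrives with one that keys samples by timestamp and adds only what is missing. Function names and values are hypothetical.

```python
def append_naively(archive: list, packet: list) -> None:
    """Append every received sample, as a plain file-appending archiver would."""
    archive.extend(packet)


def merge_by_timestamp(archive: dict, packet: list) -> int:
    """Add only samples whose timestamp is not yet archived; return the number added."""
    added = 0
    for minute, value in packet:
        if minute not in archive:
            archive[minute] = value
            added += 1
    return added


# Two transmissions of one-minute samples: minutes 0-9, then minutes 0-19.
first = [(minute, 100.0 + minute) for minute in range(0, 10)]
second = [(minute, 100.0 + minute) for minute in range(0, 20)]

naive = []
append_naively(naive, first)
append_naively(naive, second)

merged = {}
added_first = merge_by_timestamp(merged, first)
added_second = merge_by_timestamp(merged, second)
```

The naive archive ends up holding minutes 0-9 twice, while the keyed archive holds each minute once. A keyed archive that ignores existing timestamps cannot accept corrections, though, so a real receiver would need a rule for when newer data should replace older data.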
MQTT does work for transmitting data, but is more of a transport system that is regularly used to transmit timeseries data. It does not define a format, which is usually custom and I suspect varies between groups. It does have options for "reliable" transmission, which are used after an initial connection to establish the subscription, and messages are buffered per client connection.
SeedLink is built around the MiniSEED format, which is a binary format for timeseries used by the seismic community, and is only used to distribute MiniSEED. It is reliable by default, with the server tracking one buffer of data and clients tracking their position within that buffer. It also has the option for new clients to access data that was previously transmitted and is still in the buffer.
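The buffering model in the paragraph above (a server holding one bounded buffer, with each client remembering its own position) can be sketched as a toy ring buffer. This is a simplified illustration only. It is not the SeedLink wire protocol, and the class and method names are made up.

```python
from collections import deque


class RingBuffer:
    """Server-side buffer keeping the newest `capacity` packets, each with a sequence number."""

    def __init__(self, capacity: int):
        self.packets = deque(maxlen=capacity)  # oldest packets fall off automatically
        self.next_seq = 0

    def publish(self, payload: str) -> None:
        self.packets.append((self.next_seq, payload))
        self.next_seq += 1

    def read_after(self, last_seq: int) -> list:
        """Return buffered packets newer than the sequence number the client last saw."""
        return [(seq, payload) for seq, payload in self.packets if seq > last_seq]


server = RingBuffer(capacity=4)
for n in range(6):
    server.publish(f"packet-{n}")

# A client that processed sequence 3 before disconnecting resumes from there.
resumed = server.read_after(3)

# A new client asks for everything still buffered; packets 0 and 1 are gone.
backlog = server.read_after(-1)
```

Because the position is kept by the client, the server needs no per-client state, which contrasts with the per-client buffering described for MQTT above.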
At USGS, we import IAGA-2002 into our server from other organizations by checking existing data gaps and only adding new data (to avoid the 0-10, 0-20 duplication mentioned by @CharlesBlais ). We also use additional EDGE software that supports efficient queries of this data, and a python implementation of the geomag data service (which will eventually support any new CovJSON format).
Partly putting a comment in here in the hope of sparking some conversation...
Reading through the previous contributions, it looks as though we had two possible contenders for an upgrade from rsync: Seedlink and MQTT, and that the consensus is that we favour seedlink and should start looking at that. Is that what other people think?
In terms of the duplication problem that Charles mentions, this wouldn't be an issue for a receiver at BGS, since the data would be put into the GIN's flat file database (which would resolve duplication) not into ASCII IAGA files.
Glad to hear BGS can handle duplication!
I would like to hear others opinions on all this; particularly Kyoto and Paris GINS.
I can't speak much for MQTT implementation since MQTT is the transmission protocol hence, as stated in earlier comments, missing components would be:
As for SeedLink option, and to summarize what is mentioned earlier, the simplest setup would be:
What I understood from all these discussions is that the question is not which protocol/technology is best suited to do this, but rather whether we already have another team that manages the servers for us. In this context I understand that last year you decided to go for the miniseed/seedlink solution. I also have the impression that you mainly want to use it to push data around between the GINs. When I read about parsing IAGA files and formatting them into miniseed packages, real time is no concern at all. So in this context we should not mention MQTT anymore.

But as Charles mentioned its inconveniences, I wanted to give my comments on those. I agree on one, probably the most important one: the message broker selection and, more importantly, the maintenance that goes with it. This I can't really comment on because it will not be my job to do that. In Dourbes I currently manage 4 lightweight servers and my messages are small binary one-second data messages, so the load is very limited.

From the client perspective I think that MQTT offers the broadest range of solutions, in all programming languages and even in the form of Linux command line tools. As for what you will use as a message, I agree that you need to standardise it, but as it supports binary and text we could do the same as we currently do over HTTP, or use miniseed packages if you prefer that. The advantage of MQTT comes from the client side and the possibility to implement it on very small transmitters with a certain level of guaranteed delivery. If my understanding of magpy is correct it has MQTT capabilities, so we could even look at this structure and use that as a standard.

In current literature I see more and more people making MQTT bridges towards what they call "legacy" seedlink servers. If these bridges are available you can leave the choice up to the GIN or client whether they want to use MQTT or seedlink.
I think on client perspective everybody can write a client for mqtt like http, it is just a question of standardizing the message format.
Thanks Stephan. I think I understand better the use case you are describing now - particularly focused on where you have remote installations which require low power data loggers and possibly have low bandwidth / lossy communications links. Of course the same system could be used in places where the facilities are better as well. It seems to me less likely that these observatories would be sending data direct to a GIN, but more likely to send data to their host institute for some form of processing before the data is forwarded to the GIN - do you agree?
I do agree that the model for the type of seedlink communications system we are talking about would firstly be for transfer between GINs, but I would like it also to be available for individual institutes if they are able to use it, so that they can decrease the latency associated with their data.
Stephan got it to the point. Just to add on that and for those who are interested: the MQTT transport system I am using at Conrad including clients, protocols and broker (mosquitto with or without authentication) is described here: https://github.com/geomagpy/MARTAS. The delivery protocol is flexible and can be defined in a library. Currently supported is JSON as used in many IOT applications and, similar to Dourbes, simple ascii or binary message lines which I prefer because of the small message size. Yet I am using that primarily for realtime transmission of individual sensors raw data. IAGA files are combinations and usually require some processing. I am currently using MQTT for such processed data sets only to feed our webservice.
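To give a feel for the message-size point above, here is a small sketch comparing a JSON payload with a fixed-layout binary payload for a single one-second sample. Both layouts are invented for illustration and are not the MARTAS or Dourbes formats: the field names, the 0.01 nT integer scaling and the little-endian record structure are all assumptions.

```python
import json
import struct

# Hypothetical binary layout: unsigned 32-bit epoch seconds followed by three
# signed 32-bit integers holding X, Y, Z in units of 0.01 nT (little-endian).
RECORD = struct.Struct("<Iiii")


def encode_json(epoch_seconds, x, y, z) -> bytes:
    """Self-describing text payload."""
    return json.dumps({"t": epoch_seconds, "x": x, "y": y, "z": z}).encode("utf-8")


def encode_binary(epoch_seconds, x, y, z) -> bytes:
    """Compact fixed-size payload; both ends must agree on the layout in advance."""
    return RECORD.pack(epoch_seconds, round(x * 100), round(y * 100), round(z * 100))


def decode_binary(message: bytes):
    t, x, y, z = RECORD.unpack(message)
    return t, x / 100, y / 100, z / 100


sample = (1594080000, 17000.12, -300.25, 51000.5)  # invented values
json_message = encode_json(*sample)
binary_message = encode_binary(*sample)
```

The binary record is always 16 bytes, while the JSON form is several times larger but needs no out-of-band agreement on field order, which is the standardisation trade-off raised in this thread.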
Simon, for the duplication issue mentioned by Charles, as I understand it, it would be a problem on the GIN's side?
Charles, if I understand your summary correctly, for the SeedLink option I will need to install on the Gin :
That's all ?
Dear All (and Charles),
I would like to hear others opinions on all this; particularly Kyoto and Paris GINS.
Kyoto has no preference to either MQTT or seedlink.
If the latter is our decision, will try to implement it here.
However, one comment that I can make here is that to change the way of data exchange costs us considerably.
A follow-up question here is which is more secure and long-lasting.
If the new method is secure enough and sustainable over time, then it is worth adopting.
H
On duplication, if our first intended use of seedlink is to replace rsync for data transfer between GINs, then the transmitting GINs (GOL, KYO, OTT, PAR) would not be concerned about duplication, only the receiving GIN (EDI).
In terms of what software a GIN would need, I think your list is correct Virginie (provided all the data you need to forward is in IAGA-2002 format), but I'm not certain as I haven't used this software yet.
Simon, Thank you for your answer. Current situation is :
So if I understand the evolution of the system correctly (change from rsync to SeedLink), IMOs will continue to send variation data (IAGA-2002 format) to Paris' GIN via curl. And Paris' GIN will :
Yes, that's my understanding too.
If there's general agreement about this, I'll add investigation of miniseed at BGS to my list of tasks to work on. The priority over the next few months will be to complete the work of moving the data archive from NRCan to BGS (for which we'll need GINs to send their rsync data to BGS) before starting to look at seedlink.
Just one small question, as it was stated in the beginning that rsync has security issues. Can someone explain to me how the seedlink protocol is more secure? I looked at the obspy code to find client implementations but found only parameter settings: server, port and timeout, nothing about security settings.
SeedLink has no communication encryption (SSL) but that isn't required in the vast majority of our seismic operations since the information is considered unclassified. Its protocol is limited and does not expose any sensitive part of the system. It only accepts miniSeed format data. If the information is considered classified, then VPN tunnels are used between receiver and sender.
Access control is done at the server using IP white listing through the configuration but Canada supplements that with application firewall (firewalld) and hardware firewall.
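The IP white-listing idea above reduces to checking a client address against a list of permitted networks. The sketch below shows that logic with the Python standard library. It is generic and is not ringserver or firewalld configuration syntax. The networks listed are reserved documentation ranges used as placeholders.

```python
import ipaddress

# Placeholder allow list (documentation address ranges, not real institutes).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # e.g. a partner institute's subnet
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single sending host
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside any permitted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)
```

An allow list controls who may connect but does not encrypt the traffic, which is why the comment above mentions VPN tunnels for data considered classified.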
I am aware that MQTT does have SSL and user access control.
From Canada's perspective, the support extends to a wider community. MQTT would require design/maintenance/support within this community. Much like Kyoto, "change the way of data exchange costs us considerably". Geomagnetism in Canada is also limited in resources, much like other institutes, but by tapping into our seismic operations, it extends them considerably. Even our geodetic operations, through UNAVCO, are merging with the seismic community. It's now becoming increasingly simple since seismic companies, like Nanometrics, are now adding support for GNSS operations.
For client libraries, there may even be simpler ways. We can always ask other resources, like IRIS, for additional ideas including ways to tap in to some of their nice tools like ISPAQ for metric information. https://ds.iris.edu/ds/nodes/dmc/software/downloads/ispaq/
Talking about security, we do use lot of ftp transfer for our data :
FTP is not known to be very secure. Is it an issue, and is there an easy way to change that? Personally I plan to stop the BCMT FTP and just provide our data over HTTP. But I am not sure that it will be easy on the user side (I get plenty of questions on how wget works). What do you think?
Open discussion on the future of data transfer protocol between GINs and the archive and, potentially, the institutes to the GINs.
RSYNC is not considered a secure protocol and even securing it using SSH tunnels can pose a security threat to the institute. Allowing port 22 (SSH), even if secured in a DMZ or any other mechanism, is a risk to the infrastructure.
Is using message protocol an option? Possible options are:
SEEDlink is used in Canada, USGS, and GFZ Potsdam and is standard in the seismic community with commercial tools supporting it.
MQTT is used in Belgium and Vienna with customized tools.
In all cases, metadata is not transferred.
Should we use a scenario for real-time and another for archive method that has no real-time requirement?