Open petersilva opened 2 years ago
I like the NOAA BDP idea: broker to cloud
fwiw... regarding a BDP ( https://www.noaa.gov/information-technology/big-data ) like plan, we could be the one link on the way to the cloud, as they show one sender in the NOAA deck, but I think the intent in ECCC is to write directly to the cloud. Since all clients are taken care of with one write, we are not needed and would not be involved. The disadvantage for them is the need to use different APIs for different cloud providers: file i/o is fairly standardized around S3, but notifications are a vendor-specific mess. It also involves a major, major culture change. Cloud is very fashionable and will certainly be here to stay, but it is also at the peak of a hype cycle, and a bit faddish. Support cloud: sure. But only cloud? Perhaps a bit risky.
To implement something like BDP, the client thinks they don't need us (and maybe they don't!) so it would be up to us to add value by adding cloud APIs to sarra so that they don't have to make changes, and we can push to the various clouds on their behalf. Even if we do that, the client could use sarra directly without an intermediate data pump (saving charges for intermediate storage, networking, and compute), so the burden is on us to communicate our value in that realm.
It's also different for Americans, since the cloud companies... are American. ECMWF is doing a similar thing, using European cloud providers... nobody is doing a BDP-like thing relying exclusively on foreign providers. It will be interesting to see how this develops, as various countries have data in different cloud providers, and who wants to pay for i/o transfers between cloud providers when someone wants to do something like TIGGE (which would involve transfers between European and American providers)? This is fine for big players, but Canada doesn't have any such vendors that I know of, and we lack scale, so we would be handing all data dissemination to foreigners. Not an absolute no... but a consideration for sure.
And the question is: is "dissemination" different for outsiders vs. internal users? ECCC thinks it is. We have tried to make it the same for 20 years, and today the tech for internal and external distribution is identical. If you rely on cloud services to send data to outsiders... what happens for internal users? Do you implement a completely separate system just for internal distribution? Do you have the forecasters directly access cloud-implemented services, so that nothing is in-house? That's the extreme it leads to, and futuristic, but it might be feasible and even desirable, but maybe not.
What we can do in the meantime is figure out what we have to offer, if anything. We can do an intermediate thing by implementing our normal data mart using cloud provisioning, and adding S3 to it. That is something that can be deployed in the cloud, on our own hardware, or anywhere in between. If that's not useful to the client, then we may have no role to fill on this file, and should find something else to do ;-)
ceph is a great tool if you want to run your own S3. I set it up for my company.
Connecting with SSC resources to be able to "test drive" various datamart options. Example options:
1) Conservative: just make a VM and give it an S3 bucket to use as a file system; reproduce the on-prem datamart with minimal transformation.
2) AddS3: add ceph on top of 1).
3) UseS3: modify the base application to use S3 for storage.
4) Containerize: https://github.com/wmo-im/wis2node uses sarra with a separate web server and broker, fully containerized. Could probably use that deployment literally.
5) ContainerizeS3: modify 4) to use S3.
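To make the "UseS3" option concrete, here is a minimal sketch of storing a product directly in S3, keyed by its canonical datamart path. This is an illustration only: the bucket name, endpoint URL, and function names are hypothetical, assuming boto3 as the client library.

```python
# Sketch of the "UseS3" option: store a datamart product in S3 rather than
# on a POSIX file system. The bucket and endpoint names are hypothetical.

def to_object_key(datamart_path: str) -> str:
    """Map a canonical datamart path to an S3 object key (no leading '/')."""
    return datamart_path.lstrip("/")

def publish_product(local_path: str, datamart_path: str,
                    bucket: str = "datamart",
                    endpoint: str = "https://s3.example.ca") -> str:
    """Upload a file, keyed by its canonical datamart path, and return its URL."""
    import boto3  # deferred so the pure helper above works without boto3 installed
    s3 = boto3.client("s3", endpoint_url=endpoint)
    key = to_object_key(datamart_path)
    s3.upload_file(local_path, bucket, key)
    return f"{endpoint}/{bucket}/{key}"
```

The point of keeping the object key identical to the datamart path is that existing clients' URL expectations survive the storage change unchanged.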
Use cases:
On internal systems, we have had more and more trouble getting a single file system that can support all datasets... we are failing at that. There are many reports of the unified file system falling behind, and lots of cases of operating in degraded mode. In contrast to systems providing a single view, our ddsr-oriented data pumps with N nodes, used for operations, have a sharded view, where each transfer engine in the cluster supports 1/N of the data being transferred. This sacrifices the unified view, however.
As an example of restoring a unified view while sharding the data itself, we could have a "browsing" server on the side that mounts the file systems from all transfer nodes (in read-only mode) and uses overlay-fs to provide a unified view. It is unclear, though, whether the performance of such a solution would be useful at all for people who want local access (3.).
This is the preferred simplest use case, and is universal on the public access side.
The above is how many legacy clients have received data for the past 20 years.
This is an HPC request that has been outstanding for several years. We are having trouble getting a sufficiently performant underlying implementation to be worth deploying.
There are a number of appliances (e.g. Isilon) that provide S3, free implementations (ceph), and cloud services from all the vendors... we can align to use any or all of these.
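The portability claim is that the same client code can target an Isilon appliance, an on-prem ceph cluster, or a public cloud just by changing the endpoint. A minimal sketch, assuming boto3 and with entirely hypothetical endpoint URLs:

```python
# The same S3 client code can point at an appliance, ceph, or a public
# cloud by swapping the endpoint. All endpoint URLs here are hypothetical.

ENDPOINTS = {
    "ceph-onprem": "https://ceph.example.internal",
    "isilon": "https://isilon.example.internal",
    "aws": None,  # None -> boto3's default endpoint (Amazon S3 itself)
}

def client_kwargs(target: str) -> dict:
    """Build the extra keyword arguments for boto3.client('s3', ...)."""
    ep = ENDPOINTS[target]
    return {"endpoint_url": ep} if ep else {}

# usage (not run here):
#   import boto3
#   s3 = boto3.client("s3", **client_kwargs("ceph-onprem"))
```

Credentials and region handling are omitted; the point is only that the application logic stays identical across all the backends listed above.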
scope exclusions:
this portal is supposed to require/use minimal metadata. There are other projects to do comprehensive metadata management as a separate function, and what we ask of such metadata systems is to define canonical paths.
The lack of metadata is a means of making the solution more general. This is a data feed/retrieval only, not an all-singing, all-dancing solution. Ideally the data formats themselves will have some kind of key that can be looked up in a metadata management system... ideally, the canonical path is that key, so that nothing else is needed.
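The "canonical path is the key" idea can be sketched as a separate metadata service keyed purely by path prefix, so the feed itself carries nothing extra. The records and dataset names below are illustrative placeholders, not a real catalogue.

```python
# Sketch: a metadata catalogue keyed by canonical path prefix. The feed
# carries only paths; anyone who wants metadata looks the path up here.
# The entries are hypothetical examples.

METADATA = {
    "bulletins/alphanumeric": {"domain": "synoptic", "format": "WMO text"},
    "model_gem_global": {"domain": "NWP", "format": "GRIB2"},
}

def lookup(canonical_path: str) -> dict:
    """Return metadata for the longest matching prefix of the path."""
    parts = canonical_path.strip("/").split("/")
    for i in range(len(parts), 0, -1):
        key = "/".join(parts[:i])
        if key in METADATA:
            return METADATA[key]
    return {}
```

Because the lookup is by prefix, one catalogue entry covers a whole dataset subtree, which is exactly the "nothing else is needed" property: the feed stays metadata-free and domain-agnostic.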
Every scientific domain is super enthusiastic about having metadata, and there are many architectures for managing it. But it is domain specific; we are trying to be domain agnostic here.
This is for the Environment and Climate Change Canada (ECCC) Service offering. @junhu3 @habilinour @ericvong we will be offered h/w for EDCM... It would be good to have a unified plan for the next-gen Datamart...