LSSTDESC / ComputingInfrastructure

Gathering place for CI - Computing and Infrastructure - issues

Hosting medium-sized DESC data products #53

Closed drphilmarshall closed 3 years ago

drphilmarshall commented 6 years ago

@jchiang87 @salmanhabib @katrinheitmann

We need some way of hosting DESC Data Products in the form of medium-sized files on the web for public download. Examples are OpSim output databases, test data challenge image sets and catalogs, and so on. This issue just came up in DC2, where @rbiswas4 has made a modified OpSim db that we'd like to release publicly - he checked in with the pub board to find out what the rules are (they're working on them!) and whether they had suggestions for where this particular DESC data product could be hosted. Full challenge data releases will need a different kind of solution.

Tom G pointed me at the NERSC "Science Gateways" help pages (http://www.nersc.gov/users/data-analytics/science-gateways/). This looks useful - Dustin Lang is using it to serve WISE data, for example (http://unwise.me/).

In the pub board, Seth thought of Zenodo, which is where our GitHub software releases end up if a DOI is requested. Zenodo can host files up to 50 GB in size, and provides curate-able "communities" that look useful. I made an LSST-DESC Zenodo community (https://zenodo.org/communities/lsst-desc) for us to see what it looks like.

There are probably other good solutions available too. What do you think? I've asked the Pub Board for advice too, from the publication policy point of view - but I figured you'd know what would be best from a technical standpoint.

CC: @cwwalter @TomGlanzman @heather999 @sethdigel @richardxdubois

ghost commented 6 years ago

I see there is a DM Zenodo community (https://zenodo.org/communities/lsst-dm/?page=1&size=20), but they seem to only use it for reports and documentation.

katrinheitmann commented 6 years ago

There is indeed another solution -- Petrel, a data sharing platform at Argonne that is connected to Globus Online. Tom Uram and I started playing around with it during the last few weeks. They gave us 100 TB to start with and we started populating it with some simulations. They are keen on having LSST DESC involved as well.

Petrel itself is simply a Globus interface, but for the simulations Tom (in more or less one afternoon) also generated a web gateway.

The nice thing is that you have the ease and speed of Globus attached to it. We would also have quite a bit of storage for free, with no restriction on file sizes.

I attached a couple of slides from the talk I gave in Berkeley. They don't show too much, mostly screenshots, but I could explain more if there is interest.

The NERSC Science Gateways make a lot of sense as well.

bccp_jan2018_cut.pdf

sethdigel commented 6 years ago

The Publication Policy does not have any specifics about how data releases are made. It does define data releases as always being associated with a journal paper: "material produced within the DESC and made available to the community in machine-readable form after submission of the corresponding data release paper (and no later than its publication). This includes processed (or reprocessed) data and results of numerical simulations, along with their documentation. The refereed journal papers that document DESC data releases are Key Papers." The likelihood is fairly high that we will want to release a data product that is not actually associated with a journal paper. For these instances, using a service like Zenodo has some appeal because the data product and documentation can be associated with each other, and can be cited through the DOI that Zenodo provides. I suppose an alternative could be to put the data product someplace accessible (maybe a Science Gateway page) and post the associated documentation on the arXiv. Then of course we'd be on the hook to maintain the public Science Gateway.

heather999 commented 3 years ago

We took a first swing at this for the DC2 public release: https://lsstdesc-portal.nersc.gov/