move forcing data to zenodo or ESGF?

CICE-Consortium / CICE

Development repository for the CICE sea-ice model

Other

57 stars 131 forks source link

move forcing data to zenodo or ESGF? #348

Closed eclare108213 closed 4 years ago

eclare108213 commented 5 years ago

zenodo allows data sets up to 50 GB and would provide a DOI for each one. Let's investigate moving our forcing data from NCAR ftp to zenodo, or alternatively, the Earth System Grid Federation. We'll need to consider how it is logically broken up and listed, and how the travis scripts would retrieve it.

eclare108213 commented 4 years ago

@dabail10 writes:

Here is everything on the ftp site in KB. The zero size ones are links, so not a problem. The biggest issue is the "all" files, but the addition of the JR55 forcing for gx1 are also big. I would like to delete the 2017 and 2018 files.

0 CICE_data_all.tar.gz 0 CICE_data_forcing_gx1_2005.tar.gz 0 CICE_data_forcing_gx1_2006.tar.gz 0 CICE_data_forcing_gx1_2007.tar.gz 0 CICE_data_forcing_gx1_2008.tar.gz 0 CICE_data_forcing_gx1_2009.tar.gz 0 CICE_data_forcing_gx1_MONTHLY.tar.gz 0 CICE_data_forcing_gx3_all.tar.gz 0 CICE_data_ic_grid.tar.gz 0 CICE_data_new.tar.gz 476 Icepack_data.tar.gz 12412 documents 18224 CICE_data_new-20181107.tar.gz 25136 CICE_data_ic_grid-20181031.tar.gz 32116 CICE_data_forcing_gx1_MONTHLY-20180920.tar.gz 42836 cice_v5.1.2.tar.gz 52704 gx1_50yr_scrip.tar.gz 437868 CICE_data_forcing_gx3_all-20171019.tar.gz 3720128 CICE_data_forcing_gx1_2008-20190821.tar.gz 3720132 CICE_data_forcing_gx1_2008-20171019.tar.gz 3720356 CICE_data_forcing_gx1_2006-20190821.tar.gz 3720360 CICE_data_forcing_gx1_2006-20171019.tar.gz 3721408 CICE_data_forcing_gx1_2009-20190821.tar.gz 3721412 CICE_data_forcing_gx1_2009-20171019.tar.gz 3721952 CICE_data_forcing_gx1_2007-20190821.tar.gz 3721956 CICE_data_forcing_gx1_2007-20171019.tar.gz 3722564 CICE_data_forcing_gx1_2005-20171019.tar.gz 3722568 CICE_data_forcing_gx1_2005-20190821.tar.gz 11489436 CICE_data_forcing_gx1_2005-20190918.tar.gz 11491404 CICE_data_forcing_gx1_2006-20190918.tar.gz 11499100 CICE_data_forcing_gx1_2009-20190918.tar.gz 11503128 CICE_data_forcing_gx1_2007-20190918.tar.gz 11521128 CICE_data_forcing_gx1_2008-20190918.tar.gz 19101520 CICE_data_all-20181107.tar.gz 38897780 CICE_data_new-20190821.tar.gz 57999328 CICE_data_all-20190821.tar.gz

eclare108213 commented 4 years ago

I think this is a good argument for moving at least some of the data to zenodo, as an archive. We could use it (and the DOIs) as a way to keep track of older data sets, although it would take a little work to document which versions of the code the various data sets worked with. I'm not sure if all of the forcing and related code are backwards compatible before v6.1, but the latest set introduces new fields, and so it would be good to somehow link data files with code versions.

If the 'all' files are getting too big, let's think about what the most intuitive/useful/easy-to-maintain way to break them up would be. We can think in terms of time (e.g. years) or testing (e.g. particular grids) or other ideas.

phil-blain commented 4 years ago

Another technical alternative that we should maybe consider would be to use Git LFS, a git extension that permits linking to very big files directly in the repo (the files are stored in a separate LFS repo and the CICE repo would store links to specific versions of these files). This would ease the process of keeping track of which version of the code is guaranteed to work with each version of the input data.

However Git LFS hosting is limited to 2GB on GitHub so another alternative would be needed. I don't know if DOE maybe has such a publicly accessible service ?

dabail10 commented 4 years ago

The problem is that the "all" file is now larger than 50GB compressed. This is because it includes the JRA55 forcing for the gx1 grid. Removing the CORE forcing will save about 20GB. Our disk for this forcing is about 1TB and the CICE Consortium are not the only ones using this area. We have around 200GB here and there is only 100GB remaining on the whole disk.

duvivier commented 4 years ago

@eclare108213 @apcraig @dabail10 @phil-blain Revisiting this topic since the JRAA gx3 additional files. The "all" file is now too large (~75GB) to be easily manageable.

Below is a full accounting of the files we have in their tar+zipped sizes. Icepack_forcing_data (480K) CFS (264K) ISPOL (57K) NICE (37K) SHEBA (116K) CICE_gx1_forcing_data (55GB) grid (2.8MB) ic (19MB) COREII (18GB) WOA (17MB) JRA55 (38GB) --> I'd like to rename these files to have gx1 in each name CICE_gx3_forcing_data (4.1GB) grid (476K) ic (2.2 MB) NCAR_bulk (428MB) WW3 (2.9MB) --> this currently is not in a WW3 directory, but I'd like to file it more formally. JRA55 (3.7GB) CICE_tx1_forcing_data (2.8MB) grid (2.8MB)

My proposal would be to publish these datasets as Zenodo datasets in our consortium community and each new dataset would include each of the individual tar+zipped files listed under each of the headings above. We wouldn't keep a "new" or an "all" file anymore, just update the files that are relevant or new and re-publish a zenodo dataset. Then when we update some files we just update that particular tar+zip file within the different datasets, and zenodo will help us keep track of the individual data as they change over time.

However, there are two possible hitches with this plan.

@apcraig do you think there would be any issues with downloading the data from zenodo for the automatic tests we do?
The gx1 forcing data are currently 55GB, so more than zenodo will publish for a single dataset. This means we'd have to break them up into each individual forcing dataset (i.e. a published JRA dataset and a separate published CORE dataset, etc. etc). I'm concerned this would be more confusing for users if they have to download a lot of individual datasets that aren't all part of a single published dataset. So I'm not sure how to manage that.

I also read about the Git LFS that @phil-blain suggested. I'm curious what others think at this point about a best way forward.

eclare108213 commented 4 years ago

I agree the 'all' files are too big. I've been trying to download them to my laptop, and can't! We can investigate @phil-blain 's suggestion for Git LFS, but we'll still need a place to host the data itself. That could be the earth system grid, but I'm not sure that's the best option (maintenance requirements?).

We could keep only the current data on the ftp site (i.e. what's needed for our test suites) and use zenodo as a data archive. On the other hand, it would be good to have a DOI for the current data too, which would mean that some of the data is in both places unless travis can grab its data from zenodo.

I propose we drop the gx1 COREII data from regular use, and we can also consider fully switching to JRA-55 for gx3 tests. We are in a transition period, moving to JRA-55 and away from the other data sets -- this is a good time to reorganize the data. Separating the major data sources and archiving older stuff makes sense to me. To complete the transition, I'd like to see a full comparison of the output for gx1 in our 'production' model configuration (or a close approximation) and for gx3 from the entire base suite. That's a lot of plots, so maybe just produce timeseries plots for runs at least 1 year long and maps at the end of the runs, comparing JRA-55 results with COREII or NCAR bulk forcing.

Keep in mind that the Consortium only needs the forcing data for testing, and we aren't supporting stand-alone configurations for scientific use. E.g., how much tx1 data do we need, just to make sure the grid configuration is working correctly? (maybe 2.8 MB is the right amount.)

apcraig commented 4 years ago

Even if we stop testing data like COREII, I'm not sure it's a good idea to get rid of it. The input data is needed for older versions of CICE. If we decide to drop some input data, I think, in practice, we should wait a year or two (at least) before removing it from our input data area. Also, renaming data is a little problematic, again due to backwards compatibility. Maybe I am too strict on this point, but if we rename, we can also end up with multiple copies of the same files in input data spaces. If we start doing any of this, maybe we need to have inputdata tar files on a per version basis. And I'm not sure I love that either.

I'll spend a few minutes this morning looking into a few details (like if we can easily/quickly download data from zenodo -- especially for TravisCI and whether there are some other places/methods we could hold data. Could we get space at NCAR outside the CGD ftp site. Maybe on some public disk space in CISL where data could be accessed via rsync, wget, and/or other methods? I would actually advocate having two methods for downloading. One would be multiple tar files. The other would be an rsync approach. The rsync approach makes incremental updates much easier. Tar files are good for initial downloads and easier ability to break the data up a bit (although rsync can work too). I would also consider generating tar files for different situations. For instance, we can have one tar file that contains most of the list above. All datasets under a GB or two should be aggregated together. The larger datasets can be tarred up individually. If we break up the input data, we'll want to have some documentation that maps the input data needed for the different configurations.

eclare108213 commented 4 years ago

If we are going to refactor and/or rename datasets, let's do it before the next release, since we're adding a JRA-55 gx3 capability. The code is ready to go, so getting the data to go with it in order is a high priority. We can change how we handle the data in separate steps:

identify what data sets we want to maintain long-term and which will be archived in case someone wants to use them with an older version of the code
refactor, rename, recombine, etc, but keep it all on ftp for now
evaluate other locations for storing/serving the data
move the data

For the release, let's do items 1 and 2, then work on 3+ after the release. I'm not too concerned about backwards compatibility of the data downloads, as long as we can reproduce the data file structure and contents when needed.

Thoughts?

duvivier commented 4 years ago

I realized that I hadn't formatted that so it was easy to see the divisions I proposed for each Zenodo dataset release. I've re-done it with the more clear groupings.

Icepack_forcing_data (480K)

CFS (264K)
ISPOL (57K)
NICE (37K)
SHEBA (116K)

CICE_gx1_forcing_data (55GB)

grid (2.8MB)
ic (19MB)
COREII (18GB)
WOA (17MB)
JRA55 (38GB) --> I'd like to rename these files to have gx1 in each name

CICE_gx3_forcing_data (4.1GB)

grid (476K)
ic (2.2 MB)
NCAR_bulk (428MB)
WW3 (2.9MB) --> this currently is not in a WW3 directory, but I'd like to file it more formally.
JRA55 (3.7GB)

CICE_tx1_forcing_data (2.8MB)

grid (2.8MB)

A few additional comments -

I think if we publish data to Zenodo we'd want to have the same structure for each of the grids. That is - a file for the grid, a file for initial conditions, a file for all forcing files from a particular origin (e.g. WOA, WW3, etc).
I like the option of breaking up the files more so that we'd be able to update just the files needed and people wouldn't need to update a whole huge zip file with new data.
If we move to just providing JRA55 forcing data for testing and remove the COREII forcing, that would make the whole gx1 group of files small enough to publish through zenodo.
My preference is still for Zenodo since this is the forum we're already using and have the Consortium Community for model releases. If we want DOI info for things we produce I think we should do that all in one place.
I haven't looked deeply into ESGF, but because of above I don't think this is our best option.
I did look more into Github LFS. I am still a bit unclear how it works, but I think the large files are stored on a separate Github LFS server. If we used this that part of the repository would just be stored on that server with their hooks. However, I'm not sure it is right for us. I tried to find documentation about file size limits and didn't find a hard cap, but they do say "Version large files—even those as large as a couple GB in size—with Git". That is small for us. I don't think 50GB is right if they have a limit of, say 5GB. It seemed like it might be more designed for graphic design type files.
If we do decide to go to Zenodo for publishing datasets, I think we should have everything done and ready for not using COREII any more. Are we thinking we'd take out all COREII functionality entirely from CICE? Do we want to do the same with NCAR bulk?
For now it's fine to keep hosting data at the NCAR ftp. Since we're going to stop using the "all" file I've cleaned it up and moved versions of tarballs that are a few iterations old elsewhere. I could test by putting each of these files we want to post on zenodo there and change the wiki to point to each file and group of tarballs by grid. Test the system on NCAR for a bit before a zenodo move. Ultimately I think we want to store all the data one place (NCAR or Zenodo) not host both publicly. As with the links in wikis - having too much repeat just makes it more likely we'll miss something during updates.
I think we still need to figure out if Zenodo is a place our tests can automatically pull from since this would be a big hurdle if it is not. @apcraig or @phil-blain would be helpful for figuring this out.

eclare108213 commented 4 years ago

Ah, thank you @duvivier, that helps. Is each bullet in your data file list a separate file?

apcraig commented 4 years ago

I had a look today too. My sense of git lsf was the same as @duvivier, it's really just a way to point to data stored elsewhere. My worry with zenodo (or maybe it's a plus) is that each time we change the input data, we have to publish a new doi, the zenodo link to get the latest data changes, and so forth. That could create a bunch of confusion in terms of managing the input data. And wouldn't that create multiple copies of the same files over and over? Maybe I'm wrong about how that could work. I don't see how we could publish data on zenodo then continue to append to the same dataset.

eclare108213 commented 4 years ago

I would deprecate the COREII and NCAR_bulk forcing from both the 'active' data sets and from the code (since we wouldn't be testing it). I don't think we're quite ready to do that, unless someone has done (or can quickly do) the test suite and other runs to compare JRA-55 with the old data. I like the idea of publishing the data through zenodo, so that past versions would have their own DOI and could be accessed easily. It's true that in the future, each new zenodo record would contain many of the same files. But that's the model we're using now for the ftp, isn't it? Each time we make a change, we rename the file and keep the old one around?

apcraig commented 4 years ago

As I understand it, what we're doing now on ftp is adding new datasets to the tarball but never removing old datasets. We were keeping old tarballs around but it wasn't really necessary to do so. Again, I like the idea that we have input data files and we never remove any, just add to them forever. That way, everything is fully backwards compatible and we don't have to keep track of which datasets are needed for which versions of the model. At some point, we can remove older datasets, but the idea of moving things around and rapidly replacing datasets scares me in terms of maintenance.

eclare108213 commented 4 years ago

As long as we keep the older datasets in some sort of storage (not necessarily publicly accessible), we should be able to remove those files from active use. If we move data to zenodo, then I think that we could get DOIs for the NCAR data and for the COREII data (if they aren't too big), for consistency and backward compatibility. The important thing to maintain, in my opinion, is the forcing data directory structure. I'd rather not change the names of the individual forcing, grid and initial condition files if we can help it, but I'm not opposed to changing the names of the tarred/zipped files that are downloaded, as long as they uncompress into the correct directory structure. However, since we're adding JRA-55 gx3 data, now would be the time to rename the JRA-55 gx1 data, if that's needed.

duvivier commented 4 years ago

@apcraig @eclare108213 I'll try to address a lot of the comments/questions in the above thread here. It's evolving quickly so hopefully I've covered the main points here.

The current NCAR FTP method is that we add files to the tarballs but never delete files. We also haven't been deleting old tarballs, but I don't know that we need to keep those around forever if they just have the same files in them. In any case, the single "all" tarball with all the current files is unmanageably large and if we keep all those around it will be unsustainable in our designated FTP allocated space. To address these concerns I think we need to figure out a way to break up the data that will be the organization/structure for either the NCAR FTP or Zenodo.
@eclare108213 the bullets in my message above do correspond to individual tar/zipped files. I have been trying to figure out how to organize these in a reasonable way. @apcraig has advocated that we tar datasets that are less than about a GB. The grid and i.c. files are small, so we could group all those a single tar file. But we might also consider that we want to group by grid with all files needed to run a particular grid (which is how I broke it down above). On Zenodo we could then publish these files as "CICE all grid and initial condition files" and then the other groups could be following this pattern "CICE gx1 JRA55 forcing", "CICE gx1 COREII forcing", etc. with just the larger forcing files, which are the real problem we're dealing with in terms of ballooning size.
@eclare108213 proposed 4 steps to deal with data. I agree that we can do steps 1 and 2 now before the release and steps 3 and 4 after a release. My thoughts on these:

I think we have decided that the two to deprecate are COREII for gx1 and NCAR_bulk for gx3. We'd keep everything else. Before we remove these I think we should post online in the same area where we list the features we are deprecating from the model.
I know @apcraig doesn't like this, but the names of the gx1 JRA55 forcing weren't an issue until we are adding gx3 JRA55 forcing. And in the not too distant future we may be adding tx1 JRA55 forcing. In these cases since we have a bunch of JRA55 forcing datasets, I think just for clarity we'd want to name the flies with the grid they go with. The second issue is the WW3 forcing file, moving (or linking?) this into a new directory for WW3 forcing. I don't have any other major issues at this time with the names or organization. But I do think this emphasizes that there should be some accepted naming or organizing standards for new files we add.
From what I've seen I'd prefer Zenodo as our "archive" of where data could live. @apcraig is correct that we'd just have multiple copies of the data with their own DOIs in the Consortium Community in a very similar way that we do with the code releases. See here for an example of a data publication on zenodo (https://zenodo.org/record/3600232#.Xl6nB5NKjUL). Note as you scroll down there are three versions but they're pretty easy to track. I see this as we'd only update when new data files are added. The Zenodo FAQ (https://help.zenodo.org/) indicates that for a single dataset the limit is 50GB but we have unlimited separate datasets within our community. In this case we wouldn't publish 50GB every time, just the files or groups of files that need updating. So we can publish COREII and NCAR_bulk even if we never update them again and we can even let the code move on to eliminate them over time but the forcing files will live on here. Other forcing (e.g. JRA55) could evolve over time. I think we'd want to discuss how we'd publish the groupings (see above for "grid and ic", "gx1 JRA55", "gx3 JRA55", etc about possible divisions).

I know that's a lot. But I think we might be converging?

phil-blain commented 4 years ago

I don't have a strong opinion on the organization of the forcing or the archival strategy.

I think considering all of the above Zenodo seems to be the right solution. I tried

wget https://zenodo.org/record/3608230/files/LAKE2.0_atmospheric_forcing_and_setup.zip

and it downloaded without problem so I don't think it's going to be a problem for the CI. It's just a matter of scripting it correctly. We could have a file at the base of the repo that records the up-to-date urls to Zenodo. This file would get updated when we publish new datasets, and we could write a small script that reads this file and download and unpacks the current forcing files for use with Travis. Hopefully this will even address the recurring problem of the UCAR ftp failing, causing the Travis builds to fail.

apcraig commented 4 years ago

I understand how creating multiple versions of tar files (in time) and putting them on zenodo helps create tar files, but I don't see how it helps the community. Lets think about use cases

New user downloads some version of CICE (not necessarily the latest). How do we tell them which input data files to download? The new tar files are, by definition, NOT backwards compatible.
Current user downloads a new version of CICE. Do they have to delete their old input data directory, create a new one, or just tar on top? If we rename datasets, do we expect them to keep multiple versions of the same datasets? How do they know what to clean up and when? Do they have to download the complete large files when just an incremental update will be much faster and easier? Are we getting rid of the "new" files? Users will potentially have input data directories that are much larger than needed and the downloads will be much slower than necessary.

To me, moving in this direction where we do not support backwards compatibility and we do not guarantee filenames are static will significantly increase confusion as well as maintenance. If we do move forward this way, there are several things we have to do

Make sure travis can successfully download a smallish dataset in order to run.
Provide a history of tar files and versions, documenting changes and additions, maybe even providing guidance about what could/should be removed from the current input data space as the model version is changed. We need to formalize this process.
Make sure we can download all the new tar files and untar them in a clean inputdata directory and check the code runs fine.
Make sure we can download all the new tar files and untar them in a working input data directory and check the code runs fine.

Breaking the current big tar file into smaller ones is necessary. That's clear. Some of the other changes and the confusion and maintenance required to keep track and communicate information clearly is concerning to me.

What if we just create incremental tar files in time. We could even add a small file to each tar file that provides a history of the tar files that were downloaded. So, maybe we'd have the tar files outlined above as a starting point. Then we'd add a new tar file for the new JRA55_gx1 filenames. And maybe a tar file that updates the location of the ww3 fsd file. Each tar file would have a reasonable name and be date stamped. And everytime you untarred a file, you'd get a small file dropped into a "history" directory that is untarred with the tar. That way, it's easy enough for someone to look in the "history" directory and see which files they've downloaded up to now. Separately, maybe we should add a README that evolves in time and is added to each tar file. That would document things like "You can delete the JRA gx1 files in this directory if you are working with CICE > 6.2.3". And/or maybe we need a README in each version of CICE that documents exactly what tar files are needed to run with this version of CICE. If we want to move away toward more dynamics input data spaces, we need some clear way to document what's needed for different versions.

Finally, as @phil-blain points out above, you can see even keeping track of which file Travis needs to download becomes much more complicated. I want to make sure we have a solid long term plan with robust process in place to minimize our maintenance and also make sure the community is not confused.

duvivier commented 4 years ago

I understand how creating multiple versions of tar files (in time) and putting them on zenodo helps create tar files, but I don't see how it helps the community. Lets think about use cases

New user downloads some version of CICE (not necessarily the latest). How do we tell them which input data files to download? The new tar files are, by definition, NOT backwards compatible.

I think this would be important for the user to determine. If they are testing a particular grid (e.g. gx1) they should go get the latest gx1 files we have on zenodo.

Current user downloads a new version of CICE. Do they have to delete their old input data directory, create a new one, or just tar on top? If we rename datasets, do we expect them to keep multiple versions of the same datasets? How do they know what to clean up and when? Do they have to download the complete large files when just an incremental update will be much faster and easier? Are we getting rid of the "new" files? Users will potentially have input data directories that are much larger than needed and the downloads will be much slower than necessary.

I am not in general in favor of renaming datasets. I think this is an oversight that should have been noticed when JRA55 gx1 files were added, but I didn't and neither did anyone else. I just think that as we add more JRA55 datasets it could get confusing. If it's going to be a huge headache, then we don't have to rename them to include gx1.

The same type of oversight could have happened with the wavewatch file. If we leave it where it is and assume no new wavewatch data will be added then it isn't a big deal, but I am assuming that more data could be added some time.

I think that if we make a few big changes at once it will cause some initial headache, but we can try to widely publicize it to minimize some of that. I'd guess that within about 6 months many of the regular CICE users would have made the conversion. But there may be a little confusion at first for sure.

My vote, as shown with my data breakdowns above, is that we have individual files in a release so that we minimize how many new files there would be to download if we do update some input files. I'm not sure how this would cause big files with big input directories that are slow. I personally don't like the "new" files because we don't well-define what new means and it could be something different for each user. If we say it's all files <6mo old, then a regular user could have been on a development hiatus for 9 months and they would miss the "new" file release anyway. That's why I'd prefer to break it down by data group. That is "all gx1 JRA55", "all gx1 COREII", etc.

To me, moving in this direction where we do not support backwards compatibility and we do not guarantee filenames are static will significantly increase confusion as well as maintenance. If we do move forward this way, there are several things we have to do

Agreed, this isn't something I want happening often. I don't think we should make this a common practice. That's why I'm interest in a protocol now.

Make sure travis can successfully download a smallish dataset in order to run.

Provide a history of tar files and versions, documenting changes and additions, maybe even providing guidance about what could/should be removed from the current input data space as the model version is changed. We need to formalize this process.

Make sure we can download all the new tar files and untar them in a clean inputdata directory and check the code runs fine.

Make sure we can download all the new tar files and untar them in a working input data directory and check the code runs fine.

Breaking the current big tar file into smaller ones is necessary. That's clear. Some of the other changes and the confusion and maintenance required to keep track and communicate information clearly is concerning to me.

Agreed, it isn't trivial.

What if we just create incremental tar files in time. We could even add a small file to each tar file that provides a history of the tar files that were downloaded. So, maybe we'd have the tar files outlined above as a starting point. Then we'd add a new tar file for the new JRA55_gx1 filenames. And maybe a tar file that updates the location of the ww3 fsd file. Each tar file would have a reasonable name and be date stamped. And everytime you untarred a file, you'd get a small file dropped into a "history" directory that is untarred with the tar. That way, it's easy enough for someone to look in the "history" directory and see which files they've downloaded up to now. Separately, maybe we should add a README that evolves in time and is added to each tar file. That would document things like "You can delete the JRA gx1 files in this directory if you are working with CICE > 6.2.3". And/or maybe we need a README in each version of CICE that documents exactly what tar files are needed to run with this version of CICE. If we want to move away toward more dynamics input data spaces, we need some clear way to document what's needed for different versions.

I'm open to a new protocol. I'd put together a script and Readme etc. for putting together these files. But when I stepped back those weren't used and so some other inconsistencies (i.e. permissions) have arisen as a result. If we went with zenodo and also published our file for making the tar files we could more easily as a group maintain these repositories. It wouldn't need to be me or Dave making the changes to the NCAR server. I think this is a benefit. If I'm swamped or out of town and Dave is busy with X, then another consortium member could do the file additions/changes and the whole workflow wouldn't be waiting on us. If we can be sure everyone who creates the files uses the same script that would be ideal. I have written one we can start with, I'm sure you can write a more efficient script as well or make additions.

Finally, as @phil-blain points out above, you can see even keeping track of which file Travis needs to download becomes much more complicated. I want to make sure we have a solid long term plan with robust process in place to minimize our maintenance and also make sure the community is not confused.

Agreed, we should hash this out now rather than piecemeal changes that may not work well.

duvivier commented 4 years ago

@apcraig and I discussed this on the phone this morning. I think we have a path forward:

I'm going to work on putting together the groups of files we think we want to start hosting on Zenodo. We'll do a trial run hosting these on the NCAR FTP server with information on the Consortium wiki. Once I have these prototyped it would be good to get feedback from other Consortium members. I intend to have this done in the coming days so it will be live before the next release.
In general we agreed that in the long term we would like to release and host data at Zenodo for user's but once published there the data would be relatively static (i.e. not lots of new versions unless there are bugs found). We also want to implement some naming and organizing standards to be sure any new files that are added will be easily findable.
We agreed that older datasets (COREII, NCAR_bulk) could be published on zenodo and then not changed, but we can migrate the model infrastructure to using them less. One challenge will be appropriately documenting on the Wiki which versions of the model use which forcing files so users don't get so confused.

duvivier commented 4 years ago

@eclare108213 @apcraig and I discussed this by phone yesterday and I've made a first stab at items 1 and 2:

identify what data sets we want to maintain long-term and which will be archived in case someone wants to use them with an older version of the code

refactor, rename, recombine, etc, but keep it all on ftp for now

evaluate other locations for storing/serving the data

move the data

I've put together the following page with information. Eventually the links would be to the zenodo release pages and list the DOI, but for now they are just to the FTP site. I know more information needs to be added, but we an all adjust this page as we finalize details. https://github.com/CICE-Consortium/CICE/wiki/IN-DEVELOPMENT:-INPUT-DATA

I've grouped the data into individual tar files I think make the most sense to publish. Each of these data sets of files would be published to Zenodo. The datasets themselves would be static and we'd only release a new version if we found problems with them. We can publish the "older" datasets that we want to deprecate, but they'd still be available if people want to use them with older model versions, but we can shift to the newer published datasets with the code.

I think one thing that needs to be improved is information for users about which datasets they must download for testing and information about what model versions they can be used with. @apcraig will be very helpful in determining the exact information (hash, etc.) that we should provide.

A few more details: we are not re-naming the JRA55 gx1 data as I had suggested. However, I have moved the WW3 forcing to a new directory and there is a softlink from the file location in the directory to where it is currently located. This is something that we may want to eliminate over time, but the exact method for doing so is unclear at this point.

At this time I think it would be helpful for others to look over what I've put together and provide feedback about organization. I think we can basically prep everything from the FTP site and with the new information on the wiki, and then once we try publishing data on zenodo we can make the switch on the wiki page.

apcraig commented 4 years ago

I think the new wiki page looks great. I'm fine with the tar file organization and other changes in datasets. In terms of the table at the top of the page, what do you think will go into the wiki description links? Is it just going to point to the sections below?

I think we should add a new column in the table called notes. There we can add some info about if/when a dataset is added or deprecated in the implementation in terms of hashes, dates, and or release versions.

Are we thinking that everytime we add a new tar file to zenodo, we will add a line to the table? And we'll never remove any table line?

We might also want to add a further column in the table called something like "latest release". It could be blank if it's not relevant to the current release or outdated, but otherwise, we could have something like "for gx3" and things like that in that column. Or even just a "*" and "for gx3" could be in the notes column? We want to be able to highlight datasets that are relevant to the current release. Would you like me to prototype that a bit on the page?

duvivier commented 4 years ago

@apcraig, Thanks for the feedback.

I think the new wiki page looks great. I'm fine with the tar file organization and other changes in datasets. In terms of the table at the top of the page, what do you think will go into the wiki description links? Is it just going to point to the sections below? I wasn't sure about the wiki links. I think we need to provide a description, citation, whatever for each dataset. But perhaps that description is on zenodo? I do think some information we currently provide (i.e. variable names) isn't necessarily something we need to put on zenodo but would be useful for users. I'm open to other ideas.

I think we should add a new column in the table called notes. There we can add some info about if/when a dataset is added or deprecated in the implementation in terms of hashes, dates, and or release versions. I think this is fine to add a notes column. As per my comments above, maybe we should just have "notes" for each?

Are we thinking that everytime we add a new tar file to zenodo, we will add a line to the table? And we'll never remove any table line? Yes, this is what I was thinking. Only keep adding, never remove. But in the "notes" or "info" column we can add the information about when a dataset effectively becomes deprecated.

We might also want to add a further column in the table called something like "latest release". It could be blank if it's not relevant to the current release or outdated, but otherwise, we could have something like "for gx3" and things like that in that column. Or even just a "*" and "for gx3" could be in the notes column? We want to be able to highlight datasets that are relevant to the current release. Would you like me to prototype that a bit on the page? I'm open to this. I just don't know if it should be included in the above "notes" or "info" line. Maybe we should have a subsection for each of the datasets and then in that subsection we can note what release it is for, etc. but it isn't squished in a table.

Thoughts?

apcraig commented 4 years ago

I agree about making the table readable. I just made some changes to the table just to play a bit. I changed the size to Mb throughout and I added a new column on status. The idea here is *=active, + means added and gives the date+version, - means removed and gives the date+version, u means untested. We need a small key and can iterate a bit. Is this effective/useful? Happy to change things back including the size column, just trying to figure out what we can provide in the table and how to best report the info.

I agree the links should point somewhere to a complete description. The ww3 description should mention that it's used with the floe size distribution. I also agree the description should be on zenodo. Is it possible to update the description of a published dataset on zenodo? It would be nice to know we can add information later as we realize something is important or useful.

duvivier commented 4 years ago

@apcraig Thanks for looking this over. I have fixed the links in the description to now point at descriptions lower in the wiki. I'm not sure if we want to keep the full description in the wiki or zenodo... I like that the wiki is more easily update-able without the hassles of zenodo we've run into (e.g. adding authors, etc.). This is why a detailed description might be more appropriate here. This is something we should test or work with @eclare108213 to test since it's through zenodo logins.

I think some of these need some additional information and I believe @rallard77 is going to provide some information about how to create more years of gx1 JRA55 forcing (similar to what we have documented for COREII now).

Also, I think I understand the table key now for "status", but I don't find it immediately intuitive. I think we'd need a key. I tried to add this, but I'm not sure it's really better. I wonder if we should add the date added or code version in the information at the wiki link? Or is there another way we could indicate status? Instead of * maybe say in words "Current master" and then once some get deprecated say "Pre 6.1.x?" or something?

apcraig commented 4 years ago

I'm happy to have some different nomenclature for the status. That was just the first thing that came to mind, and I know it's not very good. I think we should continue to iterate until we have something better. I think it would also be good to have that information in the link portion, but a user doesn't want to have to walk thru all that description if they just want to know what's current/needed/obsolete/etc. We need to have a way to communicate that. Lets try some other options and see if we can get something that works better.

I'm open about the zenodo vs wiki. I think a requirement is that we can update the description after it's posted. If that's a problem on zenodo, we may need/want to have the wiki as well. I wonder if the links should be in separate pages. Or maybe just have the current input data on the current page and older datasets are relegated to an obsolete page but the links and descriptions still exist.

duvivier commented 4 years ago

Here is the wiki page where I'm testing the groupings of data and documentation about them. https://github.com/CICE-Consortium/CICE/wiki/IN-DEVELOPMENT:-INPUT-DATA#jra55-forcing-1

duvivier commented 4 years ago

@apcraig please look over the wiki page. I've split up the files as we discussed on the telecon last week and added some info or citations to the descriptions of each dataset. I also tried breaking up the table rather than having one big table, but I'm not sure this is better. The full table is copied at the end of the page so we can decide if we'd prefer that. Finally, I'm not sure all the links will work to the wiki descriptions yet because I've been shifting things around, but this is a final details type thing to fix.

If you like this layout then I think we can try the zenodo publication of some minor dataset. Elizabeth says there is a zenodo sandbox option. Are you willing to be the second "master" zenodo user and give this a try?

apcraig commented 4 years ago

@duvivier I think the updates look good. I personally like the single table at the bottom better EXCEPT I wish we had more control over how it looked (like making the header lines a different color from the other table lines), but I know we don't have that control. So I think both work fine. I'm happy to go with the split table. I do have a couple other thoughts looking at the table.

Why don't we combine grids and initial conditions into one tar file?

And for the compatible column, I would write CICE6.1.0+. I worry that without the plus, it might seem it's only OK for the named version. That leaves it open later for CICE6.1.0 - CICE7.2.3 in case something is later dropped. We could also think about combining the compatible and date into a single column. So "all" is good for most. But then we could have a format like

+CICE6.0.1 @ Sept, 2019 -CICE7.2.3 @ May, 2022

Finally, I'm happy to help with the zenodo stuff. I sort of wonder if it might be better for you to take on that role, especially in terms of NCAR managing the input data. But I'm more than willing to do it if it makes sense.

duvivier commented 4 years ago

@apcraig I thought we'd decided to keep IC and grid files separate so that we wouldn't have to update everything if just something incrementally changed. So, for example, when we add tx1 initial conditions we can just add a new dataset, we don't need a new version of the grid files though.

I re-combined the table. I agree the separated out one was not ideal, but I also wish we had more control of the table format. I'm pretty happy with the table and the info now.

@lettie-roach, can you check out my description of the WW3 forcing data on the CICE wiki and let me know what might need to be added there? https://github.com/CICE-Consortium/CICE/wiki/IN-DEVELOPMENT:-INPUT-DATA#ww3-forcing

One question for the data publications in general is that with zenodo I think you have to have an author. What do we do for the author for data that we've just modified (e.g. for JRA55?). It doesn't seem right that we're the author exactly, but I'm not sure what else would make sense.

@apcraig, @eclare108213 the main hesitation I have about me managing the zenodo aspect of data publication is that when my funding for the consortium liaison comes to an end in Sept I'm not sure what my role might be with this. In that case, I think it is good that we're moving away from NCAR solely hosting data since Dave will be stretched more thin. Since Elizabeth already manages zenodo things and you'll almost certainly still be the code-wizard, I guess I figured it would be best for continuity if one of you guys managed that. But I'm open to doing it now if you think that's best.

apcraig commented 4 years ago

Just to clarify, what I was proposing was to maybe create a gx3 grid+ic file, a gx1 grid+ic file, and a tx1 grid+ic file. My guess is that the grid and ic files will always be a case where we just continue to add new files. Since grid and ic are both relatively small, I think it's fine to combine them together for a given grid. But happy to have things as they are. Thanks.

lettie-roach commented 4 years ago

Hi Alice,

I would put something along these lines:

WW3 forcing:

Surface ocean wave forcing derived from the Wavewatch III (WW3) model. This is a single day of the wave spectral forcing that is necessary for the floe size distribution (FSD) within CICE. It is provided only for testing purposes and should not be used for publications. Users should produce their own spectral wave input.

Citation for WW3 model: WAVEWATCH III Development Group. (2016). User manual and system documentation of WAVEWATCH III, version 5.16. Tech. Note 329, NOAA/NWS/NCEP/MMAB, College Park, MD, USA.

The data was derived from the wave-ice coupled run in:

Roach, L. A., Bitz, C. M., Horvat, C., & Dean, S. M. (2019). Advances in modelling interactions between sea ice and ocean surface waves. Journal of Advances in Modeling Earth Systems, 2019MS001836. https://doi.org/10.1029/2019MS001836

On Mon, 16 Mar 2020 at 20:35, Alice DuVivier notifications@github.com wrote:

@apcraig https://github.com/apcraig I thought we'd decided to keep IC and grid files separate so that we wouldn't have to update everything if just something incrementally changed. So, for example, when we add tx1 initial conditions we can just add a new dataset, we don't need a new version of the grid files though.

I re-combined the table. I agree the separated out one was not ideal, but I also wish we had more control of the table format. I'm pretty happy with the table and the info now.

@lettie-roach https://github.com/lettie-roach, can you check out my description of the WW3 forcing data on the CICE wiki and let me know what might need to be added there? https://github.com/CICE-Consortium/CICE/wiki/IN-DEVELOPMENT:-INPUT-DATA#ww3-forcing

One question for the data publications in general is that with zenodo I think you have to have an author. What do we do for the author for data that we've just modified (e.g. for JRA55?). It doesn't seem right that we're the author exactly, but I'm not sure what else would make sense.

@apcraig https://github.com/apcraig, @eclare108213 https://github.com/eclare108213 the main hesitation I have about me managing the zenodo aspect of data publication is that when my funding for the consortium liaison comes to an end in Sept I'm not sure what my role might be with this. In that case, I think it is good that we're moving away from NCAR solely hosting data since Dave will be stretched more thin. Since Elizabeth already manages zenodo things and you'll almost certainly still be the code-wizard, I guess I figured it would be best for continuity if one of you guys managed that.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CICE-Consortium/CICE/issues/348#issuecomment-599860966, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGIYTCOFMJIW3Z4M2F7Q25DRH3VXZANCNFSM4IKWR6OA .

duvivier commented 4 years ago

@lettie-roach, Thanks! I've added that info now. :)

@apcraig, I think I'd prefer to keep the grid+ic files separate if possible. Is your concern just keeping track of multiple files to download? I agree that these are the files most likely to be changing over time (both grid or IC, though maybe more likely IC?), so perhaps you're feeling like just having one file that changes in time would be least confusing?

apcraig commented 4 years ago

@duvivier, Happy to keep the grid and ic files separate if you prefer that. My thought was having a single file may be a little easier, they are both pretty small, they will both probably be incrementally added to over time, so not much benefit to separate them. But it's not an important point. Thanks!

eclare108213 commented 4 years ago

my funding for the consortium liaison comes to an end in Sept

NOOOOOOOO! We need you! Have there been discussions there about what NCAR's support for the Consortium going forward might look like? I should first ask whether it's something that you'd like to continue doing, and if so, then I'll put on my project advocacy hat. I'll understand if you're looking for something else to do (in which case, maybe I could help with that too). e

dabail10 commented 4 years ago

I will take this discussion offline.

duvivier commented 4 years ago

@apcraig I've merged the grid and IC files. I think the next step is to try publishing some of these on Zenodo.

eclare108213 commented 4 years ago

A few comments:

The table looks great (but check the version number for gx3 JRA55).
An rsync option would be nice but doesn't have to be implemented immediately.
I like the idea of having a README in the tar files, but it would need to have a date as part of the filename to keep it from overwriting previous READMEs when untarring. It probably shouldn't contain a whole lot of information, maybe just link back to the data info on either wiki or Zenodo, so maybe it's not that useful. Maybe just put one README at the top of the data directory structure, linking back to the Consortium resources page?
A script for creating the tar files makes a lot of sense. I'm not sure where it should be kept, but it should be linked from our release instructions on the wiki.
Zenodo records can be easily updated after publication. Zenodo creates a unique DOI for each new dataset, but they can be linked to previous DOIs and labeled as 'updates'. We do this now for both Icepack and CICE.
A big question for Zenodo is who the author should be. I don't see why we couldn't have it be "CICE Consortium", but then I'm not sure what the institution should be.

I like the way this is turning out!

duvivier commented 4 years ago

@eclare108213 - comments in line below

The table looks great (but check the version number for gx3 JRA55).

I just checked this and it looks like in our release notes it was 6.0.2 in 10/2019 that had the JRA55 for gx3. We didn't actually include info about JRA55 for gx1 in the release notes for CICE6.1.1.

An rsync option would be nice but doesn't have to be implemented immediately.

@apcraig and I talked about this and I don't know of any of our options where that is do-able. We agreed this would be the optimal solution

I like the idea of having a README in the tar files, but it would need to have a date as part of the filename to keep it from overwriting previous READMEs when untarring. It probably shouldn't contain a whole lot of information, maybe just link back to the data info on either wiki or Zenodo, so maybe it's not that useful. Maybe just put one README at the top of the data directory structure, linking back to the Consortium resources page?

I like the idea of a README at the top level of both the CICE_data and Icepack_data directories and I've now added that to each of the tarballs (in progress, will be updated to FTP soon). The README just points to the resource index with the following text:

"The latest information about CICE forcing data and files can be found at the GitHub Resource Index: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index under the "Input Data" link."

A script for creating the tar files makes a lot of sense. I'm not sure where it should be kept, but it should be linked from our release instructions on the wiki.

I'm not sure either where the best place is. Since I use the same script for CICE and Icepack data, maybe in About-Us? Or we could break it out and store it in each repo under each "doc" directory along with details in the release instructions. @apcraig WDYT?

Zenodo records can be easily updated after publication. Zenodo creates a unique DOI for each new dataset, but they can be linked to previous DOIs and labeled as 'updates'. We do this now for both Icepack and CICE.

A big question for Zenodo is who the author should be. I don't see why we couldn't have it be "CICE Consortium", but then I'm not sure what the institution should be.

I think testing in Zenodo is the last step since we're pretty close here with the formatting and documentation. I can try doing that, but I'll need the CICE login info from you @eclare108213 (we can arrange that offline). I guess in addition to the release notes about making files, we should include info like how to do this data publication in the release notes too. And let's just list "CICE Consortium" as the authors and maybe also as the institution? I don't know, but it's something I'll experiment with.

apcraig commented 4 years ago

This all sounds good.

We need to have an official complete input data space that we use to create the tar files. Maybe that's on cheyenne or izumi or maybe somewhere in cgd. It cannot be the ftp directory. The script for creating the tar files should just be kept at NCAR somewhere. I don't know that we need to check it into any repo, but we could have a note on the wiki somewhere that says where the file is. The script and input data should be somewhere unscrubbed and we probably should have a backup place on the NCAR hpss.

I'm also willing to deal with the zenodo thing. I can play around a bit this weekend or early next week. Can someone summarize just a bit what we're planning. I assume we want it to be part of the CICE-Consortium project on zenodo. Do we want to test it first on the zenodo test space? Is there just one login for zenodo for the Consortium? Are we going to share that login? How do we want it organized? Is each tar file it's own DOI? How are we going to organize versioning? Could we have an Icepack Input Data DOI and a CICE Input Data DOI and then have all the tar files versioned under that? We want to organize the data and versions in a way that makes some sense. I don't think we want each tar file to be it's own independent DOI, do we? If so, is there another hierarchy somehow that allows us to create a subproject in zenodo and then have multiple unique DOIs under that?

eclare108213 commented 4 years ago

I don't think each tar file needs its own DOI - we can have multiple files under a single DOI. But I do think we need more than one DOI for the data. For instance, we could have a primary DOI for CICE data and one for Icepack data, and then updates to the data after that are linked to the original DOIs (zenodo has various labels for this, like 'is an update of' or supplement, etc) but get their own DOI numbers as they are added. We could make the 'basic' input data be the base DOI and then add the various types of data (COREII, JRA55, etc) separately under the same lineage. Let's take the login conversation offline.

duvivier commented 4 years ago

@apcraig I think we're ready for you to try the zenodo thing.

I have essentially finalized the wiki page (https://github.com/CICE-Consortium/CICE/wiki/CICE-INPUT-DATA-V2). Other than adding the DOI and zenodo links I think it's ready for production.

I have uploaded all 9 of the files we'll provide to the NCAR FTP. I've tested and all work. You should be able to download them directly that way as needed.

We'll also need to change Travis to point to these files (eventually).

In terms of your Zenodo questions: 1) I think we want to test first in the zenodo test space. Maybe try something small (e.g. all the grid_ic files first)? 2) I think that zenodo will take care of the versioning once we create an entry. So I don't think we need to worry about that now. 3) I think because any one dataset can only be 50GB, we'll need different datasets for the forcing data for gx1 at the very least. I suppose we could add all the gx3 with one DOI, but I was thinking we'd want the same format for all grids in terms of how we publish the data. 4) Here's a link to the CICE Consortium group in Zenodo (https://zenodo.org/communities/cice-consortium/?page=1&size=20) I think once we publish a dataset and say it's part of that community, it will automatically be associated with our group. We just want to tag these as datasets rather than software (e.g. https://zenodo.org/record/3718934#.XnVC3NNKiB0).

Let's start with that. If you get a chance to play this weekend or next week with it, then feel free to send any test pages my way to look over as well. :)

apcraig commented 4 years ago

Sounds like a good plan. Once I get the login info, I'll give it a try.

duvivier commented 4 years ago

I think we can close this issue now.