DOI-USGS / pgmtl-data-release

A repository for data release scripts and workflows for releasing process-guided meta-transfer learning predictions

Data review #17

Open matthewross07 opened 3 years ago

matthewross07 commented 3 years ago

Overall

I think the overall webpage, data, and metadata are well done and will likely lead to successful download, use, and reuse by folks reproducing the data. My notes for each data product are below:

Lake information

1) A minor note, but why use shapefiles? These are often harder to track, proprietary, and come with multiple files. GeoPackage (.gpkg) is faster, open, and comes as a single file (a quick sketch of the conversion is below this list).

2) The lake metadata column names either need more explanation (what is ws_mean?) or a link to the original dataset these metadata come from. (Is this LakeCat data?)
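
For point 1, a quick sketch of the conversion I have in mind (file and layer names are placeholders, and this assumes geopandas is available):

```python
# Minimal sketch: converting a released shapefile to a single-file GeoPackage.
# "lake_locations.shp" and the layer name are placeholders, not the real release files.
import geopandas as gpd

gdf = gpd.read_file("lake_locations.shp")                          # multi-file shapefile
gdf.to_file("lake_locations.gpkg", layer="lakes", driver="GPKG")   # one open, portable file
```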

Lake temp observations

Model config

Model inputs

No major issue, though the plethora of zips would bother me if I wanted all the data. Is this some kind of limit on how much data can be in a single object? Why not put all these data into a single (much bigger) zip? Or give the user both options: here's ALL the data, and here's the broken-apart data if you know where you want to work.
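
For what it's worth, the user-side workaround is simple enough; something like this is all I'd be doing anyway (the paths below are placeholders, not actual release file names):

```python
# Unpack many per-group zips into one local directory after downloading them.
from pathlib import Path
import zipfile

download_dir = Path("downloads")      # wherever the zips were saved
out_dir = Path("model_inputs_all")
out_dir.mkdir(exist_ok=True)

for zip_path in sorted(download_dir.glob("*.zip")):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
```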

Model predictions

This is where I struggled the most to evaluate what each dataset does and why there are so many different data objects. Can these not be unified slightly more to decrease object proliferation? Like just all pball predictions in one object and all pbmtl in another? It looks like those objects would be ~1 GB, and I just think it will be easier for an end user.

Other than that, the data is clear and well documented.

Model evaluation

Again, I prefer one zipped folder with sub-datasets inside it instead of so many separate (relatively small) zipped folders. But fundamentally, what is in the data is clear and understandable.

Final comments

Overall this is an exciting dataset, a cool method, and a nice paper/approach. I'm excited to try doing something similar with remote sensing of water quality!

However, I think the number of data objects could be cut back substantially, which would make the data easier to work with and clearer. Additionally, I think a tutorial showing folks how to use your PGDL and MTL code with the datasets you published would amplify the value of this data many times over. Maybe you already have that in a separate code release, but showing us exactly how you go from all this raw data to your predictions would be really useful.

Nice work, and if you need additional comments, let me know,

Matt

jordansread commented 3 years ago

@matthewross07 great review and feedback. I am working on changes and responses now.

> The lake metadata column names either need more explanation (what is ws_mean?) or a link to the original dataset these metadata come from. (Is this LakeCat data?)

One thing that I think would help is to make it clear in the data summary section that the metadata file is where to find detailed definitions of each column/attribute in the data files. Making sure users know to look at this file for the lake metadata item is critical to understanding what each of the many column headers means, especially since they use shortened names that aren't intuitive if the pattern isn't clear.

> A minor note, but why use shapefiles?

This is mostly a carryover of convenience and of ScienceBase's support for this format, although GeoJSON has been supported in the same way for a while now and there is not a good reason we haven't switched. That said, one of the reasons to use a shapefile (or GeoJSON) with the current system is that ScienceBase can automatically create extensions for these file types that expose a WFS (web feature service), which can be used to do all kinds of things, including supporting downloads in different formats: JSON, SHAPE-ZIP, GML3, JSONP, WFSKMLOutputFormat, CSV, and GML2 (see getFeature ResultFormat here, for example).
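
As a rough illustration of that request pattern (the endpoint URL and layer name below are placeholders, not the actual values for this release):

```python
# Sketch of a WFS GetFeature request against a ScienceBase-hosted spatial item,
# asking for the features back as CSV instead of the shapefile itself.
import requests

# Placeholder endpoint and layer name, for illustration only.
WFS_ENDPOINT = "https://www.sciencebase.gov/catalogMaps/mapping/ows/EXAMPLE_ITEM_ID"

params = {
    "service": "WFS",
    "version": "1.0.0",
    "request": "GetFeature",
    "typeName": "sb:lake_polygons",  # hypothetical layer name
    "outputFormat": "CSV",           # or SHAPE-ZIP, GML3, JSON, GML2, ...
}

resp = requests.get(WFS_ENDPOINT, params=params, timeout=120)
resp.raise_for_status()
with open("lake_polygons.csv", "wb") as f:
    f.write(resp.content)
```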

> Can these not be unified slightly more to decrease object proliferation?

These decisions were based more on practical limitations and the fairly high failure rate for uploading and downloading large files on ScienceBase, but I agree it isn't ideal. For the inputs in particular, we'd like to take advantage of cloud-optimized formats in the future, which could support a single file but still give users the ability to download only their area of interest (which is supported currently by our spatial grouping of files, since many users of the data end up being interested in a single county instead of the whole thing). More details about those changes are here.
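
As a rough sketch of what that area-of-interest access could look like for a user, assuming a Parquet-style cloud-optimized layout (the file URL and column names below are hypothetical, not part of this release):

```python
# Illustrative only: reading a county-sized subset from a single cloud-hosted file.
# Reading from s3:// also assumes pyarrow and s3fs are installed.
import pandas as pd

INPUTS_URL = "s3://example-bucket/pgmtl_model_inputs.parquet"  # hypothetical

# With a columnar, cloud-optimized layout, the filter can be pushed down so only
# the row groups matching the requested county are actually transferred.
df = pd.read_parquet(
    INPUTS_URL,
    columns=["site_id", "date", "AirTemp", "ShortWave"],  # hypothetical columns
    filters=[("county_fips", "==", "27053")],             # hypothetical partition column
)
```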

Will be working on the rest of the questions/suggestions, but wanted to share these first. Thank you!

matthewross07 commented 3 years ago

Thanks Jordan!

My bad for not finding that column definition folder; it was probably obvious, but yeah, making it super clear where those definitions can be found would be great.

Shapefile reason makes total sense. Hadn't thought of the instant visual aspect.

And understood re: object proliferation; I figured it was something like that. Love the future download capability. A clear need for so much NASA/USGS federal data (download only what you care about).

Thanks!