IQSS / dataverse-pm

Project management issue tracker for the Dataverse Project. Note: Related links and documents may not be public.
https://dataverse.org

NIH ODSS supplement (NetCDF) #22

Closed sync-by-unito[bot] closed 8 months ago

sync-by-unito[bot] commented 1 year ago

FAIR Principles for Geospatial Data

Aims from National Cohort Studies of Alzheimer's Disease, Related Dementias and Air Pollution, Award#3R01AG066793-02S2:

Depositing Wildfire Exposures Dataset into Harvard Dataverse

Aim from Award#3RF1AG071024-01S1:

Update Summary (November):

Aim 1: We identified three possible options that would aid in automatic metadata extraction and generation. The next step is to decide which tool, if any, we want to use and build on for the metadata extraction.

Aim 2: We had meetings with the experts from the Center for Geographic Analysis, and we have also interviewed a few researchers (and plan to interview more) with the same goal. The next step is to decide whether to support solely geospatial files or also high-dimensional files used in other fields (such as genetics).


┆Issue is synchronized with this Smartsheet row by Unito

pdurbin commented 1 year ago

FAIR Principles for Geospatial Data

Aims from Award#3R01AG066793-02S2:

https://reporter.nih.gov/project-details/10594215 has more detail and the following context:

Project Summary

Background. The specific aims of the parent grant (R01 AG066793) are to conduct national epidemiological studies of Medicare and Medicaid claims to estimate the effects of long-term exposures to air pollution on Alzheimer's disease (AD) and related dementias (ADRD) hospitalization and disease progression (Aim 1), to apply machine learning methods to identify co-occurrence of individual-level, environmental, and societal factors that lead to increased vulnerability (Aim 2), and to develop statistical methods to disentangle the effects of air pollution exposure from other confounding factors and to correct for potential outcome misclassification (Aim 3). The parent R01 relies on a wide range of epidemiological data, ranging from environmental exposures to claims data to meteorological and socioeconomic factors. While we curated massive amounts of data for the parent R01, the data has not been deposited to a public data repository and we have not made it publicly available.

Overall Goals. With this supplement our goal is to enable effective dissemination and reuse of high-dimensional, high-volume data, including the data products from the parent R01. These goals will be achieved by forming a new collaboration and partnership with Harvard Dataverse, expanding its current capacity to store and share geospatial public health data. Our specific aims are to: implement automatic metadata extraction for the high-dimensional dataset formats NetCDF and HDF5 (Aim 1), implement an integration with Jupyter Binder that allows exploration and viewing of complex high-dimensional data from Dataverse (Aim 2), and enhance reuse, community engagement, and reproducibility of R01 research with the demonstration of data analysis using synthetic CMS claims data (Aim 3).

Impact. Each new feature in the Dataverse software platform, such as the ones proposed in Aims 1 and 2, is typically propagated to all 77 Dataverse installations, which enhances their impact worldwide. Dataverse also has a vibrant community of open-source contributors and digital libraries, and organizes annual community meetings at Harvard that last for several days. We will use this platform to promote the work in this supplement and engage the community, and we will attend other workshops and conferences with the same goal. We will closely follow the impact of these developments through an existing Dataverse collaboration with the Make Data Count project, which provides usage metrics standardization (such as the number of views, downloads, and citations of data) and enables us to self-monitor and improve our data releases. This supplement will also have a direct impact on the parent R01, enhancing reuse, community engagement, and reproducibility, which will in turn lead to more robust epidemiological conclusions (Aim 3).

Depositing Wildfire Exposures Dataset into Harvard Dataverse

Aim from Award#3RF1AG071024-01S1:

https://reporter.nih.gov/project-details/10593837 has more detail and the following context:

"demonstrate the transformed data in an AI/ML application to predict wildfire PM2.5 exposure for California (Aim 3)." Share specific dataset that the group prepares about wildfire predictions. NetCDF format.

mreekie commented 1 year ago

@atrisovic and I met on 2022-10-03 and started a brainstorming doc for this project.

For now, that doc is in a folder called NetCDF because "NIH ODSS supplement" doesn't exactly roll off the tongue. 😄

All are welcome to read and comment on anything in that folder!

Since NetCDF support is part of the project, this existing issue is related:

I'll paste from that issue an image I added after our (short) brainstorming session that explains NetCDF a bit. It comes from a .nc4 file at https://github.com/energy-policy-institute-uchicago/xarray-notebooks/blob/master/xarray-basics.ipynb

[Image: xarray rendering of a .nc4 dataset, showing its dimensions and lat/lon coordinate variables]

Note that latitude and longitude appear in the image above. Part of our plan is to extract geospatial data (perhaps in GeoJSON format, so it can be previewed) from NetCDF files. We plan to reach out to people interested in geospatial features in Dataverse and have started composing a draft email to send to the main Dataverse list.
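To make that idea a bit more concrete, here's a minimal sketch of what such an extraction could look like in Python with xarray (the library used in the notebook linked above). The file name, the "lat"/"lon" variable names, and the bounding-box-as-GeoJSON approach are illustrative assumptions, not a decided design:

```python
# Sketch: derive a GeoJSON bounding box from a NetCDF file's lat/lon
# coordinate variables. Assumes the file names them "lat" and "lon",
# as in the xarray-basics example; a real extractor would have to cope
# with other naming conventions and with files lacking coordinates.
import json
import xarray as xr

ds = xr.open_dataset("example.nc4")  # hypothetical file name

lat_min, lat_max = float(ds["lat"].min()), float(ds["lat"].max())
lon_min, lon_max = float(ds["lon"].min()), float(ds["lon"].max())

bounding_box = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [lon_min, lat_min],
            [lon_max, lat_min],
            [lon_max, lat_max],
            [lon_min, lat_max],
            [lon_min, lat_min],  # close the ring
        ]],
    },
    "properties": {"source": "example.nc4"},
}

with open("example-bbox.geojson", "w") as f:
    json.dump(bounding_box, f, indent=2)
```

The resulting GeoJSON could then, in principle, be handed to an existing previewer, but that's exactly the kind of decision we still need to make.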

In short, it's early days for this project but it's an exciting one! Please feel free to get in touch with us at https://chat.dataverse.org

pdurbin commented 1 year ago

@atrisovic and I had a meeting yesterday: https://docs.google.com/document/d/1zSf6SkMLOJ0h7clc6jM8m0CtvHN4w22Pm_qg0KX356o/edit?usp=sharing

@philippconzett seemed interested in NetCDF, and posted this: "We/DataverseNO are interested in Dataverse support for NetCDF. Mostly in the context of data sharing (using data server technology and APIs, e.g., OPeNDAP) and metadata sharing (using more traditional APIs). For example, we already have and will be getting data that preferably should be able to share data and metadata following the SIOS Guidelines for metadata and data sharing (cf. https://sios-svalbard.org/sites/sios-svalbard.org/files/common/sdms-guidelines4providers.pdf). If I understand their requirements and recommendations correctly (it's a while ago I read them), NetCDF plays an important role in these approaches. Not sure if use cases like these were in your mind when you asked about NetCDF ;-)"

We already have an OPeNDAP issue but it's probably out of scope for the NetCDF project:

@siacus @mreekie @atrisovic and I had a meeting today: https://docs.google.com/document/d/1azN_uIc3f5MNP7eHm_gvvSfJ-4wrGjLknm5kOZ2fffE/edit?usp=sharing

Lots to dig into! I also updated the description of this issue to explain where this project is coming from in terms of funding from two different grants.

Here's some metadata we were able to pull out of a NetCDF file Stefano found at https://github.com/alisonboyer/daac_tutorial_work-1/blob/master/gimms3g_ndvi_1982-2012.nc4 :

$ java -jar build/libs/netcdfAll-5.5.4-SNAPSHOT.jar gimms3g_ndvi_1982-2012.nc4 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
netcdf gimms3g_ndvi_1982-2012.nc4 {
  dimensions:
    nv = 2;
    time = UNLIMITED;   // (31 currently)
    lat = 840;
    lon = 4320;
  variables:
    double time(time=31);
      :units = "months since 1982-01-01 00:00:00";
      :bounds = "time_bnds";
      :calendar = "standard";
      :standard_name = "time";
      :_ChunkSizes = 1U; // uint

    double time_bnds(time=31, nv=2);
      :calendar = "standard";
      :units = "months since 1982-01-01 00:00:00";
      :_ChunkSizes = 1U, 2U; // uint

    double lat(lat=840);
      :standard_name = "latitude";
      :long_name = "latitude";
      :units = "degrees_north";
      :_ChunkSizes = 840U; // uint

    double lon(lon=4320);
      :standard_name = "longitude";
      :long_name = "longitude";
      :units = "degrees_east";
      :_ChunkSizes = 4320U; // uint

    double NDVI(time=31, lat=840, lon=4320);
      :grid_mapping = "crs";
      :standard_name = "normalized_difference_vegetation_index";
      :long_name = "Mean Normalized Difference Vegetation Index in growing season (June, July, and August)";
      :cell_methods = "area: mean time: mean";
      :_FillValue = -9999.0; // double
      :missing_value = -9999.0; // double
      :_ChunkSizes = 1U, 120U, 720U; // uint

  // global attributes:
  :GDAL_AREA_OR_POINT = "Area";
  :GDAL = "GDAL 1.10.0, released 2013/04/24";
  :Conventions = "CF-1.6";
  :title = "Mean Normalized Difference Vegetation Index in growing season (June, July, and August)";
  :source = "GIMMGS3g";
  :contact = "Kevin Guay";
  :institution = "Woods Hole Research Center";
  :email = "kguay@whrc.org";
  :references = "Guay, K.C., P.S.A. Beck, L.T. Berner, S.J. Goetz, A. Baccini, and W. Buermann. 2014. Vegetation productivity patterns at high northern latitudes: a multi-sensor satellite data assessment. Global Change Biology 20(10):3147�3158. doi:10.1111/gcb.12647";
  :history = "Converted to CF-netCDF v4 at Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) on Feb. 10th, 2015";
}

Here's how it looks in the Grid Viewer:

[Screenshot: metadata from gimms3g_ndvi_1982-2012.nc4 as rendered in the Grid Viewer, 2022-10-26]

Note that not all NetCDF files have so much metadata. https://dataverse.harvard.edu/file.xhtml?fileId=6550315&version=1.0 for example doesn't have any global attributes. No title, no author, no email.
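As a point of comparison with the command-line dump above, here's a rough sketch of how those global attributes could be read programmatically, using the netCDF4 Python library. The helper function name and the specific attributes being looked up are just illustrations, not a proposed mapping into Dataverse metadata, and the sketch degrades gracefully for files (like the one linked above) with no global attributes at all:

```python
# Sketch: pull global attributes (title, contact, institution, ...) out of
# a NetCDF file, tolerating files that have none.
from netCDF4 import Dataset

def extract_global_attributes(path):
    with Dataset(path) as nc:
        # ncattrs() lists the names of the file's global attributes;
        # it's simply an empty list for files without any.
        return {name: nc.getncattr(name) for name in nc.ncattrs()}

attrs = extract_global_attributes("gimms3g_ndvi_1982-2012.nc4")
print(attrs.get("title", "no title"))
print(attrs.get("contact", "no contact"))
print(attrs.get("institution", "no institution"))
```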

Next steps for us:

mreekie commented 1 year ago

@pdurbin @atrisovic @siacus This is the first problem statement that I'm looking over with respect to the meeting notes.

The only thing I did here was move things around a bit to match the pattern of how these issues are laid out.

The pattern for these issues is:

The Description:

The 1st Comment


@pdurbin I added to: IQSS/dataverse#9110

Next Steps:

mreekie commented 1 year ago


@pdurbin I created a draft issue with all the right labels for the first of the two design papers you talked about today:

mreekie commented 1 year ago

The Google Drive for this is here.

mreekie commented 1 year ago

November update:

Aim 1: We identified three possible options that would aid in automatic metadata extraction and generation. The next step is to decide which tool, if any, we want to use and build on for the metadata extraction.

Aim 2: We had meetings with the experts from the Center for Geographic Analysis, and we have also interviewed a few researchers (and plan to interview more) with the same goal. The next step is to decide whether to support solely geospatial files or also high-dimensional files used in other fields (such as genetics).

mreekie commented 1 year ago

@pdurbin @atrisovic I'm prepping for the year 2 GREI deliverables reporting. I want to drop you a note. I've learned a lot in the past year.

One of the things I've learned is that I'm not actually involved in tracking this for reporting purposes back to the NIH. The reason is that the grant went to you, Ana, rather than to Stefano or someone else in my reporting chain, so it falls outside the work I need to show funded progress for.

I'm pretty sure the two of you tried to explain that to me last year but I didn't follow. Sorry.

Going forward, I'll leave this project issue in place. It's helpful to have it here, but I won't be updating it.


For what it's worth, the labeling works like this.

This may be too much information, but Phil, you may find it helpful.

pm.f02 identifies this as the item that Ana got funded for; pm.f01 is the GREI four-year work.

You'll see other labels like pm.f01-d-y01-a03-t01. These will be associated both with the deliverable issue and with the individual issues that apply towards the paid deliverables. NetCDF won't have these as I'm not tracking the individual issues.

As for labels beyond that, another thing I learned is that the team has its own strong feelings about how development work should be labelled, and sometimes my "pretty" labels caused confusion. This is why this label looks so obviously awkward: no dev in their right mind would make such ugly labels. :)

I did not interfere with the "NIH: NetCDF" label. The two of you may be using that label to group the dev work, so I don't want to replace it. If it's helpful to you, use it. If it's not, feel free to delete it.

atrisovic commented 1 year ago

Hi @mreekie ok looks good! Let me know if you need any input or info from me :)

mreekie commented 1 year ago

@pdurbin @atrisovic Another Housekeeping update:

mreekie commented 1 year ago

Grooming:

Todo:

cmbz commented 8 months ago

@pdurbin is there anything else required for this issue to be completed, or can it be closed?

pdurbin commented 8 months ago

It can be closed. I'll do it. The main thing that remains is that @atrisovic is writing a paper. @kmika11 and @JR-1991 are helping as well. I'm planning on helping too.