Closed xinaxu closed 2 years ago
@dkkapur
@xinaxu this is a lot of datasets! wow!
some of these look to be duplicates (both against what we have in Slingshot as well as in some instances across the proposed table). can i propose that you pick the top 10-15 that you'd like to see onboarded or would like to work on yourself?
@orvn @timelytree had some additional thoughts on adding more datasets as well. tagging them to share!
Still, there are just over 200 left, which is more than double Slingshot's 82 current datasets. I think it's valuable to have more datasets, but we also have to consider that adding a large quantity will require modifications to Slingshot's UI, especially:
@dkkapur I want to have most of them eligible for Slingshot at once. It's 40 PiB of data in total; assuming 10x replication, that's 400 PiB, or about 0.4 EiB, of useful data out of 15 EiB of current network capacity. It will be a good story to show and tell. Also, there are not many datasets left for Slingshot, so bringing in this list will encourage more people to join Slingshot. @orvn Thanks for spending time to sort and dedup. These datasets are also duplicates: KITTI, MMID. Hope it's not a great effort to add the filtering on the UI.
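The capacity estimate in the comment above can be checked with a quick calculation. This is only a minimal sketch: it assumes binary units (1 EiB = 1024 PiB), takes the 40 PiB, 10x and 15 EiB figures directly from the comment, and uses a hypothetical helper name.

```python
PIB_PER_EIB = 1024  # assumption: binary units, 1 EiB = 1024 PiB


def replicated_footprint_eib(raw_pib: float, replication: int) -> float:
    """Return the stored size in EiB once every copy is counted (hypothetical helper)."""
    return raw_pib * replication / PIB_PER_EIB


# Figures quoted in the comment: 40 PiB raw, 10x replication, 15 EiB capacity.
footprint = replicated_footprint_eib(raw_pib=40, replication=10)
share = footprint / 15

print(f"{footprint:.2f} EiB stored, {share:.1%} of network capacity")
```

Under these assumptions the replicated data comes to roughly 0.39 EiB, which is a few percent of the quoted capacity.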
@xinaxu - I agree that it would be good to scope this in. Proposing that we pull these in (maybe in subsets) for 3.1 with an impending design change in the program (June-ish onwards).
@dkkapur Sounds good. Will wait for that and revisit this.
Closing since those datasets are being used for V3.
| NOAA Severe Weather Data Inventory (SWDI) | The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event's location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event. Data about a specific event is added to the dataset within 120 days to allow time for damage assessments and other analysis. | 71.29 GiB | Various | https://registry.opendata.aws/noaa-swdi/ |
| Community Earth System Model v2 Large Ensemble (CESM2 LENS) | The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021. | 309.28 TiB | Various | https://registry.opendata.aws/ncar-cesm2-lens/ |
| ZINC Database | 3D models for molecular docking screens. | 658.32 TiB | Various | https://registry.opendata.aws/zinc15/ |
| NOAA Global Historical Climatology Network Daily (GHCN-D) | Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such. | 109.33 GiB | Various | https://registry.opendata.aws/noaa-ghcn/ |
| High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta | Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV | 95.62 GiB | Various | https://registry.opendata.aws/dataforgood-fb-hrsl/ |
| Image localization - fast.ai datasets | Some of the most important datasets for image localization research, including | 15.46 GiB | Various | https://registry.opendata.aws/fast-ai-imagelocal/ |
| Rapid7 FDNS ANY Dataset | Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in S3. | 151.26 GiB | Various | https://registry.opendata.aws/rapid7-fdns-any/ |
| Analysis Ready Sentinel-1 Backscatter Imagery | The Sentinel-1 mission is a constellation of | 49.7 TiB | Various | https://registry.opendata.aws/sentinel-1-rtc-indigo/ |
| NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17 | NOAA GOES-T will launch in March 2022!! For more information check out the GOES-T Webpage. | 1.36 PiB | Various | https://registry.opendata.aws/noaa-goes/ |
| Genome Ark | The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life. | 429.62 TiB | Various | https://registry.opendata.aws/genomeark/ |
| The Massively Multilingual Image Dataset (MMID) | MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. | 2.37 TiB | Various | https://registry.opendata.aws/mmid/ |
| Allen Cell Imaging Collections | This bucket contains multiple datasets (as Quilt packages) created by the | 54.41 TiB | Various | https://registry.opendata.aws/allen-cell-imaging-collections/ |
| Aristo Mini Corpus | 1,197,377 science-relevant sentences | 397.26 GiB | Various | https://registry.opendata.aws/allenai-aristo-mini/ |
| DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue | This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted | 1.23 GiB | Various | https://registry.opendata.aws/dialoglue/ |
| Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected | DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications. | 206.7 TiB | Various | https://registry.opendata.aws/deafrica-sentinel-1/ |
| NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019 | The NOAA UFS Marine Reanalysis is a global sea ice ocean coupled reanalysis product produced by the marine data assimilation team of the UFS Research-to-Operation (R2O) project. Underlying forecast and data assimilation systems are based on the UFS model prototype version-6 and the Next Generation Global Ocean Data Assimilation System (NG-GODAS) release of the Joint Effort for Data assimilation Integration (JEDI) Sea Ice Ocean Coupled Assimilation (SOCA). Covering the 40 year reanalysis time period from 1979 to 2019, the data atmosphere option of the UFS coupled global atmosphere ocean sea ice (DATM-MOM6-CICE6) model was applied with two atmospheric forcing data sets: CFSR from 1979 to 1999 and GEFS from 2000 to 2019. Assimilated observation data sets include extensive space-based marine observations and conventional direct measurements of in situ profile data sets. | 6.97 TiB | Various | https://registry.opendata.aws/noaa-ufs-marinereanalysis/ |
| COVID-19 Harmonized Data | A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis | 76.99 GiB | Various | https://registry.opendata.aws/talend-covid19/ |
| International Neuroimaging Data-Sharing Initiative (INDI) | This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG) | 268.82 TiB | Various | https://registry.opendata.aws/fcp-indi/ |
| NOAA Oceanic Climate Data Records | NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC). | 228.21 GiB | Various | https://registry.opendata.aws/noaa-cdr-oceanic/ |
| NOAA Climate Forecast System (CFS) | The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite observations. Please note that the data in this bucket are the CFSv2 Operational Forecasts. To obtain other CFSv2 products such as the Operational Analysis, please visit our website. | 357.76 TiB | Various | https://registry.opendata.aws/noaa-cfs/ |
| A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) | This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3. | 452.75 GiB | Various | https://registry.opendata.aws/cse-cic-ids2018/ |
| Cell Painting Image Collection | The Cell Painting Image Collection is a collection of freely | 1.94 TiB | Various | https://registry.opendata.aws/cell-painting-image-collection/ |
| YouTube 8 Million - Data Lakehouse Ready | This is both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. | 3.17 TiB | Various | https://registry.opendata.aws/yt8m/ |
| NOAA National Water Model Short-Range Forecast | The National Water Model (NWM) is a water resources model that simulates and forecasts water | 27.73 TiB | Various | https://registry.opendata.aws/noaa-nwm-pds/ |
| SondeHub Radiosonde Telemetry | SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself. | 59.05 GiB | Various | https://registry.opendata.aws/sondehub-telemetry/ |
| NOAA Global Forecast System (GFS) | The Global Forecast System (GFS) is a weather forecast model produced | 936.41 TiB | Various | https://registry.opendata.aws/noaa-gfs-bdp-pds/ |
| UCSC Genome Browser Sequence and Annotations | The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-sep format and all binary files in custom formats, sometimes referred to as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome annotation track has been created by an academic research group, or, in a few cases, by commercial companies. Please acknowledge them by citing them. The information can be found by going to https://genome.ucsc.edu, selecting the respective genome assembly and clicking on the data track. At the end of the documentation, we provide a list of references and acknowledgements. | 73.11 TiB | Various | https://registry.opendata.aws/ucsc-genome-browser/ |
| Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1 | Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic, meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x. | 273.6 GiB | Various | https://registry.opendata.aws/surftemp-sst/ |
| NOAA Global Ensemble Forecast System (GEFS) Re-forecast | NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these extend in lead time to +35 days. | 388.8 TiB | Various | https://registry.opendata.aws/noaa-gefs-reforecast/ |
| Deutsche Börse Public Dataset | The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal. | 16.05 GiB | Various | https://registry.opendata.aws/deutsche-boerse-pds/ |
| Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1 | The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2). | 3.1 TiB | Various | https://registry.opendata.aws/deafrica-alos-jers/ |
| GeoNet Aotearoa New Zealand Data | GeoNet provides geological hazard information for Aotearoa New Zealand. This dataset contains data and products recorded by the GeoNet sensor network. The dataset currently includes GNSS data and additional datasets will be added in the near future. GNSS (Global Navigation Satellite System) data include raw data in proprietary and Receiver Independent Exchange Format (RINEX) and local tie-in survey conducted during equipment changes, more details can be found on 'the GeoNet geodetic page' website. Coastal gauge data include relative measurement of sea level measured by tsunami monitoring gauges. Raw and quality control data are provided in CREX format (Character Form for the Representation and eXchange of meteorological data), more details can be found on 'the GeoNet coastal tsunami monitoring gauges page'. | 7.73 TiB | Various | https://registry.opendata.aws/geonet/ |