ESIPFed / gsoc

Project ideas and mentor guidance for ESIP members to participate in Google Summer of Code.
Apache License 2.0

Service for verifying data authenticity of cloud-distributed data stores #8

Closed: edkearns closed this issue 6 years ago

edkearns commented 6 years ago

Idea

NOAA has been publishing its open data on commercial cloud partners’ platforms as part of its Big Data Project in order to enable easier access and use of those data. However, when users consume NOAA data from a non-NOAA partner platform there needs to be a way for users to verify the authenticity of those data. Ideas are solicited for techniques to verify that those cloud-based data are indeed the same as the original copy of NOAA data. These techniques should allow dynamic, on-demand verification of data at the file/object level, and/or at the tool level (e.g. database or visualization).

Techniques could include, but are not limited to, open distributed ledgers (blockchain), API-accessible catalogs, and similar approaches. Open-source tools such as Hyperledger are encouraged but not required.
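
As a rough illustration of the file/object-level case, here is a minimal Python sketch assuming the authoritative NOAA copy publishes a manifest of SHA-256 digests; the manifest URL, its JSON layout, and the object name below are hypothetical placeholders, not an existing NOAA service:

```python
import hashlib
import json
import urllib.request

CHUNK = 1024 * 1024  # read in 1 MB chunks so large objects need not fit in memory


def sha256_of(path):
    """Compute the SHA-256 digest of a local file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, manifest_url):
    """Check a downloaded object against a published manifest of digests.

    The manifest is assumed (hypothetically) to be JSON mapping object names
    to SHA-256 digests, e.g. {"goes16/OR_ABI-L1b.nc": "ab12..."}.
    """
    with urllib.request.urlopen(manifest_url) as resp:
        manifest = json.load(resp)
    expected = manifest.get(path)
    return expected is not None and expected == sha256_of(path)
```

Tool-level verification (databases, visualization) would need something beyond per-object digests, since the cloud copy may no longer be byte-identical to the original files.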

NOAABigData

Skills Needed (No prescribed technologies or languages.)

Mentors: Ed Kearns, NOAA Chief Data Officer

vyomshm commented 6 years ago

Hello,

I'm interested in working on this. I have some questions and doubts.

  1. Ideally, what would be the expected outcome of the project? In my mind, I'm envisioning an open API backed by immutable records/indexes of all files (in a particular dataset, for instance) stored on an open distributed ledger. Am I close?

  2. If the proposal includes using a blockchain, is there any preference for private/permissioned ledger systems (e.g. Hyperledger) as opposed to open ledger systems (e.g. Ethereum)?

  3. A verification system using immutable distributed-ledger catalogs might require some computational resources (e.g. hosting an IPFS node) and/or financial resources (gas costs in the case of Ethereum or Filecoin) on the part of the organization. What constraints, parameters, and non-functional requirements should we keep in mind before proposing a solution?

Lastly, I would appreciate any guidance on how to proceed with this project.

Thanks!

skybristol commented 6 years ago

Isn't this problem complicated by the fact that at least some of the data housed on different cloud providers may be transformed into completely different representations? How Amazon decides to house and expose the data on its platform is different from how Google makes the data available to Google Earth Engine. Could an approach include some form of model or workflow that generates a calculated result, such as a space/time metric at some scale, that exercises enough of the data to statistically verify adequate data integrity?

edkearns commented 6 years ago

Thanks for your interest.

  1. Yes, I think you are on the right track. An open catalog or ledger that lets users understand where the data came from and confirm that they are authentic would be one kind of solution.
  2. No preference at this point.
  3. The organization may be willing to underwrite the costs of the computation/resources required for the solution, but an economical solution (in both funds and energy) would be preferable, of course.

Cheers, Ed
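
A catalog or ledger along the lines Ed describes could be as simple as an append-only, hash-chained record per published object, which any consumer can replay to detect a tampered or reordered provenance trail. The sketch below is illustrative only, not a proposed NOAA design; every field name is an assumption:

```python
import hashlib
import json
import time


def entry_hash(entry):
    """Hash a catalog entry using a canonical (sorted-key) JSON encoding."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def append_record(chain, object_key, object_digest, provider):
    """Append a provenance record linked to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else None
    entry = {
        "object": object_key,
        "sha256": object_digest,
        "provider": provider,       # e.g. "aws", "gcp" -- hypothetical labels
        "timestamp": time.time(),
        "prev_hash": prev,
    }
    entry["hash"] = entry_hash(entry)
    chain.append(entry)
    return entry


def chain_is_intact(chain):
    """Verify that no entry has been altered, dropped, or reordered."""
    prev = None
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(body):
            return False
        prev = entry["hash"]
    return True
```

Whether such a chain lives on an actual blockchain, in a permissioned ledger, or simply in a signed, publicly mirrored log is exactly the design question raised above.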


edkearns commented 6 years ago

Right, and the original data that are the feedstock for the new presentation could/should be verified, and perhaps the process that creates the new data forms should also be verified in order for the new data forms to be considered authenticated. And yes, a statistical subsampling may be the most economical approach given the large number of data files and data points inherent in the problem.

Ed
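
The statistical-subsampling idea being discussed here might look roughly like the following sketch, assuming both copies of a scene can be loaded as NumPy arrays of the same shape; the sample size and tolerance are placeholders, not validated thresholds:

```python
import numpy as np


def spot_check(original, cloud_copy, n_samples=10_000, rtol=1e-6, seed=0):
    """Compare randomly sampled points of two gridded datasets.

    `original` and `cloud_copy` are assumed to be NumPy arrays with the same
    shape (e.g. the same scene read from NOAA's archive and from a cloud
    provider's re-packaged form). Returns True if all sampled values agree
    within a relative tolerance.
    """
    rng = np.random.default_rng(seed)
    flat_size = original.size
    idx = rng.choice(flat_size, size=min(n_samples, flat_size), replace=False)
    a = original.reshape(-1)[idx]
    b = cloud_copy.reshape(-1)[idx]
    return bool(np.allclose(a, b, rtol=rtol, equal_nan=True))
```

For transformed representations, the comparison would instead run on a derived quantity (the "space/time metric" suggested above) computed from each copy through the same workflow.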


esip-lab commented 5 years ago

@edkearns any chance you or another NOAA mentor would like to update this issue and reopen it for 2019 GSoC?