filecoin-project / devgrants

👟 Apply for a Filecoin devgrant. Help build the Filecoin ecosystem!

Tamarin's De-identification of unstructured data: Filecoin Open Grant Application #1745

Closed Airwhale closed 5 months ago

Airwhale commented 6 months ago

Open Grant Proposal: Tamarin's De-identification of unstructured data: Filecoin Open Grant Application

Project Name: Open Source De-Identification tool for unstructured data.

Proposal Category: Developer and data tooling

Individual or Entity Name: Tamarin Health.

Proposer: https://www.linkedin.com/in/katherinekuzmeskas/

Project Repo(s): https://github.com/NectarProtocol/mpc-crypto-lib is our open-source MPC engine. The work in this proposal will likely be an alternative way of handling data entering our ecosystem.

(Optional) Technical Sponsor: Porter Stowell

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Summary

Our deliverable will be a safe, secure, and effective LLM-powered de-identification module for anyone’s use. This protocol will have four main modules, which we will supplement with a HIPAA compliance network:

- Identification Module: An LLM designed and engineered to detect personal identifying information (PII) within datasets.
- Anonymization Module: A second LLM that replaces detected PII with anonymized, randomized data. This includes the capability to bridge datasets with data from synthetic data providers. Creating synthetic data that is statistically similar to the original data is a multi-million dollar problem, and thus beyond the financial scope of this grant.
- Pruning Module: Applies a deterministic finite state machine to eliminate any remaining indirect identifiers and ensure compliance with the user's k-anonymity requirements.
- Encryption Module: Optionally encrypts the user's data and writes it to Filecoin.
- HIPAA Compliance Support Network and Documentation: Full HIPAA compliance, per Federal regulation, is not possible (on the Filecoin network or elsewhere) without an expert, human (not machine/software) determination that the data is sufficiently de-identified. Even though it is not a software module, we believe it is crucial to develop a network of vetted providers that can facilitate full HIPAA compliance; without it, HIPAA compliance cannot be achieved. We will secure at least five HIPAA compliance auditors and at least five synthetic data production vendors through a robust assessment and verification process covering the hundreds of potential companies. This will give Filecoin network users an immediate, validated, and necessary resource to fully utilize their data in a HIPAA-compliant manner. Additionally, we will create HIPAA compliance guides that Filecoin may use in its developer documentation to help guide Filecoin users toward HIPAA compliance.
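To make the Pruning Module's k-anonymity requirement concrete, here is a minimal Python sketch. It is not the deterministic finite state machine described above; it swaps in plain record suppression, one of the simplest ways to reach a target k. The field names (`zip3`, `age_band`, `diagnosis`) and both function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same quasi-identifier values (0 for an empty dataset)."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop every record whose quasi-identifier combination occurs fewer
    than k times. This is plain suppression, not generalization."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [
        r for r in records
        if groups[tuple(r[q] for q in quasi_identifiers)] >= k
    ]


# Hypothetical toy dataset: zip3 and age_band act as indirect identifiers.
records = [
    {"zip3": "065", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip3": "065", "age_band": "30-39", "diagnosis": "diabetes"},
    {"zip3": "065", "age_band": "40-49", "diagnosis": "asthma"},
]
QUASI = ["zip3", "age_band"]

# The third record is unique on (zip3, age_band), so it is suppressed for k=2.
pruned = suppress_small_groups(records, QUASI, k=2)
```

Suppression discards data, so a production system would more likely generalize values first (for example, widening age bands) and suppress only what still falls below k.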

This de-identification system will handle various data types, including text and structured data, effectively removing both direct and indirect personal identifying information (PII). It offers flexibility in data handling by supporting both encrypted and unencrypted data storage. Success for this project is the reliable, efficient creation of de-identified, HIPAA-compliant data that is legal for use by entities that would typically not be able to access it under current regulations. We aim to have our system achieve a 100% success rate as determined by the HIPAA Compliance Support Network we create.
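The detect-then-replace flow of the Identification and Anonymization Modules can be sketched as follows. This is an illustration under loud assumptions: regular expressions stand in for the LLM detector, the three patterns and the placeholder format are hypothetical, and the replacements are deterministic placeholders rather than the randomized data the proposal describes.

```python
import re

# Hypothetical patterns standing in for the LLM detector. Real PII
# detection needs far more than regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect(text):
    """Identification step: return (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def anonymize(text, spans):
    """Anonymization step: replace each span with a placeholder, reusing the
    same placeholder whenever the same value appears again."""
    placeholders = {}
    out, cursor = [], 0
    for start, end, label in spans:
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        value = text[start:end]
        if value not in placeholders:
            seen = sum(1 for p in placeholders.values() if p.startswith(f"[{label}_"))
            placeholders[value] = f"[{label}_{seen + 1}]"
        out.append(text[cursor:start])
        out.append(placeholders[value])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sample = "Contact jane@example.com or 203-555-0100. Again: jane@example.com."
print(anonymize(sample, detect(sample)))
```

Keeping detection and replacement as separate functions mirrors the two-module split in the proposal, so either step could be swapped for a model-backed implementation without touching the other.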

Adoption, Reach, and Growth Strategies

The market opportunity for Tamarin, particularly in healthcare data security and privacy, is rapidly expanding. Fortune predicts a 41% compound annual growth rate for the data privacy market alone, reaching well over $30 billion in 2030 (“Data Privacy Software Market Size, Share & Growth 2023-2030” 2023). The global healthcare data market was worth $26.7 billion in 2022 and is projected to be $122.2 billion in 2030 (Patel, 2023). The projected surge in the Healthcare Data Analytics market is driven by the vast and growing volume of healthcare data, a shift towards value-based care, advancements in AI and big data technologies, and stringent privacy regulations such as GDPR, HIPAA, and new State-based regulations such as Washington State’s My Health, My Data which went into effect just a few months prior, in October 2023.

We expect our system to be especially useful for governmental institutions, which are often tasked with distributing data that must first be de-identified, and for research institutions, which can use our system to de-identify data released as supplemental information to published papers.

As we market and acquire customers for our decentralized, privacy-preserving identity-linked HIPAA-compliant database, Nectar, we intend to inform users of this Filecoin De-ID tool. This tool can and should be used when maintaining identity linkage in a dataset is not important or after databases have been joined through Nectar. Excitingly, we have recently changed the architecture of Nectar so that it can store, manipulate, and retrieve data on the Filecoin network, allowing tight integration between these two tools.

We also intend to offer this system to our clients as part of our larger confidential computing software stack. Our flagship product, Nectar, will have a large amount of data from our industrial clients that needs to be de-identified to be used internally in different departments of their companies and outside entities they are collaborating with.

Our interactions with potential clients are ongoing and promising. With our current commercialization partner, AllSpark Health, we have a targeted goal of 8 integration and enterprise customers by Quarter 1 of 2025 and projected revenue at $400,000; we will then triple that in the remainder of 2025. While ambitious, our customer and revenue targets are grounded in direct experience from our team and our commercialization partner, AllSpark. AllSpark is a computational insights and business strategy company dedicated to innovative data partnerships and redefining the boundaries of what is possible in healthcare. AllSpark’s founder left big pharma to pioneer sustainable, patient-centric collaborations for the mutual benefit of all parties in the healthcare ecosystem. Together, we are working on a commercialization plan that includes finding and securing data provider pilots in the small to medium-sized pharma space.

We are already in advanced discussions with five companies that are interested in using our confidential computing toolset: a multi-billion dollar global pharmaceutical company, one of the top five health systems in the nation, a global rare disease organization with a large multi-national patient pool, a biotechnology company valued at more than $1 billion that is pioneering new technologies to advance early cancer detection, and a blockchain-enabled drug development startup. We are in discussions to act as in-platform private computation and data-ecosystem network connections for these companies and to host their data for public computation.

Early conversations with potential customers have revealed that leveraging improved privacy-preserving tools such as a De-ID system is a top priority. A De-ID system is often the simplest way for our clients to prepare data for analysis and sharing with other entities, and we believe the De-ID system we build under this grant and in partnership with Filecoin will meet the demand.

Development Roadmap

Milestones 1-4 will be led by Phil Chevalier. Shaun Geer and Katherine Kuzmeskas will provide contextual information, and the rest of the team will provide technical and coding support.

Total Budget Requested

| Milestone # | Description | Deliverables | Completion Date | Funding |
| --- | --- | --- | --- | --- |
| 1 | Identification Module | LLM to remove direct and indirect personal identifying information | 3 months from submission | $17,000 |
| 2 | Anonymization Module | LLM integrated to provide synthetic data | 4 months from submission | $7,500 |
| 3 | Pruning Module | Deterministic module added to remove direct identifying data | 6 months from submission | $12,500 |
| 4 | Encryption Module | Encrypted and unencrypted data able to be placed on Filecoin | 7 months from submission | $7,500 |
| 5 | HIPAA Compliance Network and Documentation | Network partners and documentation needed for Filecoin users to meet Federal HIPAA requirements for data use | 7 months from submission | $5,500 |

Maintenance and Upgrade Plans

This project will become part of Nectar’s confidential computing stack, and we will increase its capacity as the company's tech stack grows. In general, though, we are designing and building it so that we do not expect it to need improvements once released.

https://www.tamarin.health/

Relevant Experience

CEO and founder Katherine Kuzmeskas is a 2x founder with nearly 15 years of experience in health data and infrastructure, including at Yale New Haven Health System, one of the nation’s largest hospital networks. She has overseen the design, development, and implementation of five different blockchain projects. She is a sought-after speaker on the intersection of blockchain, privacy-preserving technologies, and healthcare, having been invited to speak at conferences across the globe, including Stanford Medicine, BIO International Conference (14,000 biotech and pharma leaders), and Future Health Basel. In Fall 2023, she was selected as a Yale Ventures Entrepreneur In Residence, which connected her to the global Yale network and the biopharmaceutical industry. In January, Katherine was tapped by Coinbase to speak with Members of Congress in DC as a subject matter expert on real-world, impactful use cases of blockchain and privacy-preserving technologies.

Phil Chevalier, Tamarin’s Distributed Database Architect and lead engineer, is a distinguished software engineer with specialized expertise in distributed database architecture and blockchain technology. Holding a Master's degree in Computer Science from the University of Texas at Austin, Phil has demonstrated exceptional skill in developing secure, innovative solutions across various technology sectors. His work at Tamarin led to the creation of the first Node Package Manager designed for encrypting data across multi-party computation nodes, showcasing his proficiency in full-stack development and his commitment to advancing medical science through privacy-preserving data sharing. Previously, as a Blockchain Engineer at Webisoft, Phil contributed to high-impact projects, including smart-contract development for the Solana and Ethereum platforms, raising significant funding and enhancing the metaverse and NFT ecosystems.

In addition to the two core members of Tamarin, we have developed a strong support network for the development and deployment of our technologies. This network includes:

CryptoOracle Collective: A network of web3 professionals that is working with Tamarin to build and position Nectar within various web3 ecosystems. CryptoOracle Collective provides services in technical writing, business development, fundraising, tokenomics, design ethnography, other types of qualitative and quantitative analysis, and other areas.

Operational Support, Advisor: Shaun Geer, M.S., M.A., is from the CryptoOracle Collective; he provides operational support to Tamarin and coordinates support from CryptoOracle Collective. He worked in immunology and pathology research for over five years, holds one patent, and has co-authored multiple papers in that field. He has been in the blockchain space for the past two years, having worked with BanklessDAO, Saddle Finance, Threshold, Crypto Mondays, the Lifted Initiative, and many others. He provides operational support, managing labor from CryptoOracle, product work, technical writing, and business development.

Technical contractor: Brett Hemenway Falk, Ph.D., is head of the “Crypto and Society” Lab at the University of Pennsylvania, with more than 10 years of experience in cryptography. He has co-authored over 70 academic papers in cryptography, many dealing with real-world applications such as the one described in this document. He will be a consultant on this project, and he and his team at the University of Pennsylvania will design solutions and assist Phil in implementing them.

Technical advisor: Riad Wahby, Ph.D., is an Assistant Professor in ECE at CMU, focusing on systems, security, and applied cryptography. He is interested in questions such as “How can we build trustworthy chips at untrusted chip fabricators?” and “How do we secure operating systems against malicious peripherals?” His recent focus has been on probabilistic proof systems and cryptography.

This team is working towards the ultimate goal of being the infrastructure of modern health data, a DataOS for medicine, bioscience, and health apps. We intend for the health industry to use our system for all medical, scientific, and personal uses of health data. The de-identification of datasets is a part of this goal. Through our platform, patients or patient-approved medical systems will be able to access, add, modify, or delete data safely, following local and national laws and regulations. We intend our system to be a universal data infrastructure system that allows for the sharing and controlling of all data.

Team code repositories

https://github.com/NectarProtocol/mpc-crypto-lib

Additional Information