filecoin-project / devgrants

👟 Apply for a Filecoin devgrant. Help build the Filecoin ecosystem!

Decentralized Federated Learning on a Data Mesh of Content-Addressable Transformers (CATs) #1554

Closed JEJodesty closed 11 months ago

JEJodesty commented 1 year ago

Open Grant Proposal: Decentralized Federated Learning on a Data Mesh of Content-Addressable Transformers

Project Name: DFL on CATs: Decentralized Federated Learning (DFL) on Content-Addressable Transformers (CATs)

Proposal Category: Integrations

Entity Name: This proposal is on behalf of BlockScience, Inc.

Proposer: @JEJodesty

Filecoin ecosystem affiliations: BlockScience has been providing CryptoEconomic system design for the Filecoin ecosystem at large, Protocol Labs’ CryptoEconLab, and the Filecoin Foundation since prior to the network’s launch. These system designs include Filecoin’s Batch Balancer and the use of large-scale simulations to inform parameter decisions. Some of our most recent work includes the Rapid Report on FIP-0056 and CDM, the Core Protocol Review series of Workshops & Articles, the Retrieval Pinning modeling work, the Consensus Pledge Educational Calculator, and scoping High Impact Research Topics on Economics. Further in the past, BlockScience has been involved in designing, testing, and tuning several of Filecoin's economic primitives, and in performing technical review of the implications of past FIPs. More can be found in our Collaboration Book. BlockScience has also designed, implemented, and deployed the Anti-Sybil Operationalized Process (ASOP), a GitcoinDAO governance framework that increases the legitimacy of the Gitcoin funding process by removing and banning automated funding bots. This framework applies grey-box machine learning using Random Forest classification for fraud detection and was deployed via Terraform on Google Cloud Platform (GCP).

Technical Sponsor: Supported by David Aronchick (Protocol Labs) and Wes Floyd (Protocol Labs) @aronchick

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Summary

Big Data processing workloads on IPFS data utilizing Filecoin Storage are a critical need in the IPFS ecosystem. BlockScience aims to increase the value of Filecoin and of data on IPFS by contributing to Filecoin’s ecosystem through the continued design and implementation of Content-Addressable Transformers (CATs), which will leverage Bacalhau’s Compute over Data (CoD) architecture to enable a Data Mesh of Data Products for Decentralized Federated Learning (DFL).

CATs is a unified Data Product collaboration framework written in Python. It enables the establishment of a scalable, self-serviced, decentralized Data Mesh network of verifiable distributed computing workloads with data provenance, using interoperable distributed computing frameworks deployable on Kubernetes. CATs provides interfaces for the entire cloud service model (SaaS, PaaS, IaaS), can be deployed on managed Kubernetes services offered by centralized Web2 cloud providers, and uses IPFS as a decentralized Web3 network layer. CATs makes the execution of scalable distributed computing workloads for Big Data processing, with data provenance and Scientific Computing capabilities, portable between Web2 & Web3 infrastructure with minimal rework or modification.
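As a rough illustration of the content-addressing idea underpinning CATs (function names and the sha256 digests standing in for IPFS CIDs are assumptions, not CATs’ actual API), a transform’s identity can be derived deterministically from its code and its input, which is what makes workloads verifiable and their provenance traceable:

```python
# Hypothetical sketch of content addressing: a transform's address is
# derived from digests of its code and input, so identical (code, input)
# pairs always resolve to the same address. Real CATs use IPFS CIDs;
# sha256 hex digests are a simplifying stand-in here.
import hashlib
import json

def content_address(obj) -> str:
    """Deterministic digest of any JSON-serializable object."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def transform_address(fn_source: str, input_address: str) -> str:
    """Address of a transform = digest of (code digest, input address)."""
    return content_address({"code": content_address(fn_source),
                            "input": input_address})
```

Because the address is a pure function of content, any peer can recompute it to verify that a claimed (code, input) pair produced a given result.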

CATs empower effective cross-domain collaboration on products between cross-functional & multi-disciplinary teams and organizations by reducing the operational overhead of adding new data sources. They enable this by decentralizing and distributing responsibility to those within bounded domains to support continuous change and scalability.

A horizontal DFL technique will be deployed on a Data Mesh of CATs’ Data Products as a Federated Average (FedAvg) of a Linear Regression machine learning model applied to the MNIST dataset for classifying images of handwritten digits. This will address concerns and regulations about the privacy of user-generated text and image data containing sensitive and/or personal information. Because CATs’ Data Mesh is a decentralized, self-orchestrated peer-to-peer network, a DFL approach is appropriate: unlike Centralized Federated Learning (CFL), there is no need to share data with a central server. DFL will increase security by preventing single points of failure, because model updates are exchanged exclusively between interconnected nodes without central server orchestration.

Impact

This is a proposal for a second iteration of CATs' research proof-of-concept (POC) as a reference implementation for the IPFS, Filecoin, and CATs communities. Funding this proposal will generate collaborative opportunities on CATs to increase demand for Storage Deals issued to IPFS Storage Providers that receive Filecoin for storing more data (Filecoin Storage) from Big Data processing requests served by CATs’ Data Products. These Data Products will incentivize the usage of Filecoin Storage due to the value of Scientific Computing and Big Data capabilities, as well as Decentralized Federated Learning use-cases, by increasing the amount of Filecoin potentially needed for Filecoin Storage service requests. They will also increase the value of Filecoin and data on IPFS by establishing a data-driven economy on a Data (Service) Mesh of CATs that incentivizes the conversion of organizations' operational (OpEx) budgets into Filecoin for the initialization and maintenance of CATs’ Data Products using Filecoin Storage.

Project Description

Given BlockScience’s expertise in Systems Engineering and Scientific Computing within the context of data-driven economic systems and machine learning, we are currently designing and implementing Content-Addressable Transformers (CATs) to enable Data Product implementations on a Data Mesh that incentivize the storage and processing of more data on IPFS using Filecoin Storage. CATs’ design will be implemented to fulfill a Decentralized Federated Learning Data Product use-case.

Data Products are implemented as compute node peers on a Data Mesh network that encapsulate code, data, metadata, and infrastructure to function as a service providing access to a business domain's analytical data as a product. Data Products use the Architectural Quantum domain-driven design principle for peer nodes that represent the “smallest unit of architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function” (“Data Mesh Principles and Logical Architecture” - Zhamak Dehghani, et al.).
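To make the Architectural Quantum idea concrete, a Data Product can be pictured as one deployable record bundling references to its code, data, metadata, and infrastructure. This is a hypothetical sketch; the field names and `lineage` helper are illustrative assumptions, not CATs’ actual schema:

```python
# Hypothetical sketch of a Data Product as an Architectural Quantum:
# the smallest independently deployable unit, bundling everything its
# function requires. Field names are illustrative, not CATs' schema.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataProduct:
    domain: str                 # bounded business domain that owns it
    code_cid: str               # IPFS CID of the transform code
    data_cid: str               # IPFS CID of the served analytical data
    infra_spec_cid: str = ""    # CID of the IaC (e.g. Terraform CDK) spec
    metadata: dict = field(default_factory=dict)

    def lineage(self) -> dict:
        """Provenance record linking served data back to code + input."""
        return {"domain": self.domain,
                "code": self.code_cid,
                "input": self.data_cid}
```

Keeping all four concerns in one unit is what lets a single domain team deploy and evolve its product independently of other teams on the mesh.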

BlockScience wants CATs Data Products deployable on decentralized architecture provided by Bacalhau Node such that Data Mesh architecture can be implemented on incentivized Web3 networks. Bacalhau Node will act as CATs’ integration point between Web2 and Web3 workloads by providing a Big Data job submission interface as well as ingress, egress, and processing options for Big Data workloads on IPFS data. In this way, CATs will increase the value of data on IPFS and Filecoin by facilitating the migration of more data to Filecoin Storage with the introduction of distributed computing workloads for Big Data processing with data-locality and Scientific Computing capabilities to the IPFS network. This is achievable via the integration of the CATs and Bacalhau ecosystems by deploying verifiable, scalable, and distributed CATs executors on and alongside Bacalhau Node such that Bacalhau and CATs share the same execution paradigm (Kubernetes).

CATs facilitate access to Web3 decentralized systems by Web2 federated commercial systems, the transfer of knowledge between these systems, and the maintenance of provenance over these transfers. Federated Learning (FL) architecture is an example of such knowledge transfer from centralized to decentralized systems for CATs’ Data Mesh. Decentralized Federated Learning (DFL) is a machine learning technique that enables a large number of clients (e.g., personal devices or organizations) to collaboratively train a shared machine learning model by decentrally coordinating the training of their local models on their local data, such that data remains completely private from other clients and model update aggregation occurs on the clients instead of a central server. Decentralized Federated Averaging is the DFL technique that will be deployed on CATs’ Data Mesh; it enables the exchange of model parameter updates and performs model aggregation according to a specific algorithm over a given network topology until the model reaches satisfactory performance.

The design and implementation of CATs will include the deployment of Ray as CATs’ unified compute framework for distributed / parallelized data processing, with access to other Ray data platform ecosystem integrations such as Apache Spark & Dask. Ray will be deployed as a middleware layer on top of Bacalhau Node, which will be used as the peer-to-peer network layer of CATs. A DFL FedAvg of a Linear Regression machine learning model will be applied to classify images of handwritten digits in the MNIST dataset and deployed on a Data Mesh of CATs’ Data Products using PyTorch and Ray Train. The machine learning operations for this deployment will be implemented as CATs’ Data Products’ processes for each Data Mesh peer.
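The decentralized FedAvg step described above can be sketched in a few lines. This is a pure-Python stand-in for the planned PyTorch / Ray Train implementation, under stated assumptions: the peer names, neighbour topology, and toy one-feature linear model are all illustrative:

```python
# Decentralized FedAvg sketch: each peer trains on its private data,
# then replaces its weights with the mean of its own and its
# neighbours' weights -- no central server involved. Toy linear model;
# the real system would use PyTorch models coordinated via Ray Train.

def local_step(weights, X, y, lr=0.01):
    """One MSE gradient step of linear regression on a peer's private data."""
    n = len(X)
    preds = [sum(w * x for w, x in zip(weights, xs)) for xs in X]
    grads = [
        (2 / n) * sum((p - t) * xs[j] for p, t, xs in zip(preds, y, X))
        for j in range(len(weights))
    ]
    return [w - lr * g for w, g in zip(weights, grads)]

def decentralized_fedavg(peer_weights, topology):
    """Each peer averages its weights with its neighbours' (gossip round)."""
    new = {}
    for peer, neighbours in topology.items():
        group = [peer] + list(neighbours)
        new[peer] = [
            sum(peer_weights[p][j] for p in group) / len(group)
            for j in range(len(peer_weights[peer]))
        ]
    return new
```

Repeating local steps and gossip rounds drives all peers toward a shared model while raw data never leaves its owner, which is the privacy property the proposal relies on.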

Vision

The vision of CATs’ Data Products participating on a Filecoin incentivized Data Mesh is to decentralize overall product ownership of business domain knowledge represented by a Data Model into individual Data Model entities served by Data Products with their own life-cycles. This is a customer-centric approach to overall project implementation life-cycles with nested Data Product life-cycles that have tighter feedback loops (a.k.a. Agility).

The operation and maintenance of Data Products on a Data Mesh will occur between independent multidisciplinary teams within multiple organizations across business domains, using Filecoin for the operation and maintenance of CATs’ Data Products. These teams will operate, contribute to, and maintain different portions of the entire cloud service model (SaaS, PaaS, IaaS) in ways suitable for their roles, using the CATs’ API to serve individual Data Model entities on a Data Mesh for a variety of use-cases. These use-cases, in which clients and devices (IoT, smartphones) with their own local datasets collaboratively train machine learning models, have a wide range of applications: fraud detection in finance, predicting disease diagnosis or treatment outcomes for hospitals, product recommendation in e-commerce, predicting road conditions or traffic patterns for autonomous vehicles, predicting crop yield or disease outbreaks in agriculture, etc.

Goals

Big Data processing of IPFS data with Scientific Computing capabilities will increase demand for Storage Deals issued to IPFS Storage Providers (Filecoin Storage) to store Big Data processed by CATs, and will incentivize the allocation of organizations' operational budgets into Filecoin for the operation and maintenance of CATs’ Data Products. The Bacalhau documentation and GitHub wiki promote the vision of scalable distributed computing for Big Data processing workloads with data-locality and Scientific Computing capabilities on IPFS data via Bacalhau.

The primary goal of this proposed research is to fulfill a trilateral technical use-case benefiting the Filecoin, Bacalhau, and CATs ecosystems by cross-integrating Bacalhau and CATs’ compute node peers. This cross-integration will enable the portability of Big Data processing with Scientific Computing capabilities between Web2 and Web3 infrastructure with minimal rework or modification, and will incentivize the use of Filecoin Storage via Bacalhau for the operation and maintenance of CATs’ Data Products. This goal is achievable by incorporating the Bacalhau ecosystem into CATs’ Architectural Quantum. The secondary goal of this proposed research is to fulfill a Scientific Computing use-case that incentivizes the increase in value of Filecoin and data on IPFS by classifying images of handwritten digits in the MNIST dataset. A DFL FedAvg of a Linear Regression machine learning model applied to the MNIST dataset will be deployed on a Data Mesh of CATs’ Data Products to accomplish this goal.

The primary and secondary goals will be fulfilled by CATs’ ability to process large structured and unstructured data on IPFS and to provide CPU- and GPU-accelerated Scientific Computing capabilities to the Bacalhau ecosystem, using Ray as middleware on top of Bacalhau Node as CATs’ execution paradigm. Ray also provides access to other distributed computing framework integrations such as Apache Spark and Dask. Python, Pandas DataFrames, and ANSI SQL will form a unified interface for tabular data processing via Modin. In this way, existing cloud-based Big Data processing implementations built on these technologies have a simple migration path to Bacalhau.
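The migration path rests on the fact that Modin exposes the pandas API: existing code keeps its DataFrame semantics and is parallelized by swapping the import (`import modin.pandas as pd`, assuming Modin and a Ray cluster are available). A minimal sketch, in plain pandas so it runs without a cluster; the column names and aggregation are illustrative:

```python
# Plain pandas here; under Modin the only change would be the import
# line (`import modin.pandas as pd`), with Ray parallelizing execution.
import pandas as pd

# Illustrative tabular workload: bytes stored per provider.
df = pd.DataFrame({"provider": ["a", "b", "a"],
                   "bytes":    [10, 20, 30]})
stored = df.groupby("provider")["bytes"].sum()
```

Because the code is unchanged apart from the import, teams can develop against local pandas and deploy the same logic onto a Ray cluster running on Bacalhau.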

The CATs framework will have access to Web2 interoperable distributed computing frameworks via Ray deployed on incentivized Web3 infrastructure such as Bacalhau for Big Data processing using Filecoin Storage. This enables IPFS to serve as an incentivized peer-to-peer network layer for a decentralized Data Mesh of CATs Data Products that establish a data-driven economy of services using IPFS and Filecoin. This cross-integration will surface the CATs features described in the Outcomes and Development Roadmap sections below.

Outcomes

Milestone 0: This deliverable involves the design and development of the Structure software component of CATs’ input (Order)

  1. Publish the following to a GitHub repository:
    • Software design of CATs’ Architectural Quantum
    • CATs’ Structure software component implementation of Order input; The Order will contain IPFS CIDs of the following Structure components:
      • Structure (PaaS IaC API): Software component of the platform on which the Function executes (Kubernetes)
        • InfraStructure (IaaS): Software sub-component of IaaS implemented using Terraform Python CDK
        • Plant (SaaS): Software sub-component implemented for CAT deployment and configuration of interoperable distributed / parallelized computing frameworks on a computing cluster on InfraStructure (IaaS) using Ray with access to platform integrations such as Apache Spark, Dask, etc.
  2. Continuous Integration (CI) test(s) of CATs’ Structure software component with PyTest & GitHub Actions with a GitHub test coverage report
  3. Report progress via changelog and the completion of GitHub Project issues

Milestone 1: This deliverable involves the design and development of CATs’ input (Order), as well as the design and implementation of the Decentralized Federated Learning (DFL) solution employing a FedAvg of a Linear Regression machine learning model that classifies images of handwritten digits in the MNIST dataset using PyTorch and Ray Train. This DFL solution will be deployed on a Data Mesh of CATs’ Data Products.

  1. Publish the following to a GitHub repository:
    • Software design of CATs’ Architectural Quantum
    • CATs’ Function software component implementation of Order input. The Order will contain IPFS CIDs of the following Function components:
      • Function (FaaS API): Software component of the operational procedure executed on the Structural framework
      • InfraFunction (FaaS): Software sub-component of CATs’ interface for configuring and executing Processes (SaaS) on Structure (PaaS) via distributed computing frameworks [or Plant (SaaS)]
      • Process (FaaS): Software sub-component for data processing / computation performed by distributed computing frameworks deployed on platform infrastructure [or the Plant] using DataFrames, SQL, Python, etc.
  2. Continuous Integration (CI) test(s) of CATs’ Function software component with PyTest & GitHub Actions with a GitHub test coverage report
  3. Report progress via a changelog and the completion of GitHub Project issues
  4. Classification performance report of Federated Averaged Linear Regression machine learning model using Accuracy and Precision
  5. Documentation: Provide installation and execution examples with descriptions
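The Milestone 1 performance report hinges on two metrics, which can be sketched directly. Pure Python; the real report would compute these from the federated model’s predictions on the MNIST test set:

```python
# Accuracy: fraction of all predictions that match the true label.
# Precision (per class): fraction of predictions of that class that
# were correct. Label values are illustrative.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive):
    true_for_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_for_predicted:
        return 0.0
    return sum(t == positive for t in true_for_predicted) / len(true_for_predicted)
```

For a 10-class problem like MNIST, precision would typically be reported per digit (or macro-averaged across digits) alongside overall accuracy.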

Milestone 2: This deliverable involves the deployment of CATs’ scalable REST API (Service Node) using Ray Serve, with endpoints that process CATs’ input and output HTTP requests.

  1. Update the GitHub repository with the following:
    • CATs’ REST API (Service Node) implementation
  2. Continuous Integration (CI) test(s) of DFL on CATs’ Data Mesh component with PyTest & GitHub Actions with a GitHub test coverage report
  3. Documentation: Provide installation and execution examples with descriptions
  4. Presentation with demo
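The Service Node’s REST surface can be sketched with the standard library as a stand-in for Ray Serve; the `/cat` endpoint name and the Order/Invoice payload shapes are illustrative assumptions, not the planned API:

```python
# Stdlib stand-in for the Ray Serve Service Node: accepts an Order as
# JSON via POST and returns an Invoice. `process_order` is a
# placeholder for dispatching the Order to CATs' compute layer.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def process_order(order: dict) -> dict:
    """Placeholder: real CATs would execute the Order and return an
    Invoice of output CIDs."""
    return {"invoice": {"order_cid": order.get("cid", ""), "status": "ok"}}

class ServiceNode(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/cat":          # single illustrative endpoint
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        order = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(process_order(order)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), ServiceNode).serve_forever()
```

Under Ray Serve, the handler would instead be a `@serve.deployment` class, giving the same endpoint horizontal scalability across the cluster.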

Adoption, Reach, and Growth Strategies

CATs’ Data Product teams will be Web2 data professionals who want to leverage Data Mesh with access to a Web3 network layer. The operation and maintenance of these Data Products on a Data Mesh will occur between independent multidisciplinary teams within multiple organizations across business domains, using Filecoin Storage as CATs’ input and output. These teams will operate, contribute to, and maintain different portions of the entire cloud service model (SaaS, PaaS, IaaS) in ways suitable for their roles, using the CATs’ API to serve individual Data Model entities on a Data Mesh for a variety of use-cases. CATs’ Data Product teams can be multidisciplinary because they can operate and maintain the different portions of the entire Web2 cloud service model based on role.

Decentralized Federated Learning on a Data Mesh of Data Products carries risks, which BlockScience will address by leveraging the open-source community and a CATs working group.

Development Roadmap

Milestone 0: [$42,000 | 1 Research Engineer | Work Period (Assuming the receipt of funding before this period): July 3rd - 31st]

Design and develop the Structure software component of CATs’ Order input, and use CoD and Ray deployments on Kubernetes for integration tests that output CATs’ Invoice. This Milestone will provide the Structure software component of the Order to serve as the Kubernetes execution paradigm for the Function software component of Milestone 1.

Milestone 1: [$58,000 | 1 Research Engineer | Work Period (Assuming the receipt of funding before this period): Aug. 1st - Sep. 11th]

Design and develop the Function software component of CATs’ Order input, using the integration test from Milestone 0 to execute on the Structure component completed in Milestone 0. Also design and implement the Decentralized Federated Learning solution employing a FedAvg of a Linear Regression machine learning model that classifies images of handwritten digits in the MNIST dataset using PyTorch and Ray Train. This DFL solution will be deployed on a Data Mesh of CATs’ Data Products.

Milestone 2: [$21,000 | 1 Research Engineer | Work Period (Assuming the receipt of funding before this period): Sep. 12th - Sep 25th]

Deploy CATs’ Service Node as a scalable REST API using Ray Serve, with endpoints that process CATs’ input and output HTTP requests.

Total Budget Requested

$121,000

Maintenance and Upgrade Plans

CATs is a research proof-of-concept (POC) that will be open-sourced on GitHub as a reference implementation of a Data Mesh framework. BlockScience will guide collaborators interested in contributing to CATs by establishing a community around open-source CATs implementations based on the reference implementation, organized through open-source sprints and a CATs working group.

Team

Team Members

Team Member LinkedIn Profiles

https://www.linkedin.com/in/jejodesty/
https://www.linkedin.com/in/david-f-sisson/
https://www.linkedin.com/in/mczargham/

Team Website

https://block.science/team

Relevant Experience

Joshua Jodesty: Joshua Jodesty has been a Data & Software Engineer conducting Machine Learning research in his professional and academic career and enjoys applying research using emerging technologies to design and implement scalable distributed systems on cloud computing platforms to solve complex problems with cross-disciplinary teams.

More recently, Joshua has implemented and open-sourced cadCAD (complex-adaptive dynamics Computer Aided Design), a unified modeling framework in Python for digital-twin and SocioTechnical system design and implementation used to improve the decentralized web. cadCAD is used for dynamic systems model design & encoding and has been deployed on Kubernetes as a scalable, parallelized, concurrent, and distributed stochastic simulations solution that accommodates grey box machine learning integrations and user behavior analysis. Systems Engineers, Data Scientists, and Economic Researchers use it to iteratively design, enhance, and adapt grey box system models of decentralized networks and user behavior on networks throughout product life-cycles of DeFi services, Blockchains, DAOs, and other complex systems. Joshua also adapted cadCAD for a scalable message simulation and publishing framework inspired by AWS' IoT Device Simulator using Apache (Py)Spark, Kafka Producer, JupyterHub, and AWS to enable the verification of real-time data processing & message throughput bench-marking to optimize Kafka cluster configurations for throughput spikes.

Prior to this, Joshua collaborated with the Director of Data Science (Michael Zargham), the VP of Strategic Technology (David Sisson), and the Data Science & Engineering teams to implement a Big Data pipeline with a custom Predictor-Corrector Machine Learning Ensemble model and evaluation products to forecast viewership in AdTech using Apache Spark, Databricks, and AWS. He also ensured the Director of Data Science's ability to conduct scalable rapid prototyping with Spark by creating and managing Spark Docker deployments. Joshua also contributed to a distributed data pipeline framework written in Scala for multi-disciplinary collaboration on scalable machine learning-enabled Big Data pipelines using Apache Spark. This framework was used as a Supply Chain management tool to minimize inventory risk for pharmaceutical manufacturers. He also conducted award-winning Machine Learning research enabling universities to forecast student performance in online courses (MOOCs) and learning management systems.

David Sisson: David Sisson has extensive experience designing and developing commercial software systems. Most recently he built and led Data Engineering and Data Science at an AdTech firm during their transition from a service-oriented business to a platform provider. He has been a full product life cycle developer, and he has built the teams necessary to scale these processes. He has experience building automated, data-driven, general business systems and data-acquisition/analysis applications in scientific and engineering settings. His tech career is informed by a Ph.D. in Neurophysiology and nine years of research experience in academic and pharmaceutical settings, with a focus on using computational neuroscience to bridge the micro (molecular and cellular) and macro (behavioral) levels.

Michael Zargham: Michael Zargham is the founder and CEO of BlockScience. Dr. Zargham holds a Ph.D. in systems engineering from the University of Pennsylvania, where he studied optimization and control of decentralized networks. His earliest work on peer-to-peer effects in business decision making was developing algorithms to reverse engineer the word-of-mouth effect in enterprise software licensing decisions in 2005. In the intervening years, Dr. Zargham has designed data-driven decision systems and built a data science team for a Media Technology firm, and has worked on the mathematical specifications of blockchain-enabled software systems with a focus on observability and controllability of the information state of the networks.

Team code repositories

CATs POC: https://github.com/BlockScience/cats

Additional Information

We learned about the Open Grants Program from David Aronchick (Protocol Labs) and Wes Floyd (Protocol Labs)

Points of Contact: david@block.science joshua@block.science nick@block.science

About BlockScience: BlockScience ® is a complex systems engineering, R&D, and analytics firm. Our goal is to combine academic-grade research with advanced mathematical and computational engineering to design safe and robust socio-technical systems. With deep expertise in Blockchain, Token Engineering, AI, Data Science, and Operations Research, we provide engineering, design, and analytics services to a wide range of clients, including for-profit, non-profit, academic, and government organizations.

Our team is interdisciplinary, with engineering teams operating in concert with social science and governance research teams of ethnographers and economists. This enables us to provide more holistic system requirements gathering, design, and data-driven decision making infrastructure.

We use a hybrid of artificial intelligence and computational social science to develop, teach, and apply best practices in economic systems engineering. Our teams work closely in partner engagements to build internal capacity, educate, and co-create. Our research and development occurs in iterative cycles between open-source science and applied science in client-based consulting projects.

BlockScience specializes in data science and computational methods focusing on algorithm design problems with complex human behavior implications. Our work includes pre-launch design and evaluation of economic and governance mechanisms based on research, simulation, and analysis. We also provide post-launch monitoring and maintenance via reporting, analytics, and decision support software. With our unique blend of engineering, data science, and social science expertise, the BlockScience team aims to diagnose and solve today’s most challenging frontier socio-technical problems.

ErinOCon commented 11 months ago

Hi @JEJodesty, thank you for your proposal and for your patience with our review process. Unfortunately, we will not be moving forward with a grant at this time.

Wishing you all the best as you continue to build!