filecoin-project / devgrants


Application from NGPU #1796


ThornbirdZhang commented 2 months ago

Open Grant Proposal: NGPU -- AI DePIN

Project Name: NGPU

Proposal Category: Integrations

Individual or Entity Name: Metadata Labs Inc.

Proposer: Alain Garner

Project Repo(s): https://github.com/NGPU-Community/ngpu-cli, https://github.com/NGPU-Community/ngpu-business-backend, https://github.com/NGPU-Community/ngpu-contract, and https://github.com/NGPU-Community/ngpu-node-client, among others.

(Optional) Filecoin ecosystem affiliations: None

(Optional) Technical Sponsor: None

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Summary

NGPU stands for Next-GPU. With the rise of the AI wave, a multitude of applications have emerged. However, the cost of GPU-powered AI inference has skyrocketed, and these resources are largely monopolized by big companies, significantly hindering the equitable development of applications, especially for creative startup teams. Meanwhile, many GPU resources sit idle.

NGPU, as a decentralized GPU computing network, is dedicated to providing accessible GPU nodes without any entry barriers, offering cost-effective, user-friendly, and stable GPU computing resources to various AI applications. This enables enterprise-level AI inference services while also offering idle GPU nodes an opportunity to generate income, making full use of every resource.

Impact

AI inference relies on large-scale models and massive datasets, with files often reaching tens or even hundreds of gigabytes. Ensuring stable and reliable storage and management of these vast datasets has become one of the major challenges in the AI DePIN industry. Traditional Web2 storage solutions often face issues such as data tampering, high costs, and access delays when handling data at this scale. These problems not only reduce the efficiency of AI inference but can also lead to data loss or corruption, posing significant risks to the stability and reliability of the entire system.

To address this challenge, we are exploring more reasonable storage solutions to support the application of large models and big data in AI inference. After thorough research, the unique technologies and services of IPFS and Filecoin have captured our attention.

We will store all data from the machine learning lifecycle on IPFS to ensure there is an audit trail, and we will back the data up onto Filecoin to make it more permanent. The datasets tagged to the machine learning lifecycle, including their features, will be stored as well, so that models deployed in production can be audited for fairness further down the road.
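As a minimal sketch of this audit-trail flow, assuming a local Kubo (go-ipfs) daemon on its default RPC port and hypothetical file paths, each lifecycle artifact could be added and pinned to IPFS and its returned CID recorded; the resulting CID map is what would later be backed up to Filecoin:

```python
import json
import requests  # assumes a local Kubo (go-ipfs) daemon exposing its HTTP RPC API

IPFS_API = "http://127.0.0.1:5001/api/v0"  # default Kubo RPC endpoint

def add_to_ipfs(path: str) -> str:
    """Upload one ML-lifecycle artifact (dataset, checkpoint, config) to IPFS
    and return its CID, which serves as the tamper-evident audit-trail entry."""
    with open(path, "rb") as f:
        resp = requests.post(f"{IPFS_API}/add", files={"file": f}, params={"pin": "true"})
    resp.raise_for_status()
    return resp.json()["Hash"]

# Hypothetical lifecycle stages and paths; record one CID per stage so that
# models deployed in production can be traced back to their exact inputs.
audit_trail = {stage: add_to_ipfs(p) for stage, p in {
    "dataset": "data/train.parquet",
    "features": "data/features.json",
    "model": "models/model-v1.bin",
}.items()}
print(json.dumps(audit_trail, indent=2))
```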

Outcomes

As outlined above, NGPU is dedicated to providing accessible GPU nodes without any entry barriers, offering cost-effective, user-friendly, and stable GPU computing power to various AI applications and thereby achieving enterprise-level AI inference services. At the same time, it gives idle GPU nodes an opportunity to earn income, fully utilizing every bit of resource.

NGPU's main functions comprise three parts: intelligent allocation, pay-per-task GPU usage billing, and efficient HARQ network transmission.

Compared to other DePIN projects, which simply rent out separate GPU computing nodes, these innovative technologies enable real-time perception of AI application load pressure, dynamic adjustment of resource usage, and auto-scaling, security, high reliability, and low latency on GPU nodes with unrestricted access.

During the development of NGPU, we encountered the following major issues and developed our own technologies to address them.

  1. Instability of Decentralized Computing Nodes: Compared to the reliable service quality of high-grade IDCs built by big companies, permissionlessly joined GPU nodes may be personal gaming computers that are likely to go offline at random. To address this, NGPU developed the Smart Allocation framework, which monitors the status of each GPU node in real time and configures redundant nodes alongside the working nodes so that it can switch over when a working node goes offline; it also includes an incentive and staking mechanism that encourages nodes to stay online.

  2. Various Specifications of Computing Node Networks and Hardware: Facing various specifications of GPU computing nodes, NGPU measures the AI inference capability of the nodes and combines it with the measurement of storage and network bandwidth to form a unified computing power value. This standardizes the node computing power, providing a basis for Allocation and incentives. Additionally, NGPU utilizes HARQ (Hybrid Automatic Repeat reQuest) to optimize the efficiency of public network transmission, achieving over a 100-fold speed improvement under strong network noise, compensating for the network deficiencies of decentralized computing nodes.

  3. Significant Daily Fluctuations in AI Application Loads: Various AI applications, especially in the Web3 field, face pronounced load peaks and valleys. Typically, GPU nodes are rented based on peak load to guarantee service stability. NGPU instead bills based on the actual usage of (GPU computing power * duration) through smart allocation (see the billing sketch after this list), ensuring that every penny spent by the AI application provider goes toward its actual business needs. This improves utilization of the relatively low-cost decentralized GPU power and significantly reduces GPU computing costs, achieving fair access to GPU computing power.
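To make the pay-per-task model concrete, here is a minimal billing sketch. The `TaskUsage` fields and the unit price are illustrative assumptions, not NGPU's production metering; the idea is simply that cost tracks (standardized computing power * measured duration) rather than a whole rented node at peak capacity:

```python
from dataclasses import dataclass

@dataclass
class TaskUsage:
    power_units: float  # node's standardized computing-power value
    duration_s: float   # measured wall-clock duration of the inference task
    unit_price: float   # price per power-unit-second (hypothetical)

def task_cost(u: TaskUsage) -> float:
    """Pay-per-task billing: charge only for actual (GPU power * duration),
    not for an entire node rented against peak load."""
    return u.power_units * u.duration_s * u.unit_price

# Example: one inference task on a node rated 1.5 power units, running 12.4 s.
print(f"{task_cost(TaskUsage(1.5, 12.4, 0.0004)):.6f}")
```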

Adoption, Reach, and Growth Strategies

NGPU's most valued customers are (1) individual developers and small B-end development teams who need AI computing power, and (2) individuals and organizations who own computing power.

As AI continues to develop, the demand for compute power in AI inference will increase significantly. However, most individual developers and small B-end development teams face the challenges of being unable to accurately estimate compute consumption and having to develop numerous non-core business modules when creating AI-based products. Building a decentralized elastic computing power network and offering a wide range of open-source model functionalities along with various SDKs can effectively address these pain points.

We will build a decentralized GPU network with pooled computing power to provide developers and projects requiring AI computing resources with low-cost and easily accessible power. At the same time, we will integrate Filecoin technology to achieve true decentralization for AI.

We have already built a decentralized GPU network and now need to deploy it on Filecoin for storage and distribution of incentive-system data. In terms of go-to-market strategy, we will build a business development team to reach out to those in need of computing power. Additionally, we are collaborating with major mining hardware manufacturers and computing power providers to secure a substantial amount of GPU resources. Profit is generated by charging individual developers and small B-end development teams for compute usage and SDK fees.

On the other hand, NGPU's competitors include decentralized computing cloud projects such as IO.net and Akash. The key difference between NGPU and these platforms is that IO.net and Akash require users to rent entire computing resources, while NGPU provides computing power at the AI task level. This helps address the "compute anxiety" faced by users who struggle to accurately estimate their compute needs. Additionally, NGPU offers various open-source AI agent interfaces to compute power users, transforming the decentralized platform from a traditional compute rental service into a computing power service provider.

Development Roadmap

The integration of NGPU with Filecoin will proceed in three stages.

Stage 1: Migrating NGPU storage to IPFS

  1. Infrastructure Changes: The infrastructure must be adapted to integrate IPFS as the primary storage backend. This involves setting up IPFS nodes to ensure reliable and efficient storage, retrieval, and management of data. The team will configure IPFS gateways and APIs to facilitate seamless data interaction between the NGPU network and IPFS. This step includes ensuring that IPFS nodes are optimized for performance, have sufficient redundancy, and are correctly scaled to handle the anticipated data volume. The team will also establish monitoring tools to oversee IPFS storage health, data integrity, and performance.

  2. Client Modifications: The client software must be modified to interact with IPFS instead of traditional node-based storage. This includes updating the data upload, retrieval, and management functions to work with IPFS protocols such as content addressing and peer-to-peer data exchange. The client will need enhancements to handle IPFS CIDs, ensuring that data references remain consistent across the system (see the retrieval sketch after this list). Additionally, security measures will be updated to ensure that data permissions and access controls are maintained within the IPFS environment. These changes will require rigorous testing to confirm that data handling remains efficient and error-free.

  3. AI Container Adaptations: AI containers that perform training and inference tasks must be updated to fetch and store data directly from IPFS. This involves modifying the data ingestion pipelines within the AI workflows to interact with IPFS nodes seamlessly. The containers will be reconfigured for compatibility with IPFS's content-addressable storage model, allowing efficient and secure data retrieval during AI computations. The team will also optimize data access patterns to reduce latency and improve overall system performance, especially during high-load scenarios where rapid access to large datasets is critical.

  4. Team Allocation and Timeline: To complete these modifications, the team of four, comprising infrastructure engineers, software developers, and AI specialists, will work in parallel across the different components. Each team member will own a specific area: one infrastructure integration, one client software changes, one AI container adjustments, and one testing and quality assurance. The one-month timeline will be structured into phases: initial planning, development, integration, testing, and deployment. Regular syncs will ensure that all components are compatible and that the migration to IPFS is smooth and successful.
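As a sketch of the client-side switch to content addressing, the snippet below streams an object back from IPFS by CID via the Kubo HTTP RPC API. It assumes a local Kubo daemon on the default port; the function name and paths are illustrative, not NGPU's actual client code:

```python
import requests  # assumes a local Kubo (go-ipfs) daemon exposing its HTTP RPC API

IPFS_API = "http://127.0.0.1:5001/api/v0"  # default Kubo RPC endpoint

def fetch_by_cid(cid: str, out_path: str) -> None:
    """Content-addressed retrieval: the client stores CIDs in its metadata
    and streams the referenced bytes back from IPFS on demand."""
    with requests.post(f"{IPFS_API}/cat", params={"arg": cid}, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
```

The same pattern applies inside the AI containers' ingestion pipelines (item 3), which would fetch datasets and model weights by CID rather than by node-local path.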

Stage 2: Validating Filecoin as the backup layer

  1. Multi-backup Strategy: To test Filecoin's reliability, data will be stored redundantly across multiple nodes within the Filecoin network. This setup allows real-world testing of Filecoin's storage capabilities, including data retrieval speeds, fault tolerance, and the integrity of stored data.

  2. Performance and Reliability Testing: Throughout the migration, continuous monitoring and performance testing will assess Filecoin's transmission speeds and storage consistency. This evaluation will include stress tests under various network conditions to ensure Filecoin meets the NGPU computing network's performance requirements (a minimal retrieval probe is sketched after this list).

  3. Progressive Data Allocation: The migration will be managed to incrementally increase the amount of data stored on Filecoin, with a target of 33%. This phased approach allows continuous validation and adjustment, ensuring that any challenges are addressed without compromising data accessibility. Through this strategy, the NGPU computing network aims to leverage Filecoin's decentralized storage advantages while ensuring data reliability and performance, laying the groundwork for broader future adoption.
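A minimal version of such a retrieval probe is shown below. It assumes the backed-up objects remain addressable by CID through public IPFS gateways; the gateway list and timeout are illustrative, and a production harness would additionally exercise direct Filecoin retrieval deals and sustained stress loads:

```python
import hashlib
import time
import requests

GATEWAYS = ["https://ipfs.io", "https://dweb.link"]  # public gateways used as probes

def probe(cid: str) -> None:
    """Measure retrieval latency per gateway and confirm all copies hash identically."""
    digests = set()
    for gw in GATEWAYS:
        t0 = time.monotonic()
        resp = requests.get(f"{gw}/ipfs/{cid}", timeout=60)
        resp.raise_for_status()
        elapsed = time.monotonic() - t0
        digests.add(hashlib.sha256(resp.content).hexdigest())
        print(f"{gw}: {elapsed:.2f}s, {len(resp.content)} bytes")
    assert len(digests) == 1, "retrieved copies differ"
```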

Stage 3: Optimizing the hybrid storage strategy

  1. Cost Optimization: Storage costs across different nodes and Filecoin will be analyzed to identify the most cost-effective configuration. By comparing the price per gigabyte and associated transaction costs, the system can dynamically allocate data to the lowest-cost options while maintaining the required performance standards.

  2. Reliability Tuning: The reliability of each node and of Filecoin's network will be assessed by tracking historical uptime, error rates, and successful data retrievals. Data redundancy will be strategically applied to less reliable nodes, whereas more reliable nodes will handle critical or sensitive information, reducing the overall risk of data loss.

  3. Dynamic Storage Allocation: The storage system will be adjusted dynamically based on real-time performance and cost analytics. The system will implement automated policies that distribute data between Filecoin and node storage based on current network conditions, ensuring an optimal balance between speed, cost, and reliability (a toy allocation policy is sketched after this list).

  4. Routine Use and Scaling: As the optimized strategy proves effective, the NGPU network will integrate Filecoin into its routine operations, scaling up its usage as part of a hybrid storage model. This approach will ensure a scalable and flexible storage environment capable of adapting to changing needs and workloads.
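To make such an allocation policy concrete, here is a toy scoring function. All field names, weights, and numbers are illustrative assumptions, not NGPU's production policy; the point is that critical data weighs reliability heavily while bulk data weighs cost heavily:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    price_per_gb: float  # observed storage cost
    uptime: float        # historical availability in [0, 1]
    latency_ms: float    # median retrieval latency

def choose_backend(backends: list[Backend], critical: bool) -> Backend:
    """Pick a storage backend by a weighted score of reliability, cost, latency."""
    def score(b: Backend) -> float:
        w_rel, w_cost, w_lat = (0.7, 0.1, 0.2) if critical else (0.2, 0.6, 0.2)
        return w_rel * b.uptime - w_cost * b.price_per_gb - w_lat * (b.latency_ms / 1000)
    return max(backends, key=score)

# Illustrative numbers only: Filecoin as cheap bulk storage, node SSDs as fast tier.
backends = [Backend("filecoin", 0.002, 0.995, 900), Backend("node-ssd", 0.02, 0.98, 40)]
print(choose_backend(backends, critical=True).name)
```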

By refining the storage strategy, the NGPU network will ultimately achieve a more efficient, reliable, and cost-effective system, fully utilizing the benefits of Filecoin alongside traditional node storage.

Total Budget Requested

| Milestone # | Description | Deliverables | Completion Date | Funding |
| --- | --- | --- | --- | --- |

Maintenance and Upgrade Plans

We will maintain and upgrade NGPU along the following ten aspects.

  1. Community Feedback and Collaboration: Actively gather feedback from the user community, incorporate suggestions, and collaborate with stakeholders to refine and improve the network continuously.

  2. Regular System Updates: Ensure continuous updates of system software, GPU drivers, and libraries to maintain compatibility, performance, and security.

  3. Scalability Enhancements: Implement upgrades to support scaling the network efficiently, accommodating more GPUs, nodes, and computing tasks as demand grows.

  4. Performance Optimization: Continuously optimize the network's algorithms and workload distribution strategies to improve computational speed, reduce latency, and maximize GPU utilization.

  5. Fault Tolerance and Redundancy: Enhance fault tolerance mechanisms, including automated failover, load balancing, and data redundancy, to ensure high availability and minimize downtime.

  6. Security Upgrades: Regularly update security protocols, including data encryption, access controls, and monitoring, to protect against cyber threats and unauthorized access.

  7. Resource Management Improvements: Upgrade resource allocation and scheduling systems to better manage GPU workloads, prioritize tasks, and maximize efficiency.

  8. Monitoring and Analytics Tools: Develop advanced monitoring and analytics tools to provide real-time insights into network performance, detect anomalies, and predict maintenance needs.

  9. User Interface and Accessibility: Continuously improve user interfaces and APIs to enhance accessibility, ease of use, and integration capabilities for developers and end-users.

  10. Energy Efficiency Initiatives: Implement energy-saving strategies, including optimizing power usage across the network, to reduce operational costs and environmental impact.


Team

Team Members

Team Member LinkedIn Profiles

Alain Garner: https://www.linkedin.com/in/alaingarner/

Team Website

Website is https://ngpu.ai/.

Relevant Experience

The NGPU team members possess in-depth knowledge of GPU clusters and comprehensive project experience. Over the past six years, Alain has been involved in multiple Web3 startups, successfully leading projects such as Comtech. Meanwhile, Gene and Ivan, as core members of the Google TPU team, built large-scale AI training and inference compute clusters. These are essential ingredients for the success of an AI DePIN project and will drive NGPU toward its goals.

Team code repositories

Here are the repositories of some competitor projects:

  1. IO.net, https://io.net/, https://github.com/ionet-official
  2. Tao bittensor, https://bittensor.org/, https://github.com/opentensor
  3. Akash, https://akash.network/, https://github.com/akash-network

Additional Information

We learned about the Open Grants Program through our social network. Our contact email is info@ngpu.ai, and our Twitter is @ngpu_ai, where we post project updates.

ErinOCon commented 1 month ago

Hi @ThornbirdZhang, thank you for your proposal! Can you confirm the number of users onboarded and any projected numbers that apply for this project?

ThornbirdZhang commented 1 month ago

@ErinOCon Thank you for your review.

Currently, the NGPU network has nearly 100 computing power nodes and has integrated with more than a dozen project teams and individual developers to provide GPU computing power, including ParityBit, DigiForge, and ANVD (https://www.ngpu.ai). It also supports direct integration with the NGPU AI API (https://ngpu.readme.io/reference/gettask). The daily machine usage rate exceeds 40%, and the NGPU AI API is called approximately 3,000 times per day.

We expect that by the end of Q1 2025, the number of partners will increase to over 50, the machine usage rate will exceed 50%, and the daily API calls will surpass 8,000.

ErinOCon commented 1 month ago

Thank you @ThornbirdZhang!

ErinOCon commented 1 week ago

Hi @ThornbirdZhang, I hope you are doing well! Your project is currently shortlisted as a review candidate. If we have any remaining questions, we will contact you on this thread.

If you have any questions on your end, please feel welcome to be in touch at grants@fil.org. We would be happy to connect.

ThornbirdZhang commented 1 week ago

Thank you very much! We are eager to use Filecoin as our basic storage infrastructure.