Green-Software-Foundation / hack

Carbon Hack 24 - The annual hackathon from the Green Software Foundation
https://grnsft.org/hack/github

Create a manifest file for training an LLM #63

Open Jjing-Liang opened 5 months ago

Jjing-Liang commented 5 months ago

Type of project

Writing content about Impact Framework

Overview

The idea comes from this discussion.

Large Language Models (LLMs) have evolved rapidly over the past year and have shown great potential in many areas. At the same time, we cannot ignore their impact on the environment. We would therefore like to survey existing LLMs and generate a manifest file that can be used to calculate the energy and carbon consumed by LLM training. Actual models and calculations are not required here; it is enough to create a manifest file that runs, as this will expose gaps in the current stack.
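As a starting point, such a manifest might look something like the sketch below. The plugin choice (`sci-e` from `@grnsft/if-plugins`), the tree layout, and every input value are illustrative assumptions; the exact schema depends on the Impact Framework version in use:

```yaml
# Hypothetical sketch of an IF manifest for LLM training energy.
# Plugin names, paths, and all numbers are placeholders, not measured data.
name: llm-training-carbon
description: Illustrative manifest for estimating LLM training emissions
initialize:
  plugins:
    sci-e:
      method: SciE
      path: '@grnsft/if-plugins'
tree:
  children:
    llm-training-run:
      pipeline:
        - sci-e            # sums the energy components into one 'energy' value
      inputs:
        - timestamp: '2024-01-01T00:00:00Z'
          duration: 3600   # seconds of training observed
          energy-cpu: 0.5  # kWh, placeholder
          energy-gpu: 12.0 # kWh, placeholder
```

Even with rough numbers, getting a file like this to run end-to-end is exactly what surfaces the gaps in the current plugin stack.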

Questions to be answered

At the moment we have no questions. If you have relevant suggestions or resources, feel free to leave a message.

Have you got a project team yet?

Yes and we aren't recruiting

Project team

Team members: @Xiaoliang1122, @Irenia111, @Jjing-Liang

Terms of Participation


Project Submission

Summary

This tutorial outlines methods for estimating the carbon emissions of Large Language Models (LLMs) during training and inference using the Impact Framework. It offers manifest examples at various levels of estimation detail and helps compare emissions across different LLM configurations, promoting carbon reduction. We also provide an up-to-date dataset of LLM carbon-emissions papers and related public data for convenient calculation, supporting a more sustainable future for AI and the environment.

Problems

Large Language Models (LLMs) have evolved rapidly over the past year and have shown great potential in many areas. At the same time, we cannot ignore their impact on the environment. The current version of the Impact Framework lacks a comprehensive plugin for accurately calculating the carbon emissions generated by LLMs, which makes it hard for users to understand and manage the environmental impact of their AI-driven operations. Our proposed solution leverages the Impact Framework to define a manifest tailored to assessing LLM carbon emissions. This will let users see what makes up an LLM's carbon footprint and identify the critical factors influencing it. Additionally, we will compile a readily accessible dataset of public data on LLM computation, enabling straightforward lookup for anyone aiming to evaluate and mitigate the carbon emissions of their AI models.

Application

Our solution is a tutorial guide for calculating the LLM carbon footprint, consisting of a collection of manifest files and explanatory articles. Built on the Impact Framework, it illustrates the carbon footprint of LLMs using IF plugins. By supplying simple input data, you can gain a comprehensive understanding of LLM carbon emissions. The manifests and accompanying explanations will help you explore the structure of the LLM carbon footprint and understand each emission component. An up-to-date dataset of LLM carbon-emissions papers and related public data is also provided for convenient calculation.

Prize category

Best content

Judging criteria

Overall Impact: The tutorial's approach to evaluating LLM carbon emissions stands to bolster sustainability efforts by offering a methodology for assessing AI's environmental footprint. By equipping users with the Impact Framework to make eco-conscious decisions, it fosters more sustainable AI development and promotes a shift towards greener technology practices. To realize this impact, the tutorial needs broad outreach to the AI and sustainability communities, integration with the Impact Framework for ongoing support, and continued research to refine the methodology and expand the data.

Clarity: The tutorial clarifies the complex subject of LLM carbon emissions with well-structured guidance and relatable examples. Its use of visual aids and plain language makes the material comprehensible to a wide range of users, from seasoned professionals to newcomers, keeping the information both accessible and engaging.

Innovation: Innovative in both concept and execution, the tutorial breaks new ground in AI sustainability by harnessing the Impact Framework for carbon-emission analysis of LLMs. The diverse manifest examples not only streamline the estimation process but also teach users the nuances of carbon-footprint assessment, marking a significant advance in the field.

Video

https://youtu.be/pOIdXF0N9HQ

Artefacts

https://github.com/Jjing-Liang/LLMCarbon--/blob/main/content.md

Usage

https://github.com/Jjing-Liang/LLMCarbon--/blob/main/README.md

Process

Developing this tutorial involved thorough research into the environmental impact of LLMs and existing carbon-emissions evaluation tools. We studied the Impact Framework extensively to understand its functionality and applicability. Based on this research, we designed multiple manifests at different levels of granularity to calculate LLM carbon emissions, covering factors such as server energy consumption, training time, data transfer, and manufacturing (embodied) costs. The tutorial gives users a simple, comprehensive method to calculate and assess the carbon emissions of their LLMs, and it was tested and refined to ensure accuracy and reliability.
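The factors listed above can be combined into a rough first-order estimate before reaching for any plugins. Everything in the sketch below — the formula, the function name, and every constant — is an illustrative assumption for explanation, not part of our published manifests or any measured dataset:

```python
# Back-of-envelope estimate of LLM training carbon, combining the factors
# mentioned above: server energy, training time, and an amortized share of
# manufacturing (embodied) emissions. All constants are placeholders.

def training_carbon_kg(
    gpu_count: int,
    gpu_power_kw: float,                        # average draw per GPU, kW
    training_hours: float,
    pue: float = 1.2,                           # data-centre overhead factor
    grid_gco2_per_kwh: float = 400.0,           # grid carbon intensity, gCO2e/kWh
    embodied_kg_per_gpu: float = 150.0,         # embodied carbon per GPU, kgCO2e
    gpu_lifetime_hours: float = 4 * 365 * 24,   # assume ~4-year hardware lifetime
) -> float:
    """Return estimated training emissions in kg CO2e."""
    # Operational: energy drawn by the GPUs, scaled up by facility PUE.
    energy_kwh = gpu_count * gpu_power_kw * training_hours * pue
    operational_kg = energy_kwh * grid_gco2_per_kwh / 1000.0
    # Embodied: manufacturing carbon amortized over the hardware lifetime.
    embodied_kg = gpu_count * embodied_kg_per_gpu * training_hours / gpu_lifetime_hours
    return operational_kg + embodied_kg

# Example: 1,000 GPUs drawing 0.3 kW each for 30 days.
print(round(training_carbon_kg(1000, 0.3, 30 * 24), 1))
```

A manifest mirrors this arithmetic as a pipeline of plugins over timestamped inputs, which is why even a coarse manifest is a useful artefact: each term in the formula maps to a plugin whose presence or absence in the stack becomes visible.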

Inspiration

Our inspiration for developing this tutorial came from three key sources. First, the growing awareness of environmental issues and the need for sustainable development motivated us to address the environmental impact of artificial intelligence, particularly LLMs. Second, the widespread application of LLMs across domains highlighted the environmental challenges of their resource-intensive nature. Finally, discovering the Impact Framework's capabilities in environmental assessment inspired us to write a tutorial that integrates with the framework to evaluate LLM carbon emissions. By combining these motivations, we aimed to contribute to the sustainable development of AI with a comprehensive solution for assessing and managing LLM carbon footprints.

Challenges

We encountered several challenges during the process, including data availability and collection complexities for evaluating the environmental impact of LLMs. Understanding the complex operations of LLMs and accurately evaluating their environmental impact required expertise in AI and environmental evaluation methodologies. Integrating the evaluation framework with LLMs proved challenging, necessitating customization to align with their specific requirements. However, through extensive research, collaboration, and optimization, we successfully addressed these challenges, resulting in a solution for assessing and managing the environmental impact of LLMs.

Accomplishments

We are proud of how rapidly we constructed the LLM evaluation and came to understand its relationship with carbon emissions. Despite starting from scratch, we dedicated substantial time and effort to acquiring the necessary knowledge and applying it effectively. Throughout the process, we overcame technical and theoretical hurdles through extensive research and practical experimentation. Our team demonstrated remarkable adaptability and learning capability, enabling us to complete the task efficiently. Ultimately, our pride stems from our ability to swiftly build the LLM manifest and understand the complexities of carbon emissions. This accomplishment showcases our team's commitment and competence, and it establishes a solid foundation for future work.

Learnings

Through this hackathon, our team acquired valuable skills and insights. We now understand the factors influencing carbon emissions in large language models. The project improved our ability to gather and organize information effectively, enabling informed decision-making. Using the IF framework, we learned to decompose complex challenges into logical statements and conditions. Our problem-solving and critical-thinking skills also sharpened, allowing us to approach challenges creatively. Overall, this journey gave us invaluable knowledge and skills.

What’s next

Our solution aims to have a lasting impact on the Impact Framework ecosystem by deepening our understanding of carbon emissions in LLMs, optimizing processes through advanced technologies, and ensuring scalability and replicability.

We plan to conduct continuous research and data analysis to gain a deeper understanding of the factors that influence LLM carbon footprints. This knowledge will enable us to develop targeted strategies for reducing emissions. We also want to streamline processes and enhance data-analytics capabilities by integrating advanced technologies and intelligent tools, improving the efficiency and accuracy of data collection, organization, and analysis so that participants can effectively monitor and evaluate their carbon emissions. Furthermore, our solution is designed to be scalable and replicable across different scales, regions, and sectors, facilitating widespread adoption and encouraging more stakeholders to embrace low-carbon practices.

In summary, our solution will contribute to the long-term development of the Impact Framework ecosystem by deepening our understanding of carbon emissions, optimizing processes, and ensuring scalability and replicability. Through these efforts, we will drive the adoption of sustainable practices and contribute to a more sustainable future.

jawache commented 5 months ago

Hi @Jjing-Liang, thank you for this excellent submission!

Given that the intention is not to create any new plugins, I believe the best category for you to submit this project under is "Best Content".

My suggestion would be to treat this perhaps as a tutorial where you teach someone "how to" calculate the emissions of an LLM through a worked example.

Happy to support and guide with some approaches for measuring LLMs.

Are you thinking of including inference as well as training?

Do you have any data sets/observations you have already gathered about some LLM training?

If so we can suggest the plugins we would use to convert those to carbon.

Jjing-Liang commented 5 months ago

Hi @jawache, Thanks for the suggestion! I've changed it to "Writing content about Impact Framework".

Regarding the suggested "tutorial": it's a good idea. We were thinking of organizing and summarizing various articles and papers to produce a manifest, but we can indeed try to make a demo, which would better guide readers on how to evaluate the carbon emissions of training a large model.

Our team has just started searching for relevant articles and papers, and we haven't summarized them yet, so the datasets/observations side is still empty. We will leave a comment if there is any update on this, and we would certainly welcome suggestions for available plugins.

Regarding whether to include inference: our idea is to start with training, because not everyone on the team is familiar with LLMs, so a smaller scope is more conducive to collaborating and producing results. However, I will bring this idea back to the team; it is a good direction for us.

One last question: we found the example file msft-green-ai.yaml in the IF repository. Can we use this example as a reference point for our work?