bigcode-project / starcoder2

Home of StarCoder2!
Apache License 2.0

What prevents you from thoroughly open-sourcing? #4

Closed yucc-leon closed 6 months ago

yucc-leon commented 6 months ago

I noticed that even though bigcode/starcoder(2) is much more open than Code Llama and DeepSeek-Coder (e.g., open-sourced datasets, clearly described data processing and training, and so on), it is still not thoroughly open: the code used for pretraining and data processing has never been open-sourced. So, just out of curiosity, what prevents you from doing that?

UniverseFly commented 6 months ago

Just want to point out that the data processing pipeline is open-source (https://github.com/bigcode-project/the-stack-v2). The same is true for StarCoder1 (https://github.com/bigcode-project/bigcode-dataset/). To my knowledge, StarCoders are the only code LLMs with such great transparency.

yucc-leon commented 6 months ago

Wow, I just found this repo. Sorry for my ignorance...

udaygiri commented 6 months ago

The reasons for not fully open-sourcing pretraining and data processing code in projects like BigCode/StarCoder(2) may include:

Intellectual Property: Protecting unique innovations or proprietary techniques.

Security Concerns: Preventing misuse of powerful AI models.

Quality and Reputation: Ensuring the quality of the code and avoiding negative impacts from misuse.

Resource Constraints: The high resource requirement for supporting an open-source project.

Legal Agreements: Restrictions due to collaborations or partnerships.

Data Privacy: Compliance with legal constraints related to data privacy and copyright.

These factors balance transparency against practical concerns such as security, legal compliance, and resource management.

yucc-leon commented 6 months ago

Thanks a lot.