Open will-ww opened 2 years ago
Following the guide that Lou (a senior labmate) wrote on yuque, I got opendigger working and used it for the research below. In Lou's reference code,
class ConnDB():
    ...
    def query_clickhouse(self):
        ...
        self.rs, column_types = client.execute(self.sql, with_column_types=True)
        ...
there is a variable-name error here: client should be self.client. The corrected code is:
class ConnDB():
    ...
    def query_clickhouse(self):
        ...
        self.rs, column_types = self.client.execute(self.sql, with_column_types=True)
        ...
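To make the fix concrete, here is a minimal sketch of a class that keeps its own client. It assumes the guide uses the clickhouse_driver package, whose Client.execute(..., with_column_types=True) returns the rows together with a list of (name, type) pairs. The constructor, the FakeClient stand-in, and the rows_as_dicts helper are hypothetical and are not taken from the guide.

```python
class FakeClient:
    """Hypothetical stand-in for clickhouse_driver.Client so the sketch runs offline."""

    def execute(self, sql, with_column_types=False):
        # Two rows copied from the fork ranking in this issue.
        rows = [("tensorflow/tensorflow", 105201), ("github/gitignore", 98386)]
        columns = [("repo_name", "String"), ("forks", "UInt64")]
        return (rows, columns) if with_column_types else rows


class ConnDB:
    def __init__(self, client):
        # Keep the client on the instance. A bare `client` inside a method
        # raises NameError unless a global with that name happens to exist.
        self.client = client
        self.sql = ""
        self.rs = []
        self.columns = []

    def query_clickhouse(self):
        self.rs, column_types = self.client.execute(self.sql, with_column_types=True)
        # column_types is a list of (name, type) pairs; keep only the names.
        self.columns = [name for name, _type in column_types]

    def rows_as_dicts(self):
        # Hypothetical helper: pair each row with the column names.
        return [dict(zip(self.columns, row)) for row in self.rs]


conndb = ConnDB(FakeClient())
conndb.sql = "SELECT repo_name, COUNT() AS forks FROM github_log.events WHERE type = 'ForkEvent' GROUP BY repo_name"
conndb.query_clickhouse()
print(conndb.rows_as_dicts())
```

Passing the client into the constructor also makes it easy to swap the real connection for a stub when testing queries offline.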
The final result is as follows:
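None of the features documented in opendigger relates to dependencies, so I am currently considering the following two approaches:
Model on a combination of the following features:
repo_forks_count (the number of times the repository has been forked)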
# fork
conndb.sql = '''SELECT repo_name, COUNT() AS forks
FROM github_log.events WHERE type = 'ForkEvent'
GROUP BY repo_name
ORDER BY forks DESC
LIMIT 20
'''
conndb.execute()
rs = conndb.df_rs
print(rs)
[Out]
repo_name forks
0 jtleek/datasharing 211367
1 octocat/Spoon-Knife 152126
2 rdpeng/ProgrammingAssignment2 132934
3 tensorflow/tensorflow 105201
4 github/gitignore 98386
5 SmartThingsCommunity/SmartThingsPublic 92167
6 LSPosed/MagiskOnWSA 89289
7 twbs/bootstrap 85463
8 nightscout/cgm-remote-monitor 79427
9 barryclark/jekyll-now 75867
10 Pierian-Data/Complete-Python-3-Bootcamp 74767
11 jwasham/coding-interview-university 59130
12 eugenp/tutorials 57992
13 opencv/opencv 56900
14 github/docs 56249
15 tensorflow/models 54978
16 facebook/react 54925
17 rdpeng/ExData_Plotting1 53421
18 firstcontributions/first-contributions 52779
19 jlord/patchwork 52512
repo_stargazers_count (the number of times the repository has been starred)
# star
conndb.sql = '''SELECT repo_name, COUNT() AS stars
FROM github_log.events WHERE type = 'WatchEvent'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 20
'''
conndb.execute()
rs = conndb.df_rs
print(rs)
[Out]
repo_name stars
0 996icu/996.ICU 270415
1 vuejs/vue 227612
2 sindresorhus/awesome 227307
3 FreeCodeCamp/FreeCodeCamp 224461
4 kamranahmedse/developer-roadmap 217349
5 facebook/react 214558
6 jwasham/coding-interview-university 209954
7 donnemartin/system-design-primer 196127
8 tensorflow/tensorflow 193686
9 freeCodeCamp/freeCodeCamp 179361
10 EbookFoundation/free-programming-books 171178
11 getify/You-Dont-Know-JS 162998
12 flutter/flutter 159709
13 TheAlgorithms/Python 152362
14 public-apis/public-apis 152056
15 trekhleb/javascript-algorithms 151493
16 danistefanovic/build-your-own-x 143228
17 torvalds/linux 138606
18 vinta/awesome-python 138026
19 jackfrued/Python-100-Days 136214
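The ranking above lists both FreeCodeCamp/FreeCodeCamp and freeCodeCamp/freeCodeCamp, which suggests that grouping by repo_name splits the count of a repository whose name changed. The sketch below shows what merging by repo_id would look like. It assumes the two names share one repo_id. The star counts are taken from the output above, but the repo_id values are hypothetical.

```python
from collections import defaultdict

# (repo_id, repo_name, stars). The repo_id values are hypothetical and only
# illustrate two names that point at the same repository.
rows = [
    (1, "FreeCodeCamp/FreeCodeCamp", 224461),
    (1, "freeCodeCamp/freeCodeCamp", 179361),
    (2, "vuejs/vue", 227612),
]


def merge_by_repo_id(rows):
    """Sum counts per repo_id and label each total with the last name seen."""
    totals = defaultdict(int)
    latest_name = {}
    for repo_id, repo_name, count in rows:
        totals[repo_id] += count
        latest_name[repo_id] = repo_name
    return {latest_name[repo_id]: total for repo_id, total in totals.items()}


merged = merge_by_repo_id(rows)
print(merged)
```

If the assumption holds, the merged total for freeCodeCamp would rank above every single entry in the list, so the choice of grouping key can change the ordering.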
issues_total_lengths (the total number of comments across the repository's issues. Note: this feature has to be computed by hand, by summing issue_comments over all issue_id values, i.e. $$ \sum_{issue\_id} issue\_comments $$)
# issue total lengths
conndb.sql = '''SELECT repo_name, SUM(issue_comments) AS issuecommentscount
FROM github_log.events WHERE type = 'IssuesEvent'
GROUP BY repo_name
ORDER BY issuecommentscount DESC
LIMIT 20
'''
conndb.execute()
rs = conndb.df_rs
print(rs)
[Out]
repo_name issuecommentscount
0 cxflowtestuser/VB_3845 45368466
1 openshift/origin 1114289
2 kubernetes/kubernetes 532699
3 vueComponent/ant-design-vue 453512
4 quarkusio/quarkus 291745
5 flutter/flutter 290014
6 BeardedTinker/Home-Assistant_Config 270182
7 AdguardTeam/AdguardFilters 257230
8 microsoft/vscode 257093
9 MicrosoftDocs/azure-docs 231377
10 golang/go 219295
11 Microsoft/vscode 217437
12 tensorflow/tensorflow 184577
13 OpenTermsArchive/contrib-declarations 164716
14 openjournals/joss-reviews 164462
15 elastic/kibana 157710
16 magento/magento2 147019
17 rust-lang/rust 144809
18 ansible/ansible 134267
19 home-assistant/core 131854
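The formula sums once per issue_id, whereas the query above sums once per IssuesEvent row. If issue_comments is a snapshot of the issue's comment count at the time of each event (an assumption about the schema, not something confirmed here), an issue with several events is counted several times. The sketch below contrasts the two aggregations on hypothetical rows.

```python
from collections import defaultdict

# Hypothetical IssuesEvent rows: (repo_name, issue_id, issue_comments).
events = [
    ("demo/repo", 1, 0),  # issue 1 opened
    ("demo/repo", 1, 5),  # issue 1 closed with 5 comments
    ("demo/repo", 1, 7),  # issue 1 reopened with 7 comments
    ("demo/repo", 2, 3),  # issue 2 closed with 3 comments
]


def sum_over_events(events):
    """Mirror SUM(issue_comments) ... GROUP BY repo_name: one term per event."""
    totals = defaultdict(int)
    for repo_name, _issue_id, comments in events:
        totals[repo_name] += comments
    return dict(totals)


def sum_over_issues(events):
    """Follow the formula: one term per issue, using its largest snapshot."""
    largest = {}
    for repo_name, issue_id, comments in events:
        key = (repo_name, issue_id)
        largest[key] = max(largest.get(key, 0), comments)
    totals = defaultdict(int)
    for (repo_name, _issue_id), comments in largest.items():
        totals[repo_name] += comments
    return dict(totals)


print(sum_over_events(events))  # {'demo/repo': 15}
print(sum_over_issues(events))  # {'demo/repo': 10}
```

In SQL, the per-issue version corresponds to an inner GROUP BY repo_name, issue_id that takes MAX(issue_comments), followed by an outer SUM per repo_name.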
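release_downloadedcount (the total number of downloads of the repository's release assets. Note: this feature has to be computed by hand, by summing the download counts of every asset of every release, i.e. $$ \sum_{release\_id} \left[ \sum_{release\_assets} release\_assets.download\_count \right] $$)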
# download sum
conndb.sql = '''SELECT repo_name, sum(arraySum(release_assets.download_count)) as downloadsum
FROM github_log.events WHERE type = 'ReleaseEvent'
GROUP BY repo_name
ORDER BY downloadsum DESC
LIMIT 20
'''
conndb.execute()
rs = conndb.df_rs
print(rs)
[Out]
repo_name downloadsum
0 pwn20wndstuff/Undecimus 8893581
1 rulosoft/rulo 2672807
2 youtvdev/youtv 773049
3 playviewdev/playview 727638
4 liquibase/liquibase 553060
5 RocketChat/Rocket.Chat.Electron 491455
6 AKhaMae/December 416995
7 atom/atom 401482
8 jgraph/drawio-desktop 388604
9 microsoft/vscode-cpptools 252364
10 meetfranz/franz 249828
11 AtlasNX/Kosmos 235319
12 electron/electron 231603
13 PhocaCz/PhocaGallery 215677
14 visualboyadvance-m/visualboyadvance-m 205073
15 getinsomnia/insomnia 200336
16 jp9000/obs-studio 198483
17 syncthing/syncthing 194396
18 Blazemeter/CorrelationRecorder 193475
19 FAForever/downlords-faf-client 188052
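The double sum in the formula can be worked through on a small example. ClickHouse's arraySum adds up the elements of one array, which is the inner sum over a release's assets, and the outer sum then runs over the rows of each repository. The rows below are hypothetical.

```python
# Hypothetical ReleaseEvent rows:
# (repo_name, release_id, download_count of each asset in that release).
release_events = [
    ("demo/app", 101, [120, 30]),
    ("demo/app", 102, [45]),
    ("demo/lib", 201, []),  # a release without assets contributes 0
]


def download_sum(release_events):
    """Inner sum over the assets of a release, outer sum over a repo's releases."""
    totals = {}
    for repo_name, _release_id, asset_downloads in release_events:
        totals[repo_name] = totals.get(repo_name, 0) + sum(asset_downloads)
    return totals


print(download_sum(release_events))  # {'demo/app': 195, 'demo/lib': 0}
```

One caveat, stated as an assumption: the download counts are presumably the values recorded when each event was logged, so the totals may understate current downloads.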
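Compute it by traversing GitHub and counting the repos mentioned there (the computational cost is too high and still needs optimization)
As for Libraries.io, we found that it covers dependency statistics for 32 open source package managers. The statistics were last updated in January 2020 and are refreshed once every 2 years, as shown in the figure below
The data is therefore not very timely, and what we can mainly borrow is its crawling method;
The OpenChain project is essentially a rights-confirmation project: it uses a series of questionnaires (all y/n choices) to determine which License suits a project/organization. We can refer to the questionnaire content to judge whether each project's licensing is reasonable (some of the criteria are hard to obtain and need further study);
The SLSA project still needs further study.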
issues_total_length -> issues_total_stars
3. issues_total_lengths (the total number of comments across the repository's issues. Note: this feature has to be computed by hand, by summing issue_comments over all issue_id values, i.e. $$ \sum_{issue\_id} issue\_comments $$)
Since each comment on an issue can also be starred, consider replacing the overall issue statistic with a metric that combines length and stars.
However, opendigger has no data on issue comment stars. #TODO: help needed.
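Related work:
- Over more than two years the lab has built an initial data infrastructure for collecting, storing, and analyzing the full set of GitHub event logs: https://github.com/X-lab2017/open-digger
- Platforms such as https://libraries.io/ provide dependency information between projects across languages, which is useful for building a risk model;
- The OpenChain project under the Linux Foundation provides standards and tools for license compatibility and compliance;
- The Linux Foundation and Google have also jointly launched open source projects such as SLSA to support software supply chain security.
GitHub's Dependabot supports: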
- monitoring a repo in real time and keeping all dependencies updated;
- detecting vulnerable dependencies;
TODO: this project's content needs careful study.
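How to Generate an SBOM with Free Open Source Tools describes the Software Bill of Materials (SBOM) and covers a number of related tools that are worth consulting; #TODO: study the tools mentioned in that article;
Dependency track includes an SBOM analysis platform that is worth consulting; #TODO: study how it works;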
2. repo_stargazers_count (the number of times the repository has been starred)
Following https://github.com/X-lab2017/open-digger/issues/914 , the star code is changed to
SELECT repo_id, max(repo_stargazers_count) as repo_stargazers_count
FROM github_log.events WHERE type = 'PullRequestEvent'
group by repo_id
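However, according to https://github.com/X-lab2017/open-digger/issues/914#issuecomment-1188525198 , this calculation is also incorrect; at the very least, its accuracy depends on how frequently the repository receives pull requests.
A better and more efficient way of counting stars still needs to be found.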
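The dependence on pull request frequency can be seen on a toy event log. The sketch below assumes that repo_stargazers_count is a snapshot stored on each PullRequestEvent row, so it is only refreshed when a pull request event occurs. The event rows and both helper functions are hypothetical.

```python
from datetime import date

# Hypothetical event log for one repository:
# (day, event type, repo_stargazers_count snapshot or None).
events = [
    (date(2022, 1, 1), "WatchEvent", None),
    (date(2022, 1, 2), "WatchEvent", None),
    (date(2022, 1, 3), "PullRequestEvent", 2),  # snapshot taken here
    (date(2022, 2, 1), "WatchEvent", None),
    (date(2022, 2, 2), "WatchEvent", None),
    (date(2022, 2, 3), "WatchEvent", None),     # no pull request since January
]


def stars_from_snapshots(events):
    """Mirror max(repo_stargazers_count) over PullRequestEvent rows."""
    snapshots = [s for _day, kind, s in events if kind == "PullRequestEvent" and s is not None]
    return max(snapshots) if snapshots else 0


def stars_from_watch_events(events):
    """Mirror COUNT() over WatchEvent rows."""
    return sum(1 for _day, kind, _s in events if kind == "WatchEvent")


print(stars_from_snapshots(events))     # 2: stale since the last pull request
print(stars_from_watch_events(events))  # 5: every star event is counted
```

Neither number is exact. The snapshot goes stale between pull requests, and counting WatchEvent rows never subtracts users who later unstar, because GitHub emits no event for that.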
Is there any recent progress on this? I'm quite interested.
Description
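Problem description:
Related work:
Research content:
The above are some preliminary ideas, for reference only; I will add some references later~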