X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0

[OSPP2023] Idea list for Summer 2023 #1246

Closed by xgdyp 1 year ago

xgdyp commented 1 year ago

Description

Hi all, we passed the OSPP2023 project review and OpenDigger has been accepted as a community for OSPP2023.

Now we need to determine our idea list. If you have any ideas or you want to be a project mentor, please leave a comment.

Please note that we need to post our ideas before 25 April.

Some references:

  1. https://github.com/X-lab2017/open-digger/issues/552
  2. https://github.com/X-lab2017/open-digger/issues/878
  3. https://github.com/X-lab2017/open-digger/issues/896

xgdyp commented 1 year ago

Idea 1

A community analysis case based on the GitHub social network.

Description: OpenDigger currently lacks any exploration of graph metrics. This task is mainly to explore GitHub collaboration from the perspective of social networks.

Expected outcomes: A case similar to the other notebook cases

Skills: Python, PyTorch

xgdyp commented 1 year ago

Refactor OpenDigger with a SQL query builder

Description: Long, complex SQL is difficult to maintain, which hinders the development of the project. One way to solve this problem is to generate SQL through a SQL builder, which can make the SQL easier to read, especially subqueries. So we want to explore whether there is a mature framework that can help OpenDigger. In Python, we can use pypika, a Python package that supports some ClickHouse SQL syntax. What we need to do is investigate in detail whether it can cover our SQL and whether it will really improve OpenDigger.

If yes, we can refactor the Python kernel first as an OSPP task. I think I can follow up on this project.
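
To make the subquery point concrete, here is a minimal sketch of the builder idea in plain Python. It is not pypika's API: the `Select` class is a hypothetical toy written for this example, and the `events` table and its columns (`repo_name`, `actor_login`, `type`) are assumed names rather than OpenDigger's confirmed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Select:
    """Hypothetical minimal SELECT builder (illustration only, not pypika)."""

    source: Union[str, "Select"]
    columns: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    group_cols: List[str] = field(default_factory=list)
    alias: Optional[str] = None

    def select(self, *cols: str) -> "Select":
        self.columns.extend(cols)
        return self

    def where(self, cond: str) -> "Select":
        self.conditions.append(cond)
        return self

    def group_by(self, *cols: str) -> "Select":
        self.group_cols.extend(cols)
        return self

    def as_(self, alias: str) -> "Select":
        self.alias = alias
        return self

    def sql(self) -> str:
        # A nested Select is rendered as a parenthesised subquery.
        if isinstance(self.source, Select):
            src = f"({self.source.sql()})"
            if self.source.alias:
                src += f" AS {self.source.alias}"
        else:
            src = self.source
        parts = [f"SELECT {', '.join(self.columns) or '*'}", f"FROM {src}"]
        if self.conditions:
            parts.append("WHERE " + " AND ".join(self.conditions))
        if self.group_cols:
            parts.append("GROUP BY " + ", ".join(self.group_cols))
        return " ".join(parts)


# Inner query: issue comments per actor per repo (assumed table and columns).
per_actor = (
    Select("events")
    .select("repo_name", "actor_login", "count() AS comments")
    .where("type = 'IssueCommentEvent'")
    .group_by("repo_name", "actor_login")
    .as_("per_actor")
)

# Outer query: number of commenters per repo, reusing the inner query object.
per_repo = (
    Select(per_actor)
    .select("repo_name", "count() AS commenters")
    .group_by("repo_name")
)

print(per_repo.sql())
```

The inner query is a named Python object that can be reused and tested on its own, which is the readability gain hoped for here. Whether a real builder such as pypika can express all of OpenDigger's ClickHouse SQL is exactly what the investigation would need to establish.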

Expected outcomes: Metrics generated by a Python SQL builder

Skills: Python, TypeScript, JavaScript, SQL

references: https://github.com/didi/gendry https://github.com/sqlkata/querybuilder https://github.com/ibis-project/ibis https://github.com/doug-martin/goqu

xgdyp commented 1 year ago

Data Quality Assurance & Metrics Explanation by case

During the development of indicators, we need to verify that the SQL we write returns the correct query results. Our previous approach was to select a repo and compare the data manually. Also, the data is sometimes missing, and we are often not sure whether our implementation is wrong or the data is abnormal. This approach is inefficient and error-prone.

So we need a DQA mechanism that performs data checks and metric checks automatically.

Furthermore, I think we can add a dataset and use it for metric explanation. For example, if a metric evaluates to 10 on this dataset, we can see exactly which records make the metric 10 and display those records.
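
As a rough sketch of both parts (an automatic metric check and an explanation by records), the example below runs a toy metric over a small hand-written fixture. Everything in it is an assumption made for illustration: the field names, the `issues_opened` metric, and the `check_metric` helper are hypothetical and not part of OpenDigger.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

# Tiny hand-written fixture; field names are assumptions, not OpenDigger's schema.
FIXTURE: List[Record] = [
    {"repo": "org/a", "actor": "alice", "type": "IssuesEvent", "action": "opened"},
    {"repo": "org/a", "actor": "bob", "type": "IssuesEvent", "action": "opened"},
    {"repo": "org/a", "actor": "bob", "type": "IssuesEvent", "action": "closed"},
    {"repo": "org/b", "actor": "carol", "type": "IssuesEvent", "action": "opened"},
    {"repo": "org/a", "actor": "alice", "type": "WatchEvent", "action": "started"},
]

Metric = Callable[[List[Record], str], Tuple[int, List[Record]]]


def issues_opened(records: List[Record], repo: str) -> Tuple[int, List[Record]]:
    """Return the metric value together with the records that produced it."""
    hits = [
        r
        for r in records
        if r["repo"] == repo
        and r["type"] == "IssuesEvent"
        and r["action"] == "opened"
    ]
    return len(hits), hits


def check_metric(metric: Metric, records: List[Record], repo: str, expected: int) -> bool:
    """Automatic check: compare the metric against a value known for the fixture."""
    value, _ = metric(records, repo)
    return value == expected


value, evidence = issues_opened(FIXTURE, "org/a")
print("issues_opened(org/a) =", value)
for record in evidence:
    print("  counted:", record)

print("check passed:", check_metric(issues_opened, FIXTURE, "org/a", expected=2))
```

Because the fixture is small and fixed, the expected value can be worked out by hand once, and a failing check then points to the implementation rather than to missing data.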

references: https://help.aliyun.com/document_detail/116897.html

xgdyp commented 1 year ago

Hi @frank-zsy, do you have any more ideas? We need to select two of them because we only have 2 spots.

frank-zsy commented 1 year ago

@xgdyp I think all 3 tasks would be great OSPP tasks, but there are some tricky points about them.

So I think the last 2 may work.

xgdyp commented 1 year ago

OK, for the first task, what I hope to get is a case that shows how to build a graph from the raw logs in ClickHouse and use this graph for analysis. I think we don't need to use our own graph database (the way to build a graph network may also be different); its purpose is to provide users with a graph analysis tutorial.

frank-zsy commented 1 year ago

So we don't really need a graph database, and we only need to load the data from ClickHouse, build the graph in memory, and compute the metrics? This is doable but may not lead to large scale metric data export due to the performance issue.
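
A minimal sketch of that in-memory approach, using only the standard library, might look like the following. The `ROWS` list stands in for rows already fetched from ClickHouse, and its shape (actor, repo, issue or PR number) is an assumption for illustration. Linking actors who participate in the same thread is one possible way to define an edge, not the project's established method.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Stand-in for rows fetched from ClickHouse: (actor, repo, issue or PR number).
# The row shape and the values are assumptions for this example.
ROWS = [
    ("alice", "org/a", 1),
    ("bob", "org/a", 1),
    ("carol", "org/a", 1),
    ("alice", "org/a", 2),
    ("dave", "org/a", 2),
    ("alice", "org/a", 1),  # repeated activity on the same thread
]


def build_graph(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, Set[str]]:
    """Link two actors when they took part in the same issue or PR thread."""
    threads: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for actor, repo, number in rows:
        threads[(repo, number)].add(actor)

    graph: Dict[str, Set[str]] = defaultdict(set)
    for actors in threads.values():
        for u, v in combinations(sorted(actors), 2):
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)


def degree_centrality(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


graph = build_graph(ROWS)
centrality = degree_centrality(graph)
for actor, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{actor}: {score:.2f}")
```

Actors who only appear alone in a thread get no edges and are left out of this graph. The pairwise step grows quadratically with the number of participants per thread, which is one place the performance concern would show up at larger scale.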

xgdyp commented 1 year ago

> This is doable but may not lead to large scale metric data export due to the performance issue.

That's a problem, but I think it should be within the acceptable range. The task may focus on the analysis of some communities (like the PaddlePaddle Hackathon) rather than all of them.

frank-zsy commented 1 year ago

OK, if you insist on adding this task to OSPP, I am fine with it. But in the future, if we have a public graph database for the global collaboration network, we may need to refactor the code for the network metric implementation.