X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
291 stars 86 forks

[question] the 'groupBy' parameter problem of the 'getRepoOpenrank' function #1279

Closed: PureNatural closed this issue 1 year ago

PureNatural commented 1 year ago

Description

I can use the following method to get the top 10 openrank databases. image

But when I add the groupBy parameter, the openrank from other domains, such as big_data and cloud_native, is also counted. image

If I want to get the openrank of each subdomain under the database domain, how should I choose the parameters here?

Thanks for your reply!

PureNatural commented 1 year ago

A similar question.

image

The above method can get the results for the application_software domain.

If I want to get the openrank of each domain under the application_domain, how should I choose the parameters here? @frank-zsy

frank-zsy commented 1 year ago

Thanks for the report. This is actually a bug in the group-by function when grouping by labels.

The reason is that in this line of SQL, I used two arrayJoin calls to get the id and name of the label data, since some repos may be labeled with multiple labels at the same time. I thought this SQL would give corresponding id and name columns, but it doesn't.

Just like the image shows:

image

The rows marked with red rectangles are the rows expected to be returned, but more rows come back because the two arrayJoin calls produce the product of the two arrays.

After that, grouping by the id column gives an arbitrary name for each id, and the openrank result is multiplied by n, where n is the length of the repo's label array.

I still cannot think of a proper way to fix this right now.
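To make the row multiplication concrete, here is a minimal Python sketch that mimics what two separate arrayJoin calls do to one repo row. It is only an illustration of the behavior described above, not the actual open-digger SQL. The repo, the ids, and the openrank value are made up for the example.

```python
from collections import defaultdict
from itertools import product

# Hypothetical repo row: label ids and names are parallel arrays.
repo = {
    "openrank": 10.0,
    "label_ids": ["id_database", "id_big_data"],
    "label_names": ["database", "big_data"],
}


def two_array_joins(row):
    """Mimic two independent arrayJoin calls: every id is paired with every name."""
    return [
        {"id": label_id, "name": name, "openrank": row["openrank"]}
        for label_id, name in product(row["label_ids"], row["label_names"])
    ]


rows = two_array_joins(repo)

# Group by id, summing openrank and collecting the names seen for each id.
totals = defaultdict(float)
names_seen = defaultdict(set)
for r in rows:
    totals[r["id"]] += r["openrank"]
    names_seen[r["id"]].add(r["name"])
```

With two labels, the repo yields four rows instead of two. Each id is paired with both names, so the name picked for a group is arbitrary, and each id's openrank sum is twice the real value.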

frank-zsy commented 1 year ago

I just found a way to generate corresponding id and name columns by using tuples with arrayJoin. I will fix this soon.
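The tuple idea can be sketched in Python as follows. This is an assumption about the shape of the fix, not the SQL from the thread: the two arrays are zipped into (id, name) pairs first, and the rows are expanded once over the pairs. In ClickHouse this would presumably mean something like arrayJoin over the result of arrayZip, but the thread does not show the final query. The repo data is hypothetical.

```python
from collections import defaultdict

# Hypothetical repo row: label ids and names are parallel arrays.
repo = {
    "openrank": 10.0,
    "label_ids": ["id_database", "id_big_data"],
    "label_names": ["database", "big_data"],
}


def tuple_array_join(row):
    """Expand once over zipped (id, name) pairs: one output row per label."""
    return [
        {"id": label_id, "name": name, "openrank": row["openrank"]}
        for label_id, name in zip(row["label_ids"], row["label_names"])
    ]


paired_rows = tuple_array_join(repo)

# Group by id; each label now contributes the repo's openrank exactly once.
paired_totals = defaultdict(float)
for r in paired_rows:
    paired_totals[r["id"]] += r["openrank"]
```

Here the repo produces exactly one row per label, each id keeps its own name, and the openrank is counted once per label rather than n times.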

image

/self-assign

frank-zsy commented 1 year ago

Tuples can be used to generate corresponding id and name columns, but when the SQL contains another aggregate function, ClickHouse throws an error about the items column.

I opened an issue in the ClickHouse repo and am waiting for a response from the community: https://github.com/ClickHouse/ClickHouse/issues/49583

frank-zsy commented 1 year ago

@PureNatural I will fix the bug in #1288, since a ClickHouse maintainer replied to my issue and gave a solution.

The rows returned for other labels are still correct, because some repos may also have other Tech-1 level labels like cloud_native or big_data, so you can filter the result to get the data you want, just like this:

image

Does this fit your requirement?
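Since the screenshot is not available here, this is a rough Python sketch of the kind of post-filtering suggested above. The shape of the grouped result (a list of rows with name and openrank fields) and the numbers are assumptions for illustration only. The helper keep_labels is hypothetical and not part of open-digger.

```python
# Hypothetical grouped output: label names come from the thread, values are made up.
grouped = [
    {"name": "database", "openrank": 120.0},
    {"name": "big_data", "openrank": 45.0},
    {"name": "cloud_native", "openrank": 30.0},
]


def keep_labels(rows, wanted):
    """Keep only the grouped rows whose label name is in the wanted set."""
    return [r for r in rows if r["name"] in wanted]


database_only = keep_labels(grouped, {"database"})
```

The same pattern, with the condition inverted, would drop an unwanted row from a grouped result.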

frank-zsy commented 1 year ago

As for the other question, if you want to compare the data within application_domain, you can use application_domain as the label, group by Domain-0, and filter out the Others row, like this:

image

PureNatural commented 1 year ago

Thanks for your work!

I think I can finish the blue paper after https://github.com/X-lab2017/open-digger/pull/1288 is merged.

My code will also be much simpler!