Open peter0083 opened 7 years ago
I have been receiving questions about the dataset, prizes, and submission method, and I wanted to share the answers with you all. Here are some of the questions and my answers to them:
d_r_uuid is the open source project ID.
dws, dns, and so objects are three features of the project (you can think of them as different features or representations of the open source project).
version is the security version used in the particular project (this is important because you need to know which version a project is using to check whether that version is affected by certain security vulnerabilities).
license_id is the ID of the open source license the project complies with. (Hint: the fewer licenses a project uses, the better, since there is a lower chance of violating a license.)
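The license hint above can be checked by counting distinct licenses per project. This is a minimal sketch that assumes the data has been loaded as a list of dictionaries with `d_r_uuid` and `license_id` keys; the sample rows and the function name `licenses_per_project` are made up for illustration and are not part of the dataset.

```python
from collections import defaultdict

# Hypothetical rows; only the column names come from the dataset description.
rows = [
    {"d_r_uuid": "proj-1", "license_id": 10},
    {"d_r_uuid": "proj-1", "license_id": 11},
    {"d_r_uuid": "proj-1", "license_id": 10},
    {"d_r_uuid": "proj-2", "license_id": 10},
]

def licenses_per_project(rows):
    """Return {project id: number of distinct license ids}."""
    licenses = defaultdict(set)
    for row in rows:
        licenses[row["d_r_uuid"]].add(row["license_id"])
    return {project: len(ids) for project, ids in licenses.items()}

counts = licenses_per_project(rows)
print(counts)
```

Sorting the result by count would surface the projects with the most licenses, which may be a starting point for one of the five insights.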
The deadline is Monday, March 20th, at noon. The final submission can be sent to me by email, with whatever attachments you used for your work. If you wrote code, we expect it along with your five insights.
A $50 gift card will be given to each team that submits work with at least five insights about the data. Winning teams will receive interview offers, a free lunch at our office with other data scientists, and a free office tour.
The key is finding interesting patterns and ways to visualize them. I wouldn’t say one algorithm is better than another, since there is no right or wrong answer. I suggest you keep asking questions such as “Are there any better methods?”, “Is this pattern intriguing?”, or “How can I do better?”
Black Duck Data Challenge
University of British Columbia
March 9th 2017
Some helpful questions
What are the top (e.g., 10) projects that other projects depend upon? For instance, most Java projects use log4j for logging. In graph terms, one can look for vertices that have a large number of edges pointing toward them (high in-degree).
What are the top (e.g., 10) projects that depend on many other projects? For instance, a node.js project has many dependencies (it uses other third-party libraries). In graph terms, one can look for vertices that have a large number of edges pointing away from them (high out-degree).
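The two degree questions above can be sketched with a simple counter. This assumes the dependency data can be turned into a list of `(dependent, dependency)` pairs; the edge list and the function name `top_by_degree` below are made up for illustration.

```python
from collections import Counter

# Hypothetical edges: (project, project it depends on).
edges = [
    ("app_a", "log4j"),
    ("app_b", "log4j"),
    ("app_c", "log4j"),
    ("app_a", "junit"),
    ("app_b", "junit"),
    ("app_a", "guava"),
]

def top_by_degree(edges, k=10):
    """Return the k vertices with the highest in-degree and out-degree."""
    in_degree = Counter(dst for _, dst in edges)   # edges pointing toward a vertex
    out_degree = Counter(src for src, _ in edges)  # edges pointing away from a vertex
    return in_degree.most_common(k), out_degree.most_common(k)

top_in, top_out = top_by_degree(edges, k=2)
print("most depended upon:", top_in)
print("most dependencies:", top_out)
```

Duplicate pairs are counted once per occurrence, so it may be worth deduplicating the edge list first if the raw data repeats a dependency across versions.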
Find the most important projects by defining your own metric (other than in/out degree, e.g., a PageRank score). Then find the least important projects according to that metric.
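As one possible metric, here is a minimal PageRank sketch using plain power iteration. The damping factor of 0.85 and the fixed iteration count are conventional defaults rather than anything prescribed by the challenge, and the edge list is hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical edges: (project, project it depends on).
edges = [("app_a", "log4j"), ("app_b", "log4j"), ("app_c", "log4j"), ("app_a", "junit")]
rank = pagerank(edges)
for project in sorted(rank, key=rank.get, reverse=True):
    print(project, round(rank[project], 4))
```

Because edges point from a project to its dependency, a high score here means "heavily depended upon"; reversing the edges would give a different notion of importance.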
Find the strongly connected components (SCCs) in an open source graph. What insights can be gained from the number of SCCs?
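One way to find SCCs is Kosaraju's two-pass algorithm, sketched below with explicit stacks so that deep graphs do not hit Python's recursion limit. The edge list is hypothetical, and a graph library would work equally well if one is available.

```python
def strongly_connected_components(edges):
    """Kosaraju's algorithm over directed (src, dst) edges."""
    graph, reverse = {}, {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
        reverse.setdefault(dst, []).append(src)
        reverse.setdefault(src, [])

    # Pass 1: record vertices in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this vertex is finished.
                order.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components

# Hypothetical edges: a, b, c form a cycle; d only receives an edge.
components = strongly_connected_components([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(components)
```

An SCC with more than one vertex in a dependency graph corresponds to a circular dependency, which may itself be an insight worth reporting.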
What are the optimal ways to represent multiple types of relationships? For instance, multiple types of edges (each with a different meaning) between a pair of vertices.
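One simple option, not necessarily the optimal one, is to keep a separate adjacency map per relationship type. The class name `MultiRelationGraph` and the relation labels `depends_on` and `similar_to` below are assumptions made for illustration; the dataset does not define them.

```python
from collections import defaultdict

class MultiRelationGraph:
    """Directed graph that allows several typed edges between the same pair."""

    def __init__(self):
        # relation type -> source vertex -> set of target vertices
        self._adjacency = defaultdict(lambda: defaultdict(set))

    def add_edge(self, src, dst, relation):
        self._adjacency[relation][src].add(dst)

    def neighbours(self, src, relation):
        """Targets reachable from src through edges of one relation type."""
        return set(self._adjacency.get(relation, {}).get(src, set()))

    def relations_between(self, src, dst):
        """All relation types that connect src to dst."""
        return {
            relation
            for relation, adjacency in self._adjacency.items()
            if dst in adjacency.get(src, set())
        }

graph = MultiRelationGraph()
graph.add_edge("p1", "p2", "depends_on")   # hypothetical relation label
graph.add_edge("p1", "p2", "similar_to")   # hypothetical relation label
graph.add_edge("p1", "p3", "depends_on")
print(graph.relations_between("p1", "p2"))
print(graph.neighbours("p1", "depends_on"))
```

Keeping each relation in its own map makes it cheap to run a single-relation algorithm (for example, degree counts on dependencies only) without filtering a mixed edge list each time.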
Does the Six Degrees of Separation idea [1] also hold in the open source similarity graph? What is the average distance between two random projects?
Reference: [1] https://en.wikipedia.org/wiki/Six_degrees_of_separation
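The average-distance question can be approached with breadth-first search from every vertex. This sketch assumes the similarity graph is undirected and unweighted, and it averages only over pairs that are actually connected; the edge list is hypothetical.

```python
from collections import deque

def average_distance(edges):
    """Mean shortest-path length over all connected ordered pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    total, pairs = 0, 0
    for source in adjacency:
        # Standard BFS: distance[v] is the hop count from source to v.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total += sum(distance.values())
        pairs += len(distance) - 1  # exclude the source itself
    return total / pairs if pairs else float("nan")

# Hypothetical similarity edges forming a path a - b - c - d.
avg = average_distance([("a", "b"), ("b", "c"), ("c", "d")])
print(round(avg, 3))
```

Running BFS from every vertex costs time proportional to vertices times edges, so on a large graph it may be more practical to sample a few hundred source vertices at random and average over those.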