Candidate methods from the literature for checking project similarity:

1) Detecting similar repositories on GitHub (simpler and can be applied to any project, but gives no indication of whether collecting a project would actually be beneficial to performance). Attachment: Detecting similar repositories on GitHub.pdf
2) Detecting Similar Software Applications (based on API usage; only applicable to Java). Attachment: Detecting_similar_software_applications.pdf
3) MUDABlue: An Automatic Categorization System for Open Source Repositories (based on code similarity; not sure how it works yet). Attachment: MUDABlue.pdf
4) An automated approach to assess the similarity of GitHub repositories (a graph-based approach). Attachment: An automated approach to assess the similarity of GitHub repositories.pdf
5) Random selection (baseline).
What has been completed (Done)

1) ~1200 more projects completed (1000 Python and 200 Fortran), bringing the total to ~4000 projects.
2) Tried to compare the metrics in a release-based setting, but for many projects the release information is not stable (e.g., too few releases, releases of very different lengths, releases that contain no bugs).
3) Checked how static these metrics are, using Spearman correlation to measure how the metrics change from one commit to another (a sketch follows this list).
4) GENERAL paper submission.
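For reference, here is a minimal sketch of the stability check in item 3. It assumes the per-commit metric values live in pandas DataFrames indexed by file path, with one column per metric; the data layout and names are illustrative, not the actual pipeline.

```python
import pandas as pd
from scipy.stats import spearmanr

def stability_between(prev: pd.DataFrame, curr: pd.DataFrame) -> pd.Series:
    """Spearman correlation of each metric between two consecutive commits.

    Only files present at both commits are compared; a coefficient near 1.0
    means the metric's ordering of files is essentially unchanged, i.e. the
    metric is static from one commit to the next.
    """
    common = prev.index.intersection(curr.index)
    rhos = {}
    for metric in prev.columns.intersection(curr.columns):
        rho, _p = spearmanr(prev.loc[common, metric], curr.loc[common, metric])
        rhos[metric] = rho
    return pd.Series(rhos, name="spearman_rho")

# Aggregating over all consecutive snapshot pairs (snapshots is a
# hypothetical chronologically ordered list of per-commit DataFrames):
# pd.concat([stability_between(a, b) for a, b in zip(snapshots, snapshots[1:])],
#           axis=1).mean(axis=1)
```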
What are you working on (Doing)

1) Trying to code up the metrics from the paper; some of them are done.
2) Literature review on how to check whether two projects are similar. This will help answer the question of whether we need to collect a given project.
3) Data collection: 1000 new projects being mined (200 Python, 600 Java, and 200 Fortran).
4) Collecting data for project comparisons based on "Detecting similar repositories on GitHub" (see the data-fetching sketch after this list).
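As a reference for item 4, here is a minimal sketch of fetching the two inputs that paper relies on (the README and the stargazer list) through the GitHub REST API's documented /repos/{owner}/{repo}/readme and /repos/{owner}/{repo}/stargazers endpoints. The token handling and the page cap are simplified assumptions, not the actual collection script.

```python
import base64
import requests

API = "https://api.github.com"

def fetch_readme_and_stargazers(owner: str, repo: str, token: str,
                                max_pages: int = 10):
    """Fetch a repo's README text and stargazer logins via the GitHub REST API."""
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github+json"}

    # The README comes back base64-encoded inside the JSON payload.
    r = requests.get(f"{API}/repos/{owner}/{repo}/readme", headers=headers)
    r.raise_for_status()
    readme = base64.b64decode(r.json()["content"]).decode("utf-8", errors="replace")

    # Stargazers are paginated (at most 100 per page).
    stargazers = []
    for page in range(1, max_pages + 1):
        r = requests.get(f"{API}/repos/{owner}/{repo}/stargazers",
                         headers=headers,
                         params={"per_page": 100, "page": page})
        r.raise_for_status()
        batch = r.json()
        if not batch:
            break
        stargazers.extend(user["login"] for user in batch)
    return readme, set(stargazers)
```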
What you will be working on (To Do)

1) Work on collecting the rest of the process metrics.
2) Re-run the existing RQs with the new data.
3) Experiment on what information we can use to accurately decide what to collect and what not to. "Detecting similar repositories on GitHub" only uses the project README and who starred the project to decide whether two projects are similar in nature; it says nothing about whether a similar project will contain more information regarding bugs. (A sketch of a similarity score built from those two signals follows this list.)
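To make item 3 concrete, below is one way to turn exactly those two signals into a score: cosine similarity over TF-IDF vectors of the READMEs plus Jaccard overlap of the stargazer sets. This is my own illustration of the idea, not the scoring function from the paper, and the equal weighting is an arbitrary placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def readme_similarity(readme_a: str, readme_b: str) -> float:
    """Cosine similarity of TF-IDF vectors built from the two READMEs."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform([readme_a, readme_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def stargazer_similarity(stars_a: set, stars_b: set) -> float:
    """Jaccard overlap of the two stargazer sets."""
    if not stars_a or not stars_b:
        return 0.0
    return len(stars_a & stars_b) / len(stars_a | stars_b)

def project_similarity(readme_a, stars_a, readme_b, stars_b,
                       w_readme: float = 0.5) -> float:
    """Weighted combination of the two signals; the 0.5/0.5 split is arbitrary."""
    return (w_readme * readme_similarity(readme_a, readme_b)
            + (1 - w_readme) * stargazer_similarity(stars_a, stars_b))
```

Even if this score works, it still leaves the open question above: a high similarity says nothing about whether the candidate project adds bug information.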
Any Roadblocks

Questions regarding the new process metrics:

1) ADEV is the number of developers who changed the file. DDEV is the cumulative number of distinct developers who contributed to this file up to this release. How are they different? (A sketch of my current reading follows.)
2) NADEV, NDDEV, NCOMM, and NSCTR: for a given file F and release R, these metrics are based on the list of files co-committed with F, weighted by the frequency of co-commits during R. What does "weighted by the frequency" mean? Is it simply multiplying the values by the weights, or does it also involve some kind of normalization?
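Question 1 is still open, but the reading I would sanity-check against the paper is: ADEV counts developers who touched the file within release R only, while DDEV counts distinct developers over the whole history up to the end of R. A minimal sketch of that interpretation, assuming commits are simplified (author, file, release) records:

```python
from typing import Iterable, NamedTuple

class Commit(NamedTuple):
    author: str
    file: str
    release: int  # index of the release this commit falls in

def adev(commits: Iterable[Commit], file: str, release: int) -> int:
    """Developers who changed `file` during release `release` only."""
    return len({c.author for c in commits
                if c.file == file and c.release == release})

def ddev(commits: Iterable[Commit], file: str, release: int) -> int:
    """Distinct developers who touched `file` in any release up to `release`."""
    return len({c.author for c in commits
                if c.file == file and c.release <= release})
```

Under this reading, if Alice alone touches a file in release 1 and Alice and Bob touch it in release 2, then at release 2 both ADEV and DDEV are 2; if only Alice commits in release 3, ADEV drops to 1 while DDEV stays at 2. That divergence is the distinction I would confirm with the authors.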