Please add your entries in the following format:

- Filename:
- Function name:
- Required parameters:
- Returned parameters:
- If there is no return statement, what does the function do:
- Filename: data_collection.py; Function name: authenticate_repository; Required parameters: user_token, repository_name; Returned parameters: repository (the PyGithub repository object)
- Filename: data_collection.py; Function name: retrieve_issue_data; Required parameters: repository (the PyGithub repository object), state, contributor_data (dict); Returned parameters: contributor_data (dict)
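As a point of reference, here is a minimal sketch of what authenticate_repository could look like, assuming PyGithub; the real implementation in data_collection.py may differ:

```python
from github import Github  # PyGithub


def authenticate_repository(user_token, repository_name):
    """Return the PyGithub repository object for the given token and repository name."""
    # repository_name is expected in "owner/name" form for PyGithub's get_repo.
    return Github(user_token).get_repo(repository_name)
```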
The following functions are for the data_collection.py file.

| Function Name | Input Parameters | Returns |
|---|---|---|
| collect_commits_hash(repo_path) | Path to repository, string | List of dictionaries of commit info |
| get_commit_average(lines, commits) | lines and commits are int type | lines divided by commits, handles division by zero |
| parse_for_type(name) | Name of the file as a string | Splits the text and returns the file format as a string |
| get_file_formats(files) | List of strings of file names | List of strings of unique file types/formats |
| add_raw_data_to_json(path_to_repo, json_file_name) | TEMPORARY FUNCTION AND WILL BE REMOVED | No return, writes data to a .json file |
| calculate_individual_metrics(json_file_name) | String of the json file name | Nested dictionary of individual metrics |
| print_individual_in_table(json_file_name) | String of the json file name | No return, simply prints the data |
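For reference, a rough sketch of what the table describes for a few of these helpers; this is only an illustration, and the actual implementations in data_collection.py may differ:

```python
# Illustrative sketches that mirror the table descriptions, not the real code.
def get_commit_average(lines, commits):
    """Return lines divided by commits, guarding against division by zero."""
    if commits == 0:
        return 0
    return lines / commits


def parse_for_type(name):
    """Split a file name and return its format (extension) as a string."""
    parts = name.split(".")
    return parts[-1] if len(parts) > 1 else ""


def get_file_formats(files):
    """Return the unique file types/formats for a list of file names."""
    return sorted({parse_for_type(file_name) for file_name in files})
```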
Even though there is an additional file that our group worked on, called merge_duplicate_usernames, it is currently under rework and will be completely changed or deleted, so we have no updates regarding that file.
@koscinskic @noorbuchi Are there any more features the interface teams need information about, or does any of the information regarding these features need to be updated?
@johnsc1 There are more changes that will be coming soon; I have been working on this all day to get them done and merged to master as soon as possible. I can't guarantee that you'll get a full update with the new stuff today, but I will try to get it done as soon as I can.
@johnsc1 There is a small concern that you might have to consider as the command-line interface team. One of the aspects of gathering accurate data requires the user to interact and specify any duplicate usernames so that their data gets merged together. I have already created a function for that in the individual-metrics-lines PR, which I'm currently working on. However, I'm not sure how your team will choose to deal with this issue, so I wanted to let you all know. Let me know if you have any questions or if there are any requests to change the previously mentioned function.
Is it possible to create a function that can do an auto-merge, with some merge method provided to users? If the user doesn't choose to merge, then we work with the data including duplicates. If the user chooses to merge, we can call the function with the merge strategy specified.
Our task is to create a pipeline with the interface. You guys can just provide the functions. @noorbuchi
@liux2 I think not merging duplicate data would really skew the scores assigned to people because of how many duplicates there often are. I definitely thought about making an auto-merge function; the only issue is that there is no way to detect duplicates reliably. I tried to make such a function, but it often made false assumptions and merged the wrong data. Also, it often merged so little that user input was still required. @gkapfham proposed the solution we have right now, and I thought it would be the best approach.
@liux2 @noorbuchi Could this be solved by prompting the user in the CLI to enter the duplicate usernames, then passing a list of those usernames to this new feature? We should also try to follow the solution that @gkapfham proposed since he is the customer.
@johnsc1 I'm not sure how the CLI works, but in my approach to testing it in the main function, I printed the data repeatedly, asked the user to input the usernames they want to merge, then displayed the data again and asked whether they're done or want to merge other ones. I realize this is not the best approach, but I'm not sure how else to deal with this issue.
Additionally, I have refactored the Building And Testing Team's first two features in #71. The refactored code appears in the same functions, calculate_individual_metrics(json_file_name) and print_individual_in_table(json_file_name), talked about by @noorbuchi above. These are in the file data_collection.py.
For feature teams, thanks for your hard work and please let us Interface teams know when your features are done! Without features we can't really do any real work on our interfaces.
@johnsc1 @lussierc @liux2 I will post a comprehensive update on the latest changes to the data_collection file. I haven't had the chance to make the changes just yet, but more updates will be coming soon and I'll put them on this issue tracker.
@MaddyKapfhammer I hope to communicate with you today about the specific features of the Team Evaluation so that we can create a table for the interface team here.
From #45, just to update you all.
Main Tasks left to complete for Web Interface:
There is now a function for retrieve_token that takes in a file path and returns the user's token, provided it is stored locally in token.txt. Otherwise, if no input parameter is given or the file does not exist, it returns the Travis token. The Travis token is purely for testing and will not be able to mine the repo.
There are still some minor issues with the test cases, which is why I introduced the method in the first place. It's functionally a check/pass-through, but necessary for stable testing.
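A minimal sketch of the retrieve_token behavior described above; the fallback environment variable name is an assumption, and the real function may read the Travis token differently:

```python
import os


def retrieve_token(file_path="token.txt"):
    """Return the user's token from a local file, or fall back to the Travis token."""
    if file_path and os.path.exists(file_path):
        with open(file_path, "r") as token_file:
            return token_file.read().strip()
    # Fallback used only for testing on Travis; it cannot mine the repo.
    # The variable name GITHUB_TOKEN is a placeholder assumption.
    return os.environ.get("GITHUB_TOKEN", "")
```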
Update from the latest PR in individual-metrics-lines, which we are trying to get merged. This table shows the function descriptions.

| Function Name | Input Parameters | Returns |
|---|---|---|
| collect_commits_hash(repo_path) | Path to repository, string | List of dictionaries of commit info |
| get_commit_average(lines, commits) | lines and commits are int type | lines divided by commits, handles division by zero |
| parse_for_type(name) | Name of the file as a string | Splits the text and returns the file format as a string |
| get_file_formats(files) | List of strings of file names | List of strings of unique file types/formats |
| collect_and_add_raw_data_to_json(path_to_repo, json_file_name="raw_data_storage", data_path="./data/", overwrite=True) | Shortcut method that collects raw data using PyDriller, then writes it to json | No return, writes data to a .json file |
| collect_and_add_individual_metrics_to_json(read_file="raw_data_storage", write_file="individual_metrics_storage", data_path="./data/", overwrite=True) | This function skips calculation steps, do not use unless that's intended | No return, writes data to a .json file |
| calculate_individual_metrics(json_file_name="raw_data_storage", data_path="./data/") | No parameters are necessary if using the default files to read data | Nested dictionary of individual metrics |
| print_individual_in_table(file_name="individual_metrics_storage", data_dict={}, headings=["EMAIL", "COMMITS", "ADDED", "REMOVED"]) | Prints either from a dictionary or a file, takes a list of headings as dictionary keys | No return, simply prints the data |
| merge_metric_and_issue_dicts(metrics_dict, issues_dict) | Merges a dictionary with PyDriller data and a dictionary with PyGithub data | Returns a merged dictionary |
| merge_duplicate_usernames(dictionary, kept_entry, removed_entry) | Entries are data set keys intended to be merged, they are strings | Returns a dictionary with merged entries |
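To illustrate the last row, a hedged sketch of how merging two data-set keys might look; the real merge_duplicate_usernames may combine more than just numeric metrics:

```python
# Illustration only: fold the metrics stored under removed_entry into kept_entry.
def merge_duplicate_usernames(dictionary, kept_entry, removed_entry):
    for metric, value in dictionary[removed_entry].items():
        if isinstance(value, (int, float)):
            dictionary[kept_entry][metric] = dictionary[kept_entry].get(metric, 0) + value
    del dictionary[removed_entry]
    return dictionary
```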
The following functions are for the data_processor.py file, specifically the TEAM EVALUATION portion. NOTE: This has not yet been merged to master.
| Function Name | Input Parameters | Returns |
|---|---|---|
| iterate_nested_dictionary(dictionary) | Nested dictionary given by the add_new_metrics function | A new dictionary with a metric ("COMMITS", "ADDED", etc.) as the key and a list of those values as the value |
| calculate_iqr_score(data_list, below_weight, above_weight, within_weight) | A list of data points, 3 int values for the calculations | A calculated IQR score for the specific list of data points (percentage) |
| calculate_team_score(dictionary, below_weight, above_weight, within_weight) | Nested dictionary given by the add_new_metrics function, 3 int values for the calculations | A calculated team score found by adding together all the scores for each metric category and dividing by the number of categories (an average score, as a percentage) |
The function calculate_team_score() uses the other two functions, so it is the only one that needs to be called to return a value (which is the team score). If you want to display the scores for each category, these can also be accessed through calculate_team_score().
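A hypothetical call from the interface side, assuming the dictionary comes from add_new_metrics, the parameter names match the table above, and the three weights come from the CLI arguments:

```python
# Fragment intended for cogitate.py; placeholder weights shown, use the values
# the user specifies on the command line.
team_score = data_processor.calculate_team_score(
    updated_dictionary, below_weight=1, above_weight=1, within_weight=1
)
print("Team Score:", team_score)
```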
@MaddyKapfhammer @bagashvilit When I spoke to Teona yesterday, we talked about a print method being available to put output in the terminal. I know you guys had a working function - is that something CLI team would call, or should we implement it in cogitate.py?
@cklima616 I'll explain what my functions do in a moment. As for the print function, I did not create it; I just use a pandas DataFrame. You guys can import that in your file and use it for the display.
@cklima616 I have not created a print method for the team evaluation functions.
@MaddyKapfhammer One other question: do you need the CLI to call the calculate_iqr_score method if the user provides below/above/within weights? Or are those just going to remain hard-coded/calculated some other way?
@cklima616 The CLI only needs to call the calculate_team_score method. The user does need to provide the below/above/within weights for this function, as the customer said that it would be better to specify that information than to have it hard-coded.
@cklima616 If you need to print a nested dictionary, I suggest using the function in the data_collection module that our team wrote. You will simply need to follow the parameters outlined in the function and you can get a table with all the information you need. Please let me know if you have questions on how to use it.
The following table has functions from the data_processor.py file.

| Function Name | Input Parameters | Returns |
|---|---|---|
| add_new_metrics(dictionary) | The dictionary you get from the data_collection functions | Returns an updated dictionary with new metrics such as TOTAL and MODIFIED RATIO |
| individual_contribution(dictionary) | The dictionary returned from the add_new_metrics function | Returns a nested dictionary where the keys are the username and metrics, and the values are the percentage of individual contribution |
When I finished working on my program, to demonstrate the results I used a pandas DataFrame. All you need to do is import pandas as pd and print(pd.DataFrame.from_dict(dictionary).T), where the parameter dictionary is the dictionary that needs to be printed out. This is one of the ways to represent the data and it does not have to be this way.
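A self-contained example of that display approach, using a made-up metrics dictionary (the usernames and values are only examples):

```python
import pandas as pd

# Hypothetical nested metrics dictionary; keys and values are illustrative.
metrics = {
    "user_one": {"COMMITS": 12, "ADDED": 340, "REMOVED": 80},
    "user_two": {"COMMITS": 7, "ADDED": 150, "REMOVED": 40},
}
# Transpose so that each username becomes a row of the table.
print(pd.DataFrame.from_dict(metrics).T)
```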
I refactored the Building And Testing Team's first two features in #71. The refactored code appears in the function calculate_individual_metrics(json_file_name, data_path).
@bagashvilit does the add_new_metrics function need to be called before both the team-based functions and the individual-based functions, or just before the individual-based functions? Also, are there default values that we can use for below_weight, above_weight, and within_weight, or is that a question for the professor?
@JMilamber Yes, because both the individual and overall functions need to use the updated dictionary. I would recommend calling the add_new_metrics function and, once you get an updated dictionary, using that for both the individual and overall functions. Please let me know if you have any further questions.
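In code, the recommended order would look something like this; the variable names are illustrative and the weights are placeholders for the user-provided values:

```python
# Fragment intended for cogitate.py: update the dictionary once, then reuse it.
updated = data_processor.add_new_metrics(individual_metrics_dict)
individual_scores = data_processor.individual_contribution(updated)
team_score = data_processor.calculate_team_score(updated, below, above, within)
```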
@MaddyKapfhammer Could you comment on the default values for the weights?
@JMilamber the default values that can be used for the weights are as follows:
you could also use:
@MaddyKapfhammer Okay thank you. Those will be added.
@bagashvilit Okay, thank you for the update. I will move the call so both individual and team get the updated dictionary.
I'm assuming you all have figured this out already, but I'm fairly certain initialize_contributor_data() is already done by one of the functions covered by the individual metrics team, so it should be safe to remove.
@bagashvilit @MaddyKapfhammer I am receiving this error when running our program:
```
To see the output in the web, simply add '-w yes' to your command line arguments.
Traceback (most recent call last):
  File "src/cogitate.py", line 164, in <module>
    main(args)
  File "src/cogitate.py", line 43, in main
    new = team(dict)
  File "src/cogitate.py", line 125, in team
    updated = data_processor.add_new_metrics(new_dict)
  File "/Users/sitstatic/cs203S2020/cogitate_tool/src/data_processor.py", line 133, in add_new_metrics
    for key in dictionary:
TypeError: 'int' object is not iterable
```
@bagashvilit @MaddyKapfhammer Upon reviewing this error, I realize it's because of how I called the add_new_metrics function. calculate_team_score returns an int, and I tried to pass that through add_new_metrics, which requires a dictionary. I am going to continue working this out, but if there's something I'm missing let me know!
@MaddyKapfhammer Since you worked on the team function - does it only return one single score? I was under the impression it would be some sort of score by branch/small group. If it's just one single baseline score, that's fine, I'll just have to change the code.
@cklima616 It does return one single team score. Only the individual evaluation returns a dictionary.
@bagashvilit Thank you! After fixing this, here is the current output of our program:
```
To see the output in the web, simply add '-w yes' to your command line arguments.
Empty DataFrame
Columns: []
Index: []
0
```
@cklima616 I'll take a look at your branch as soon as I get the chance
https://github.com/GatorCogitate/cogitate_tool/blob/a9d23c00194a1e53a81fa5a9304def685a312a60/src/cogitate.py#L155 Here, instead of this, it should be:

```python
updated = data_processor.add_new_metrics(dict)
new_dict = data_processor.individual_contribution(updated)
```
@cklima616 You should also double-check with data_collection team to make sure that you are getting data correctly
@noorbuchi I have implemented all functions as described, but am still getting an empty output.
```
To see the output in the web, simply add '-w yes' to your command line arguments.
+----------+-------+---------+-------+---------+
| Username | EMAIL | COMMITS | ADDED | REMOVED |
+----------+-------+---------+-------+---------+
+----------+-------+---------+-------+---------+
Team Score:
0
```
If you have a chance, could you ensure I have implemented methods for data collection correctly?
@cklima616 I will do that soon. Are these calls in the cogitate.py file?
@noorbuchi Yes! Primarily in the main method, lines 34-40.
```python
data_collection.collect_and_add_raw_data_to_json(
    args["link"], "raw_data_storage.json"
)
# allows the user to enter the merge while loop if they specified to
data_collection.collect_and_add_individual_metrics_to_json()
# calculate metrics to be used for team evaluation
individual_metrics_dict = data_collection.calculate_individual_metrics()
if args["metric"] == "team":
    team(individual_metrics_dict, args["below"], args["above"], args["within"])
elif args["metric"] == "individual":
    individual(individual_metrics_dict)
elif args["metric"] == "both":
    new_individual_metrics_dict = individual(individual_metrics_dict)
    team(
        new_individual_metrics_dict,
        args["below"],
        args["above"],
        args["within"],
    )
```
I will be specifically referring to the code mentioned above; it is from lines 34-52. Your call to collect_and_add_raw_data_to_json is correct; however, you do not have to specify the name of the file if you're using the default one. On the other hand, collect_and_add_individual_metrics_to_json should not be called. This is a shortcut method that writes the individual metrics to the json file without adding any calculated data or any issue data, so it skips some steps. Instead, use the calculate_individual_metrics function just like you have done on line 40. This function creates a dictionary by reading from the default json files unless otherwise specified. After getting the calculated metrics dictionary, you should get the PyGithub data and use the function merge_metric_and_issue_dicts to get a dictionary that contains all of the uncalculated information. Then you should prompt the user to merge duplicate usernames while printing the table. To do that, you can make a while-true loop that exits at a specific condition. Once all the metrics are ready, you can send the dictionary to add_new_metrics in the data_processor module, which adds the calculated metrics to the dictionary.

Once you have the dictionary, even with skipping some steps, you can print it out in two ways. Either send the dictionary directly as a parameter, like this: print_individual_in_table(data_dict=your_dictionary, headings=["EMAIL", "COMMITS", "ADDED", "REMOVED"]) (this is just a list of headings that you can use; you can add or remove headings if you prefer, but remember that the headings have to be the same as the keys from the dictionary). The other way of printing the dictionary is print_individual_in_table(headings=["EMAIL", "COMMITS", "ADDED", "REMOVED"]), which will take individual_metrics_storage as a parameter by default and read from that json file. Make sure that the dictionary is written to the json file before using the latter; you can do that through write_dict_to_json_file in the json_handler module. I hope this was helpful, please let me know if there are any additional questions.
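Putting the steps above together, here is a rough sketch of the pipeline as a fragment for cogitate.py's main function. The args keys for the token and repository name, and the exact prompt loop, are assumptions rather than the real cogitate.py code:

```python
# Collect raw data with PyDriller and write it to the default json file.
data_collection.collect_and_add_raw_data_to_json(args["link"])

# Calculate individual metrics from the default raw-data file.
metrics_dict = data_collection.calculate_individual_metrics()

# Get the PyGithub issue data and merge it with the PyDriller metrics.
# args["token"] and args["repo"] are placeholder argument names.
repository = data_collection.authenticate_repository(args["token"], args["repo"])
issues_dict = data_collection.retrieve_issue_data(repository, "all", {})
merged_dict = data_collection.merge_metric_and_issue_dicts(metrics_dict, issues_dict)

# Let the user merge duplicate usernames until they are done.
while True:
    data_collection.print_individual_in_table(data_dict=merged_dict)
    if input("Merge duplicate usernames? (y/n) ").lower() != "y":
        break
    kept = input("Username to keep: ")
    removed = input("Username to remove: ")
    merged_dict = data_collection.merge_duplicate_usernames(merged_dict, kept, removed)

# Add the calculated metrics, then print the final table.
final_dict = data_processor.add_new_metrics(merged_dict)
data_collection.print_individual_in_table(
    data_dict=final_dict, headings=["EMAIL", "COMMITS", "ADDED", "REMOVED"]
)
```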
@noorbuchi is the PyGitHub data just the retrieve_issue_data output? I'm working on this now.
@JMilamber yes, retrieve_issue_data would allow you to do that. The call would look like this: retrieve_issue_data(repository, state, contributor_data), where repository is a repository object you can get from authenticate_repository, state can be a string of "all", "open", or "closed", and contributor_data is a dictionary (an empty one would work).
Okay, sounds good. Thank you @noorbuchi, I'll let you know when it's been updated in the branch.
This is an issue where the teams working on features can provide a table of their features containing the function name, parameters, and return data so the interface teams are able to integrate the features as soon as possible.
Note: These tables need to be updated as code is refactored so the interfaces are correct.