Hi Julian,
Could you please review the changes I have made and let me know if any improvements are needed?
The changes are:
I have split the function check_and_clean_data(df) into two functions, remove_duplicates(df) and remove_null_values(df), in the utils/initial_data_preparation.py script.
Applying these two functions to original_df is still a work in progress, so at the moment I am not applying them. I am importing them into the script src/sankey_diagram_plotly.py because I need their return values.
The script was saving dataframes only for the repo domains which had more than 10 repos. I have added a section that creates a dataframe for all other repo domains, which have fewer repos. This dataframe is saved as other_domains.csv.
I have added a 'repodomain' column to 'original_df' to categorise the data by hosting platform domain.
I have implemented new nodes in the Sankey diagram for 'Duplicates' and 'No Repo Name'.
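To make the changes listed above easier to picture, here is a minimal sketch of how the pieces could fit together. It is not the actual code in utils/initial_data_preparation.py. The column name repourl, the (dataframe, count) return values, and the helpers add_repo_domain and split_by_domain are assumptions made purely for illustration. Only remove_duplicates, remove_null_values, 'repodomain' and other_domains.csv come from this PR.

```python
from urllib.parse import urlparse

import pandas as pd


def remove_duplicates(df):
    """Drop exact duplicate rows; return the cleaned frame and the number dropped."""
    deduplicated = df.drop_duplicates()
    return deduplicated, len(df) - len(deduplicated)


def remove_null_values(df, column="repourl"):
    """Drop rows where `column` is null; `repourl` is a hypothetical column name."""
    cleaned = df.dropna(subset=[column])
    return cleaned, len(df) - len(cleaned)


def add_repo_domain(df, url_column="repourl"):
    """Hypothetical helper: derive 'repodomain' from the host part of each URL."""
    out = df.copy()
    out["repodomain"] = out[url_column].map(
        lambda url: urlparse(url).netloc if isinstance(url, str) else None
    )
    return out


def split_by_domain(df, min_repos=10):
    """Hypothetical helper: one frame per domain with more than `min_repos`
    repos, plus a single frame collecting every other domain."""
    counts = df["repodomain"].value_counts()
    large_domains = counts[counts > min_repos].index
    per_domain = {d: df[df["repodomain"] == d] for d in large_domains}
    other_domains = df[~df["repodomain"].isin(large_domains)]
    return per_domain, other_domains


# Tiny made-up example; the threshold is lowered to 2 so the split is visible.
demo = pd.DataFrame(
    {
        "repourl": [
            "https://github.com/a/x",
            "https://github.com/a/x",  # exact duplicate of the row above
            "https://github.com/b/y",
            "https://github.com/c/z",
            "https://gitlab.com/d/w",
            None,  # missing URL
        ]
    }
)
deduplicated, n_duplicates = remove_duplicates(demo)
cleaned, n_nulls = remove_null_values(deduplicated)
with_domain = add_repo_domain(cleaned)
per_domain, other_domains = split_by_domain(with_domain, min_repos=2)
# other_domains.to_csv("other_domains.csv", index=False) would write the file.
```

Returning the number of removed rows next to the cleaned dataframe is one way the Sankey script could get the values it needs for the 'Duplicates' and 'No Repo Name' nodes without recomputing them.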
As per our discussion, more work is required before we can proceed with the Sankey diagram and add more nodes (for whether cloning the repos succeeded or failed). This will be addressed in another issue.
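As a rough illustration of how extra nodes could be added later, the sketch below builds the node and link dictionaries in the shape that plotly's go.Sankey accepts. It is not the code in src/sankey_diagram_plotly.py. The function name build_sankey_data and the labels 'All records' and 'Remaining' are hypothetical, and the counts are made up.

```python
def build_sankey_data(total, n_duplicates, n_no_repo_name):
    """Return (node, link) dicts for a one-level Sankey diagram.

    'Duplicates' and 'No Repo Name' are the nodes from this PR; the other
    two labels are placeholders chosen for this sketch.
    """
    remaining = total - n_duplicates - n_no_repo_name
    if remaining < 0:
        raise ValueError("removed rows exceed the total number of rows")

    labels = ["All records", "Duplicates", "No Repo Name", "Remaining"]
    index = {label: i for i, label in enumerate(labels)}

    # Each flow is (source label, target label, number of rows).
    flows = [
        ("All records", "Duplicates", n_duplicates),
        ("All records", "No Repo Name", n_no_repo_name),
        ("All records", "Remaining", remaining),
    ]

    node = {"label": labels}
    link = {
        "source": [index[source] for source, _, _ in flows],
        "target": [index[target] for _, target, _ in flows],
        "value": [value for _, _, value in flows],
    }
    return node, link


node, link = build_sankey_data(total=100, n_duplicates=5, n_no_repo_name=3)
# With plotly installed, the figure could then be created with:
#   import plotly.graph_objects as go
#   fig = go.Figure(go.Sankey(node=node, link=link))
```

With this layout, nodes for successful and failed clones would only need new labels and extra entries in the flows list.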
Thanks