Hi Julian,

Please see below:

I have created the script utils/initial_data_preparation.py. It reads a TSV file into a Pandas DataFrame, performs data cleaning and preprocessing steps, and saves the results as CSV files. It also checks the DataFrame for null values and duplicate rows.

In addition, the rows of the original DataFrame whose code is hosted on the github.com domain are extracted and saved to data/original_github_df.csv.

I have amended the scripts src/github_repo_requests.py and src/github_repo_request_local.py to read their input from data/original_github_df.csv and to create separate DataFrames when adding a new column for the count of test files.

Could you please review the changes and let me know if further amendments are required?
Thanks
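
For orientation while reviewing, the preparation steps described in the message might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual contents of utils/initial_data_preparation.py: the function names `prepare` and `with_test_file_count`, the column names `url` and `test_file_count`, and the specific cleaning steps (dropping exact duplicates, stripping whitespace) are all hypothetical.

```python
from urllib.parse import urlparse

import pandas as pd


def prepare(tsv_source, url_column="url"):
    """Read a TSV, collect data-quality counts, and split out github.com rows.

    `url_column` is an assumed column name; the real dataset may differ.
    """
    df = pd.read_csv(tsv_source, sep="\t")

    # Data-quality checks: nulls per column and fully duplicated rows.
    null_counts = df.isnull().sum()
    duplicate_count = int(df.duplicated().sum())

    # Assumed cleaning: drop exact duplicates and strip whitespace from URLs.
    cleaned = df.drop_duplicates().copy()
    cleaned[url_column] = cleaned[url_column].str.strip()

    # Keep rows whose URL host is github.com; missing URLs are excluded.
    hosts = cleaned[url_column].map(
        lambda u: urlparse(u).netloc.lower() if isinstance(u, str) else ""
    )
    github_df = cleaned[hosts.isin(["github.com", "www.github.com"])]
    return cleaned, github_df, null_counts, duplicate_count


def with_test_file_count(github_df, counts):
    """Return a new DataFrame with the extra column; the input is unchanged."""
    return github_df.assign(test_file_count=counts)


# Saving the subset would then be a single call, for example:
# github_df.to_csv("data/original_github_df.csv", index=False)
```

Using `DataFrame.assign` is one way to get the "separate DataFrame" behaviour mentioned above, since it returns a copy rather than modifying the frame it is called on. Matching on the parsed host rather than a substring avoids picking up URLs that merely mention github.com in their path.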