The goal is to produce a .ttl file that contains all the information about the repo URLs. Previously, rows with null values, duplicates, or incomplete URLs were dropped. In this approach no rows are dropped, and the history of actions taken on each row is captured in a new column (explanation).
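
A minimal sketch of the flag-instead-of-drop idea, using plain dicts in place of dataframe rows so it runs without extra dependencies. The helper name flag_rows, the boolean flag values, the wording of the explanation text, and the sample rows are all assumptions made for illustration; the actual scripts may compute and store these differently.

```python
from collections import Counter


def flag_rows(rows):
    """Return copies of the rows with flag columns added; no row is dropped."""
    # Count how often each non-empty repourl occurs across all rows.
    counts = Counter(r["repourl"] for r in rows if r.get("repourl"))
    flagged = []
    for r in rows:
        url = r.get("repourl")
        out = dict(r)
        out["null_value_flag"] = url is None or url == ""
        out["duplicate_flag"] = bool(url) and counts[url] > 1
        # Record why the row was flagged (illustrative wording only).
        notes = []
        if out["null_value_flag"]:
            notes.append("repourl is missing")
        if out["duplicate_flag"]:
            notes.append("repourl appears more than once")
        out["explanation"] = "; ".join(notes)
        flagged.append(out)
    return flagged


# Made-up sample data.
rows = [
    {"projectref": "A", "repourl": "https://github.com/org/a"},
    {"projectref": "B", "repourl": "https://github.com/org/a"},
    {"projectref": "C", "repourl": None},
]
flagged = flag_rows(rows)
```

Every input row is still present in the output, so downstream steps can decide for themselves which flagged rows to skip.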
The other addition is writing selected data, in a fixed format, to a JSON Lines file that will be used to create a Sankey diagram @(#56).
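
For readers unfamiliar with JSON Lines, the sketch below writes one JSON object per line and reads the file back. The record fields (source, target, value), the counts, and the file name are hypothetical placeholders; the format actually used for the Sankey diagram is not specified here.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical Sankey-style link records (made-up numbers).
links = [
    {"source": "all rows", "target": "duplicate", "value": 2},
    {"source": "all rows", "target": "unique", "value": 5},
]
path = os.path.join(tempfile.mkdtemp(), "sankey_links.jsonl")
write_jsonl(links, path)
```

Because each line is an independent JSON document, records can be appended or streamed without rewriting the whole file.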
The script utils/initial_data_preparation.py produces a dataframe with the columns: ['projectref', 'nlnetpage', 'repourl', 'duplicate_flag', 'null_value_flag', 'repodomain', 'domain_extraction_flag', 'incomplete_url_flag', 'base_repo_url', 'base_repo_url_flag']
The script src/github_repo_request_local.py adds these 4 columns: 'testfilecountlocal', 'clone_status', 'last_commit_hash', 'explanation'
This df is then passed to the function dataframe_to_ttl in utils/export_to_rdf.py, which now has access to all of these columns.
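
To illustrate how row data with these columns could end up as Turtle, here is a stdlib-only sketch. It is not the dataframe_to_ttl implementation: the function rows_to_ttl, the http://example.org/ namespace, the use of projectref as the subject, and the choice to skip None values (RDF has no null) are all assumptions, and the sample row is made up.

```python
from urllib.parse import quote

PREFIX = "http://example.org/"  # hypothetical namespace


def to_literal(value):
    """Render a Python value as a Turtle literal."""
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    for old, new in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(old, new)
    return f'"{text}"'


def rows_to_ttl(rows, key="projectref"):
    """Serialize each row as one subject with one predicate per column."""
    lines = [f"@prefix ex: <{PREFIX}> .", ""]
    for row in rows:
        subject = f"<{PREFIX}project/{quote(str(row[key]), safe='')}>"
        # Skip the key column and any None values.
        pairs = [(c, v) for c, v in row.items() if c != key and v is not None]
        if not pairs:
            continue
        body = " ;\n".join(f"    ex:{col} {to_literal(val)}" for col, val in pairs)
        lines.append(f"{subject}\n{body} .")
        lines.append("")
    return "\n".join(lines)


# Made-up sample row using a few of the column names listed above.
sample = [{
    "projectref": "2021-02-001",
    "repourl": "https://github.com/org/a",
    "duplicate_flag": False,
    "testfilecountlocal": 3,
    "explanation": None,
}]
ttl = rows_to_ttl(sample)
```

In a real export a library such as rdflib would normally handle serialization and escaping; the manual string building here only keeps the sketch dependency-free.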