commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License

Enhance RDF file to include comprehensive repository url processing status #68

Closed tnzmnjm closed 1 month ago

tnzmnjm commented 2 months ago

The goal is to produce a .ttl file that contains all the information about the repourls. Previously, rows with null values, duplicates, or incomplete URLs were dropped. In this approach no rows are dropped, and the history of actions taken on each row is captured in a new column (explanation).
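A minimal sketch of the "flag, don't drop" idea, assuming plain dicts stand in for dataframe rows so the example has no dependencies. The helper name `flag_rows` and the explanation wording are hypothetical; only the column names come from this issue.

```python
def flag_rows(rows):
    """Annotate every row instead of dropping it.

    `rows` is a list of dicts with a 'repourl' key. The flag names follow
    the issue text; the explanation strings are invented for illustration.
    """
    seen = set()
    flagged = []
    for row in rows:
        url = row.get("repourl")
        notes = []

        # A missing or blank URL is flagged, not removed.
        is_null = url is None or str(url).strip() == ""
        if is_null:
            notes.append("repourl is null")

        # Only the second and later occurrences count as duplicates.
        is_duplicate = (not is_null) and url in seen
        if is_duplicate:
            notes.append("duplicate of an earlier repourl")
        if not is_null:
            seen.add(url)

        flagged.append({
            **row,
            "null_value_flag": is_null,
            "duplicate_flag": is_duplicate,
            "explanation": "; ".join(notes),
        })
    return flagged


rows = [
    {"projectref": "A", "repourl": "https://github.com/example/one"},
    {"projectref": "B", "repourl": None},
    {"projectref": "C", "repourl": "https://github.com/example/one"},
]
flagged = flag_rows(rows)
```

The output has the same number of rows as the input, so later steps can decide per row what to do based on the flags.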

The other addition is writing specific data, in a specific format, to a JSON Lines file, which will be used to create a Sankey diagram (#56).
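The record layout for the Sankey input is not spelled out here, so the following is only a sketch of the JSON Lines mechanics: one JSON object per line, restricted to a chosen set of fields. The helper `write_jsonl`, the field selection, and the `"success"` status value are assumptions for illustration.

```python
import io
import json


def write_jsonl(rows, fields, handle):
    """Write one JSON object per line, keeping only `fields` from each row."""
    for row in rows:
        record = {field: row.get(field) for field in fields}
        handle.write(json.dumps(record) + "\n")


# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_jsonl(
    [
        {"repourl": "https://github.com/example/one", "clone_status": "success", "extra": 1},
        {"repourl": None, "clone_status": None, "extra": 2},
    ],
    fields=["repourl", "clone_status"],
    handle=buffer,
)
lines = buffer.getvalue().splitlines()
```

Missing values are written as JSON `null`, so rows that failed earlier steps still appear in the file.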

The script utils/initial_data_preparation.py will produce a dataframe with the columns: ['projectref', 'nlnetpage', 'repourl', 'duplicate_flag', 'null_value_flag', 'repodomain', 'domain_extraction_flag', 'incomplete_url_flag', 'base_repo_url', 'base_repo_url_flag']
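As an illustration of how the URL-derived columns might be computed, here is a sketch using the standard library's `urllib.parse.urlparse`. The function name `describe_repo_url`, the rule that a complete URL needs at least two path segments (owner and repository), and the meaning of each flag (True when the value was extracted) are assumptions; the actual script may use different rules.

```python
from urllib.parse import urlparse


def describe_repo_url(repourl):
    """Derive domain and base-URL columns for one repourl.

    Assumption: a complete URL has a host plus at least two path segments
    (owner and repository). Hosts with nested groups would need a different rule.
    """
    parsed = urlparse(repourl or "")
    domain = parsed.netloc.lower() or None
    segments = [s for s in parsed.path.split("/") if s]

    incomplete = domain is None or len(segments) < 2
    base = None
    if not incomplete:
        # Keep only scheme, host, owner and repository.
        base = f"{parsed.scheme}://{domain}/{segments[0]}/{segments[1]}"

    return {
        "repodomain": domain,
        "domain_extraction_flag": domain is not None,
        "incomplete_url_flag": incomplete,
        "base_repo_url": base,
        "base_repo_url_flag": base is not None,
    }


full = describe_repo_url("https://github.com/example/one/tree/main/docs")
partial = describe_repo_url("https://github.com/example")
```

Here `full` yields a base URL that stops at the repository, while `partial` is marked incomplete because it names an owner but no repository.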

The script src/github_repo_request_local.py will add these 4 columns: 'testfilecountlocal', 'clone_status', 'last_commit_hash', 'explanation'

This dataframe is then passed to the function dataframe_to_ttl in the script utils/export_to_rdf.py, which now receives the information from all of these columns.
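To make the export step concrete, below is a dependency-free sketch that serializes rows to Turtle by hand. It is not the project's dataframe_to_ttl, which may well use an RDF library and a different vocabulary. The function `rows_to_ttl`, the `ex:` prefix, its IRI, and the `ex:RepoRecord` class are all hypothetical, and every value is written as a plain string literal.

```python
def turtle_literal(value):
    """Quote a value as a Turtle string literal, escaping special characters."""
    text = str(value)
    # Backslash must be escaped first so later escapes are not doubled.
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return f'"{text}"'


def rows_to_ttl(rows, columns, prefix_iri="https://example.org/nlnet#"):
    """Emit one subject per row; columns with a None value are skipped."""
    lines = [f"@prefix ex: <{prefix_iri}> .", ""]
    for index, row in enumerate(rows):
        # Every row gets a subject, even if most of its columns are empty.
        parts = ["a ex:RepoRecord"] + [
            f"ex:{column} {turtle_literal(row[column])}"
            for column in columns
            if row.get(column) is not None
        ]
        lines.append(f"ex:row{index} " + " ;\n    ".join(parts) + " .")
        lines.append("")
    return "\n".join(lines)


ttl = rows_to_ttl(
    [
        {"repourl": "https://github.com/example/one", "explanation": 'said "ok"'},
        {"repourl": None, "explanation": "repourl is null"},
    ],
    columns=["repourl", "explanation"],
)
```

Because each row keeps its subject regardless of missing values, the second row still appears in the output with only its explanation attached.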