Closed julianharty closed 2 months ago
In terms of reporting, I'd like to experiment with both data reporting and visual reporting.
Data reporting would include RDF files, probably in Turtle format, and formats that are easy to process further, e.g. as DataFrames and/or in online services such as BigQuery.
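As a rough sketch of the data-reporting side, assuming rdflib and pandas are available (the namespace and predicate names below are invented for illustration, not the project's actual vocabulary):

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/repo-stats/")  # invented vocabulary

g = Graph()
g.bind("ex", EX)
repo = URIRef("https://github.com/example-owner/example-repo")  # placeholder
g.add((repo, RDF.type, EX.Repository))
g.add((repo, EX.testFileCount, Literal(42, datatype=XSD.integer)))
g.serialize(destination="repo_stats.ttl", format="turtle")      # Turtle output

# The same records kept tabular, ready for DataFrame work or a BigQuery load.
df = pd.DataFrame([{"repourl": str(repo), "test_file_count": 42}])
df.to_csv("repo_stats.csv", index=False)
```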
Visual reporting could include graphs, plots, and especially Sankey diagrams (https://python-graph-gallery.com/sankey-diagram/), which may help make data quality issues easy to spot. For repos that are successfully queried, it could also include groupings of results, such as the count of test files, tests, etc.
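As one possible starting point, a minimal Sankey sketch with plotly (the linked gallery shows other options); the stage labels and counts here are invented:

```python
import plotly.graph_objects as go

# Invented counts showing how repourls could flow through the pipeline stages.
labels = ["all repourls", "valid URL", "invalid URL", "clone succeeded", "clone failed"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1],          # edges leave "all repourls" and "valid URL"
        target=[1, 2, 3, 4],
        value=[900, 100, 750, 150],   # placeholder numbers
    ),
))
fig.update_layout(title_text="Repo processing outcomes (illustrative)")
fig.show()
```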
Changes made:

- Moved the functions `filter_out_incomplete_urls` and `get_base_repo_url` to the script utils/initial_data_preparation.py.
- Updated tests/test_github_repo_request_local.py accordingly: imported the functions `filter_out_incomplete_urls` and `get_base_repo_url` from `utils.initial_data_preparation`.
- Ran the tests --> all passed successfully.

Data quality issues found:

- `repourl`s that have the owner but not a repo.
- `null` values when fetching the last commit hash fails or when a `repourl` is invalid (`not isinstance(url, str)`, `len(parts) < 5`, the URL is None or empty, invalid URL formats, URLs with single quotes, angle brackets, etc.); see the sketch after this list.
- Repos that cannot be cloned and exit with status code 280. I believe this is because these repos are huge (they have lots of contributors as well). There is an attempt to clone them: I see the repo folder appear and then disappear before this error is thrown.
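A sketch of how those invalid-URL cases might be caught in one place; the function name is hypothetical and the rules are inferred from the checks listed above:

```python
from urllib.parse import urlparse

def is_valid_repourl(url) -> bool:
    """Hypothetical helper mirroring the failure cases listed above."""
    if not isinstance(url, str):            # None or non-string values
        return False
    url = url.strip()
    if not url:                             # empty string
        return False
    if any(ch in url for ch in "'<>"):      # single quotes, angle brackets, etc.
        return False
    parts = url.split("/")
    if len(parts) < 5:                      # e.g. owner present but repo missing
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.netloc == "github.com"
```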
I propose recording these in a problematic_repos file (which later on can be converted to the .ttl format if required); the columns would be `repourl`, `repo_presence`, `null_issues`, `clone_success`, and `issue_description`.
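A minimal sketch of producing such a file; the column names follow the proposal above, while the helper and the shallow-clone idea are my own assumptions:

```python
import subprocess
import pandas as pd

def try_clone(repourl: str, dest: str) -> tuple[bool, str]:
    """Attempt a clone and return success plus a short issue description.
    The shallow clone (--depth 1) is an untested guess at coping with huge repos."""
    result = subprocess.run(
        ["git", "clone", "--depth", "1", repourl, dest],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return True, ""
    return False, f"git exited with status {result.returncode}"

# Example rows; the column names follow the proposal above.
rows = [
    {"repourl": "https://github.com/example-owner", "repo_presence": False,
     "null_issues": True, "clone_success": False,
     "issue_description": "URL has owner but no repo"},
]
pd.DataFrame(rows).to_csv("problematic_repos.csv", index=False)
```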
## Context
Currently we have three scripts connected by intermediate data files. Some processing is duplicated between the code that queries repos remotely on github.com and the code that clones repos locally before performing local analysis.
We also don't provide much reporting of entries that lack data or that are no longer available at the specified URL.
Code such as `filter_out_incomplete_urls(...)` seems better located in the data preparation than in the data processing. `check_and_clean_data(...)` currently doesn't clean the data; it only provides a very brief summary of the issues it discovers.
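One possible direction for `check_and_clean_data(...)`, sketched under the assumption that the data is a pandas DataFrame with a `repourl` column (the real signature may differ): return both a cleaned frame and a detailed per-row issue report instead of only printing a brief summary.

```python
import pandas as pd

def check_and_clean_data(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Sketch: return the cleaned frame *and* a per-row issue report."""
    issues = []
    for idx, url in df["repourl"].items():
        if not isinstance(url, str) or not url.strip():
            issues.append({"row": idx, "repourl": url, "issue": "missing or empty URL"})
    report = pd.DataFrame(issues)
    cleaned = df.drop(index=report["row"]) if not report.empty else df
    return cleaned, report
```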