Closed julianharty closed 2 months ago
In terms of reporting, I'd like to experiment with both data reporting and visual reporting.
Data reporting would include RDF files, probably in Turtle format, and formats that are easy to process further, e.g. as DataFrames and/or in online services such as BigQuery.
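As a rough sketch of the data-reporting side, assuming rdflib and pandas are available (the namespace and predicate names below are invented for illustration, not the project's actual vocabulary):

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/repo-stats/")  # invented vocabulary

g = Graph()
g.bind("ex", EX)
repo = URIRef("https://github.com/example-owner/example-repo")  # placeholder
g.add((repo, RDF.type, EX.Repository))
g.add((repo, EX.testFileCount, Literal(42, datatype=XSD.integer)))
g.serialize(destination="repo_stats.ttl", format="turtle")      # Turtle output

# The same records kept tabular, ready for DataFrame work or a BigQuery load.
df = pd.DataFrame([{"repourl": str(repo), "test_file_count": 42}])
df.to_csv("repo_stats.csv", index=False)
```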
Visual reporting could include graphs, plots, and especially Sankey diagrams (https://python-graph-gallery.com/sankey-diagram/), which may help make data quality issues easy to spot. For repos that are successfully queried, it could also include groupings of results, such as the count of test files, tests, etc.
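As one possible starting point, a minimal Sankey sketch with plotly (the linked gallery shows other options); the stage labels and counts here are invented:

```python
import plotly.graph_objects as go

# Invented counts showing how repourls could flow through the pipeline stages.
labels = ["all repourls", "valid URL", "invalid URL", "clone succeeded", "clone failed"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1],          # edges leave "all repourls" and "valid URL"
        target=[1, 2, 3, 4],
        value=[900, 100, 750, 150],   # placeholder numbers
    ),
))
fig.update_layout(title_text="Repo processing outcomes (illustrative)")
fig.show()
```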
Changes made:

- Moved the functions `filter_out_incomplete_urls` and `get_base_repo_url` to the script utils/initial_data_preparation.py.
- Updated tests/test_github_repo_request_local.py accordingly: imported the functions `filter_out_incomplete_urls` and `get_base_repo_url` from `utils.initial_data_preparation`.
- Ran the tests --> all passed successfully.

Data quality issues found:

- `repourl`s that have the owner but not a repo.
- `null` values when fetching the last commit hash fails or when a `repourl` is invalid (`not isinstance(url, str)`, `len(parts) < 5`, the URL is None or empty, invalid URL formats, URLs with single quotes, angle brackets, etc.); see the sketch after this list.
- Repos that cannot be cloned and exit with status code 280. I believe this is because these repos are huge (they have lots of contributors as well). There is an attempt to clone them: I see the repo folder appear and then disappear before this error is thrown.
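A sketch of how those invalid-URL cases might be caught in one place; the function name is hypothetical and the rules are inferred from the checks listed above:

```python
from urllib.parse import urlparse

def is_valid_repourl(url) -> bool:
    """Hypothetical helper mirroring the failure cases listed above."""
    if not isinstance(url, str):            # None or non-string values
        return False
    url = url.strip()
    if not url:                             # empty string
        return False
    if any(ch in url for ch in "'<>"):      # single quotes, angle brackets, etc.
        return False
    parts = url.split("/")
    if len(parts) < 5:                      # e.g. owner present but repo missing
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.netloc == "github.com"
```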
I propose recording these in a problematic_repos file (which later on can be converted to the .ttl format if required); the columns would be `repourl`, `repo_presence`, `null_issues`, `clone_success`, and `issue_description`.
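A minimal sketch of producing such a file; the column names follow the proposal above, while the helper and the shallow-clone idea are my own assumptions:

```python
import subprocess
import pandas as pd

def try_clone(repourl: str, dest: str) -> tuple[bool, str]:
    """Attempt a clone and return success plus a short issue description.
    The shallow clone (--depth 1) is an untested guess at coping with huge repos."""
    result = subprocess.run(
        ["git", "clone", "--depth", "1", repourl, dest],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return True, ""
    return False, f"git exited with status {result.returncode}"

# Example rows; the column names follow the proposal above.
rows = [
    {"repourl": "https://github.com/example-owner", "repo_presence": False,
     "null_issues": True, "clone_success": False,
     "issue_description": "URL has owner but no repo"},
]
pd.DataFrame(rows).to_csv("problematic_repos.csv", index=False)
```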
## Context
Currently we have three scripts connected by intermediate data files. Some processing is duplicated between the code that queries repos remotely on github.com and the code that clones repos locally before performing local analysis.
We also don't provide much reporting of entries that lack data or that are no longer available at the specified URL.
Code such as `filter_out_incomplete_urls(...)` seems better located in the data preparation than in the data processing. `check_and_clean_data(...)` currently doesn't clean the data; it only provides a very brief summary of the issues it discovers.
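One possible direction for `check_and_clean_data(...)`, sketched under the assumption that the data is a pandas DataFrame with a `repourl` column (the real signature may differ): return both a cleaned frame and a detailed per-row issue report instead of only printing a brief summary.

```python
import pandas as pd

def check_and_clean_data(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Sketch: return the cleaned frame *and* a per-row issue report."""
    issues = []
    for idx, url in df["repourl"].items():
        if not isinstance(url, str) or not url.strip():
            issues.append({"row": idx, "repourl": url, "issue": "missing or empty URL"})
    report = pd.DataFrame(issues)
    cleaned = df.drop(index=report["row"]) if not report.empty else df
    return cleaned, report
```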