Analysis of the open-source codebases of NLnet-sponsored projects.

This project offers several forms of analysis, including programming-language detection using the guesslang utility. (Note: we're using a fork of the guesslang project, as the original hasn't been updated to support more recent versions of Python.)

The main objective is to identify characteristics of the existing and current testing practices recorded in the open-source repos of projects that have received funding from the NLnet Foundation. These details may then help us distill approaches that could benefit several of these projects concurrently.
The source file is in TSV (Tab Separated Values) format. The columns are: project code, public page, code repository. Some projects have multiple repos; each repo is on its own row in the dataset.
In general, much of the work will be identified in this repo's issues at https://github.com/commercetest/nlnet/issues, and various more general notes will be recorded in Wiki pages at https://github.com/commercetest/nlnet/wiki
After the `nlnet` repository has been cloned, two options are available for configuring the repository structure, based on the intended use of the provided `.tsv` file:

1. The `.tsv` file may be placed directly in the `data` folder within the project repository. This setup allows the scripts to be run without additional configuration.
2. Alternatively, a separate directory may be created to hold the `.tsv` file. For example, a directory named `nlnet_data` may be created. Once the `.tsv` file is placed in this new directory, the path must be specified when running the script: the `--input-file` flag should be used, followed by the relative path to the `.tsv` file, as shown below:

```
--input-file nlnet_data/tsv_file_name.tsv
```
Informative docstrings detailing the functionality and the supported command-line arguments have been provided at the beginning of each script. Additionally, the `Scripts` section in the README.md offers further guidance on using these arguments.
I'm using miniforge to manage the Python environment, including packages.

```
conda create --name commercetest-nlnet python=3.10 pandas
conda activate commercetest-nlnet
pip install -r requirements.txt
```
For GitHub API queries this project uses a Personal Access Token (PAT). A PAT was generated for authentication with the GitHub API (expiration: 90 days; scope: `public_repo`) at:

https://github.com/settings/tokens?type=beta
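As a minimal sketch of how such a token might be used (the `GITHUB_TOKEN` environment variable name and the use of the `requests` library are illustrative assumptions, not necessarily how this project's scripts read the PAT):

```python
import os

import requests

# Read the PAT from an environment variable rather than hard-coding it.
token = os.environ["GITHUB_TOKEN"]
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.github+json",
}

# Check the authenticated rate limit to confirm the token works.
response = requests.get("https://api.github.com/rate_limit", headers=headers)
response.raise_for_status()
print(response.json()["resources"]["core"])
```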
We'd like to learn by doing; this includes experimenting with various code quality tools and techniques. Currently we're experimenting with `ruff`, pre-commit checks, and using `pytest` to generate test reports which are then post-processed to provide a coverage badge.
```
pytest --junit-xml=reports/junit/junit.xml --html=reports/junit/report.html
genbadge tests --output-file reports/junit/tests-badge.svg
```

```
pytest --cov-report xml:reports/coverage/coverage.xml --cov .
genbadge coverage --output-file reports/coverage/coverage-badge.svg
```
The pre-commit hooks are run automatically by `git`. They can also be run interactively (see https://pre-commit.com/#4-optional-run-against-all-the-files), which can help to reduce the number of aborted commits for whitespace and other minor issues. Here's an example of how to run them:

```
pre-commit run --all-files
```
`initial_data_preparation.py`:

This script processes a TSV file to generate a DataFrame, from which it extracts domains and organises entries into separate DataFrames based on these domains. Each entry is stored as a record under its domain. Each domain-specific DataFrame is saved as a CSV file if it contains more than 10 records; domains with 10 or fewer records are combined and saved together as `other_domains.csv`. This approach ensures that data is systematically organised and readily accessible for further analysis.
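The core grouping logic might look roughly like this (a minimal sketch; the input filename and the `code repository` column name are hypothetical placeholders, not the script's actual implementation):

```python
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd

# Hypothetical input path; in practice this comes from --input-file.
df = pd.read_csv("data/input_file.tsv", sep="\t")
# Extract the domain (e.g. "github.com") from each repository URL.
df["domain"] = df["code repository"].map(lambda url: urlparse(str(url)).netloc)

output_dir = Path("data")
small_groups = []
for domain, group in df.groupby("domain"):
    if len(group) > 10:
        group.to_csv(output_dir / f"{domain}.csv", index=False)
    else:
        small_groups.append(group)

# Domains with 10 or fewer records are combined into a single file.
if small_groups:
    pd.concat(small_groups).to_csv(output_dir / "other_domains.csv", index=False)
```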
Designed to be highly flexible, the script supports command-line arguments, allowing users to specify custom paths for the input TSV file and the output directories. This makes the script ideal for integration into automated workflows where paths may vary.
- `--input-file`: Specifies the path to the input TSV file.
- `--output-folder`: Specifies the directory where output CSV files and other results will be saved.

To use this script, you can specify all necessary command-line arguments based on your requirements. For example:

```
python initial_data_preparation.py --input-file path/to/input_file.tsv --output-folder path/to/output_directory
```
`git_utils.py`:
This utility module enhances Python scripts by providing robust functions for managing and interacting with Git repositories. It's designed to facilitate the identification of the Git repository's root and to ensure that scripts operate relative to the correct directory, especially in environments where the execution context might vary.
- **Git Repository Root Detection:** determines the root directory of the current Git repository using the `git rev-parse --show-toplevel` command. This is crucial for scripts that need to operate relative to the repository root.
- **Context-Aware Path Management:** offers a method to reliably determine the appropriate base directory for file operations, whether inside a Git repository or not, thus enhancing script portability and robustness.
The functions in this module have been imported into other scripts wherever interaction with a Git repository is required.
```python
from utils.git_utils import get_working_directory_or_git_root

work_dir = get_working_directory_or_git_root()
print(f"Working directory or Git root: {work_dir}")
```
`github_repo_request_local.py`:
This script automates the process of cloning the GitHub repositories listed in a CSV file, counts the number of test files in each repository, and saves both the count and the last commit hash back to the CSV. Additionally, it writes each repository URL, followed by the names of all test files found within that repository, to a specified text file, facilitating detailed record-keeping and auditing of test files across repositories. The script is designed to handle interruptions and errors robustly by independently verifying the completion of each critical operation: cloning, commit hash retrieval, test file counting, and the writing of test file records. It saves progress incrementally and can resume where it left off, ensuring that data from previous runs is properly managed.
This script includes several enhancements to improve its functionality, and users can customise their experience through several command-line arguments:
- `--exclude`: Specify file extensions to exclude from test file counts.
- `--clone-dir`: Set a custom directory for cloning the repositories.
- `--keep-clones`: Option to retain cloned repositories after processing.
- `--input-file`: Path to the input CSV file.
- `--output-file`: Path to the output CSV file that includes test file counts and last commit hashes.
- `--test-file-list`: Path to the text file for recording repository URLs and test file names.
- `--ttl-file`: Path to save the Turtle (TTL) format file.

To use this script, you can specify all necessary command-line arguments based on your requirements. For example:
```
python script.py --exclude ".md,.txt" --clone-dir "/path/to/clone/directory" --input-file "repos.csv" --output-file "output.csv" --keep-clones
```
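As an illustration of the kind of test-file counting described above (a minimal sketch; the heuristic of matching `test` in file names, and the exclusion handling, are assumptions about the script's approach):

```python
from pathlib import Path


def find_test_files(repo_dir: Path, excluded_extensions: set[str]) -> list[Path]:
    """Collect files that look like tests, skipping excluded extensions."""
    test_files = []
    for path in repo_dir.rglob("*"):
        if not path.is_file():
            continue
        if path.suffix in excluded_extensions:
            continue
        # Heuristic: treat any file with "test" in its name as a test file.
        if "test" in path.name.lower():
            test_files.append(path)
    return test_files


matches = find_test_files(Path("data/cloned_repos/example_repo"), {".md", ".txt"})
print(f"Found {len(matches)} test files")
```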
`github_repo_requests.py`:
This script facilitates the analysis of GitHub repositories by interfacing with the GitHub API. It processes a CSV file containing repository URLs, counts the number of test files within each repository, fetches the latest commit information, and updates this data back into the CSV file.
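A hedged sketch of how the latest commit might be fetched from the GitHub REST API (the endpoint and parsing shown are illustrative, not necessarily the script's exact calls):

```python
import os

import requests


def fetch_latest_commit(owner: str, repo: str) -> dict:
    """Return the most recent commit on the default branch via the GitHub API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    response = requests.get(url, headers=headers, params={"per_page": 1})
    response.raise_for_status()
    latest = response.json()[0]
    return {"sha": latest["sha"], "date": latest["commit"]["committer"]["date"]}


print(fetch_latest_commit("commercetest", "nlnet"))
```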
`sankey_diagram_plotly.py`:
This script visualises the usage of various test runners across cloned repositories. It scans repositories for test patterns and dependencies, and visualises the data using a Sankey diagram in a web browser. Please see an example of the resulting diagram below.
The script is intended to be run with command-line arguments specifying paths for input data and output. Command-line usage example:

```
python sankey_diagram_plotly.py --clone-dir=data/cloned_repos --input-file=data/input.csv --output-file=data/output.csv
```
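For reference, a minimal Plotly Sankey diagram looks roughly like this (the node labels and link values below are made up for illustration):

```python
import plotly.graph_objects as go

# Hypothetical flows from repositories to the test runners they use.
fig = go.Figure(
    go.Sankey(
        node=dict(label=["repo-a", "repo-b", "pytest", "cargo test"]),
        link=dict(
            source=[0, 0, 1],  # indices into the node labels
            target=[2, 3, 2],
            value=[5, 2, 7],   # e.g. number of test files per runner
        ),
    )
)
fig.show()  # opens the diagram in a web browser
```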
`supabase_db_interactions.py`:

This script demonstrates how to interact with a Supabase database using Python. It utilises the `dotenv` library to load environment variables from a `.env` file, and the `supabase-py` library to perform database operations. The script includes functions to write data to and read data from a Supabase table.
Command-Line Arguments:

- `--logfile-path`: Specify the path to the logfile. If not provided, it defaults to `supabase/write_to_db.log` in the working directory (if the current directory is not within a Git repository) or the Git root (the path to the top-level directory of the current Git repository).

Example:

```
python supabase_script.py --logfile-path "/path/to/logfile.log"
```
Environment Variables:

- `SUPABASE_URL`: The URL of the Supabase instance.
- `SUPABASE_KEY`: The API key for accessing the Supabase instance.

Your `.env` file should look more-or-less like this, depending on your actual values for these two environment variables:

```
SUPABASE_URL=https://your-supabase-url.supabase.co
SUPABASE_KEY=your-api-key
```
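A minimal sketch of the kind of write and read operations involved (the table name `repositories` and its row fields are assumptions for illustration, not the script's actual schema):

```python
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()  # reads SUPABASE_URL and SUPABASE_KEY from the .env file
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

# Write a row to a hypothetical "repositories" table.
supabase.table("repositories").insert(
    {"url": "https://github.com/example/repo", "test_file_count": 12}
).execute()

# Read the rows back.
rows = supabase.table("repositories").select("*").execute()
print(rows.data)
```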
`guesslang_to_db.py`:
This script provides functionality to analyse files within cloned repositories, detect their programming languages using the guesslang library, and write the results to a Supabase database.
Run the script with the appropriate `--clone-dir` argument specifying the directory where repositories are cloned, or let it use the default directory (located in the `cloned_repos` folder under the `data` directory).
Example:

```
python script.py --clone-dir /path/to/cloned_repos
```
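Language detection with guesslang follows this pattern (a minimal sketch reading a single hypothetical file, rather than walking a whole clone directory as the script does):

```python
from pathlib import Path

from guesslang import Guess

guess = Guess()

# Read a source file and ask guesslang which language it most resembles.
source = Path("data/cloned_repos/example_repo/main.c").read_text(errors="ignore")
language = guess.language_name(source)
print(f"Detected language: {language}")
```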
Work on the data analysis of NLnet projects is also maintained in https://codeberg.org/NGI0Review/harvest (with test coverage tracked online at https://artifacts.nlnet.nl/harvest/main/coverage/). In future, some or all of this repo's work may migrate there; for the moment this repo facilitates exploration and experimentation.