commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License

TBD suitable techniques for querying git repos #7

Open julianharty opened 5 months ago

julianharty commented 5 months ago

Context

NLnet funds open source projects, which host their code on a variety of code hosting services including GitHub, GitLab, and others. Some of these, including GitHub, provide mechanisms to query the codebases they host. The mechanisms are likely to vary in their APIs, methods, and the data they return. We will start by using GitHub's APIs to bootstrap the analysis of information about testing, and iterate from there. This will help shape our understanding of which information is pertinent, and in turn how we might obtain that information from the other code hosting services.

Objectives

Abstractions

Broadly the information can be obtained from various sources, including asking:

  1. the developer(s)
  2. the testing framework(s)
  3. the hosting provider e.g. github.com
  4. the operating system and file system
  5. git

Each provides distinct facets of information, including on the accuracy, completeness, and perception of any tests related to the repo. We aim to obtain answers from at least one of these for every repository supported by the NLnet foundation. Where practical, the one providing the most insight will be chosen, and at least one of the code-based sources will be queried in addition to whatever perspective developers can provide.

Querying git

Much of the underlying information will be in an instance of the codebase that includes the git history. Therefore it'll be worth us investigating how we might interface with a git repo independently of where the repo is hosted, e.g. on a local cloned copy of the respective repo.
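As a starting point, a local clone can be queried by calling the git command line from Python. The sketch below is only one possible approach: it assumes git is installed and that repo is the path to a local clone. The helper names (git_output, head_commit, tracked_test_paths) are hypothetical, and the run parameter exists only so the git call can be swapped out when experimenting.

```python
import subprocess


def git_output(repo, *args, run=subprocess.run):
    """Run `git -C <repo> <args>` and return its stdout as a list of lines."""
    result = run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def head_commit(repo, run=subprocess.run):
    """Return the hash of the commit currently checked out in the clone."""
    return git_output(repo, "rev-parse", "HEAD", run=run)[0]


def tracked_test_paths(repo, run=subprocess.run):
    """Return tracked paths containing 'test' (a heuristic, not proof of tests)."""
    return [p for p in git_output(repo, "ls-files", run=run) if "test" in p.lower()]
```

Because this only relies on git itself, it should work the same way regardless of which service hosts the repo.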

Work in this area

Source code analysis of codebases is an active area of academic research, e.g. as part of Mining Software Repositories (MSR), and there are very likely to be tools and techniques we can apply to help us with our work.

julianharty commented 5 months ago

Filesystem queries for local instances of code repos

Local repos contain project files (and files created and maintained by git). Therefore filesystem queries can obtain pertinent information contained in the project files without requiring code that 'understands' git. Python 3 includes the pathlib module (see https://realpython.com/python-pathlib/), which seems a useful starting point, for instance to find program files that match commonly used extensions such as .java for Java, .kt for Kotlin, .js for JavaScript, and .py for Python. Paths can also be queried for substrings such as test that may indicate the existence of code intended for testing purposes, albeit there may be false positives.

This approach doesn't collect any git-related information.
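A minimal sketch of such a filesystem query using pathlib follows. The extension mapping and the function names (find_source_files, looks_test_related) are illustrative assumptions rather than anything already in the project, and the 'test' substring check is a heuristic that will produce false positives.

```python
from pathlib import Path

# Assumption: a small illustrative mapping; extend it as needed.
SOURCE_EXTENSIONS = {
    ".java": "Java",
    ".kt": "Kotlin",
    ".js": "JavaScript",
    ".py": "Python",
}


def find_source_files(repo_root):
    """Return (relative path, language) pairs for files with a known extension."""
    root = Path(repo_root)
    found = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip the files git creates and maintains for itself.
        if ".git" in relative.parts:
            continue
        language = SOURCE_EXTENSIONS.get(path.suffix.lower())
        if language and path.is_file():
            found.append((relative, language))
    return found


def looks_test_related(relative_path):
    """Heuristic: 'test' appears somewhere in the path (case-insensitive)."""
    return "test" in relative_path.as_posix().lower()
```

Working with paths relative to the repo root means the name of the folder the clone lives in cannot trigger the 'test' heuristic by accident.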

julianharty commented 5 months ago

Related work

julianharty commented 5 months ago

Sense checking

To rely on untested code on unknown codebases would be sad and poor practice. Let's cross-check using various techniques to increase our confidence in the results our software returns.

Basic sanity checks

Commands such as the *nix find command can be used to search for files and folders that match text we provide. As many developers apply conventions, such as placing automated tests in a directory branch such as src/test/, we can use these as heuristics when searching for code that contains tests. Additional *nix utilities can be used alongside it, e.g. grep for filtering and wc to count items.

In the folder spring-petclinic/src/test of a local clone of the repo, the command find . -type f -print0 | xargs -0 file | grep -i source returns:

find . -type f -print0 | xargs -0 file | grep -i source
./java/org/springframework/samples/petclinic/vet/VetTests.java:                           C++ source text, ASCII text
./java/org/springframework/samples/petclinic/vet/VetControllerTests.java:                 C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/PetControllerTests.java:               C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/OwnerControllerTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/VisitControllerTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/PetTypeFormatterTests.java:            C++ source text, ASCII text
./java/org/springframework/samples/petclinic/system/CrashControllerTests.java:            C++ source text, ASCII text
./java/org/springframework/samples/petclinic/system/CrashControllerIntegrationTests.java: C++ source text, ASCII text
./java/org/springframework/samples/petclinic/PetClinicIntegrationTests.java:              Java source text, ASCII text
./java/org/springframework/samples/petclinic/model/ValidatorTests.java:                   C++ source text, ASCII text
./java/org/springframework/samples/petclinic/MySqlIntegrationTests.java:                  C++ source text, ASCII text
./java/org/springframework/samples/petclinic/service/EntityUtils.java:                    Java source, ASCII text
./java/org/springframework/samples/petclinic/service/ClinicServiceTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/MysqlTestApplication.java:                   Java source text, ASCII text
./java/org/springframework/samples/petclinic/PostgresIntegrationTests.java:               Java source text, ASCII text

The file command identified many of the Java files as C++, hence the use of grep to match source rather than Java: I'd prefer to err on the side of including the files identified as C++ since they probably contain tests. It's possible, and may be productive, to drill into the various files to extract the names of individual tests. That's a later exercise, and not needed just yet.
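For when that later exercise comes around, here is a rough sketch of extracting test names from Java source text. It assumes JUnit-style tests marked with an @Test annotation on a void method, and it uses a regular expression rather than a real Java parser, so it will miss some tests and could misread unusual formatting. The names TEST_METHOD and java_test_names are hypothetical.

```python
import re

# Heuristic pattern, not a Java parser.
TEST_METHOD = re.compile(
    r"@Test\b(?:\([^)]*\))?"               # @Test, optionally with arguments
    r"(?:\s*@\w+(?:\([^)]*\))?)*"          # any further annotations that follow
    r"\s+(?:public\s+)?void\s+(\w+)\s*\("  # capture the method name
)


def java_test_names(source_text):
    """Return the names of methods that appear to be annotated with @Test."""
    return TEST_METHOD.findall(source_text)
```

Counting the returned names per file would give one candidate value for a per-repo count of tests, to be cross-checked against other techniques.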

We're more interested in filenames than folder names at this stage. Nonetheless find . -type d -iname '*test*' -print performs a case-insensitive search for folders/directories that have the word test in their name. Some projects may keep tests under a test folder branch in files whose names do not include the word test. The output of this command can be passed to xargs to perform a subsequent search for source files that might include automated tests. We'd then want to investigate the contents of those files to determine if they do actually include tests.
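The same two-step search could also be done from Python, which may be easier to fold into our software than shell pipelines. This is a sketch under assumptions: the function names (find_test_dirs, files_under) and the default list of extensions are illustrative only.

```python
from pathlib import Path


def find_test_dirs(repo_root):
    """Directories whose own name contains 'test', compared case-insensitively."""
    root = Path(repo_root)
    return sorted(
        p for p in root.rglob("*") if p.is_dir() and "test" in p.name.lower()
    )


def files_under(dirs, extensions=(".java", ".kt", ".js", ".py")):
    """Files with one of the given extensions anywhere beneath the directories."""
    seen = set()  # nested test directories would otherwise yield duplicates
    for directory in dirs:
        for path in directory.rglob("*"):
            if path.is_file() and path.suffix.lower() in extensions:
                seen.add(path)
    return sorted(seen)
```

This picks up files such as a helper class inside a test folder even when the filename itself does not mention test.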

julianharty commented 5 months ago

Some interim thoughts

I wonder if it'd be worth us amending the data frame(s) so that they can record:

  1. the method by which the information was obtained
  2. the commit (and perhaps the branch if needed)
  3. the count of tests
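One way to sidestep nesting would be to keep one flat row per repo and method. The sketch below uses only the standard library; the RepoObservation class, its field names, and the sample values in the usage note are all hypothetical. A list of dicts like the one to_rows returns can be passed directly to pandas.DataFrame.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RepoObservation:
    repo: str                     # hypothetical identifier, e.g. a clone URL
    method: str                   # how the information was obtained
    commit: str                   # commit hash the count applies to
    test_count: int               # the count of tests found by that method
    branch: Optional[str] = None  # only recorded if needed


def to_rows(observations):
    """Convert observations to a list of flat dicts, one per row."""
    return [asdict(observation) for observation in observations]
```

For example, to_rows([RepoObservation("example-repo", "filesystem", "0000000", 12)]) yields a single row with branch left as None; these values are made up purely for illustration.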

I don't yet know enough about pandas dataframes to understand if they support structures within cells and/or nested data. Similarly, we'll eventually need to communicate this information to NLnet using an RDF structure, and that may place its own constraints on how the info can be communicated. TODO