commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License

Create sankey diagram starting from loading the list of NLnet repos until we get as many test counts as practical #56

Closed julianharty closed 2 months ago

julianharty commented 2 months ago

Context

This extends the work in #53, and is the first of our visual reports.

To implement this we may want or need to revise how and where the various interim results are written, e.g. for duplicate repos and for those that have incomplete URLs.
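As a rough sketch of how those interim results could be captured (the column name `repourl`, the slash-count heuristic, and the file paths here are assumptions for illustration, not the project's actual ones):

```python
import pandas as pd

# Assumed input: the list of NLnet repo URLs gathered earlier in the pipeline.
df = pd.read_csv("data/nlnet_repos.csv")

# Crude illustrative heuristic: a full GitHub URL such as
# https://github.com/owner/repo contains four '/' characters,
# so fewer than four suggests the owner or repo name is missing.
incomplete = df[df["repourl"].str.count("/") < 4]

# Repos listed more than once.
duplicates = df[df.duplicated(subset="repourl", keep=False)]

# Writing these to dedicated interim files would let later stages (and the
# Sankey diagram) count how many repos drop out at each step.
incomplete.to_csv("data/interim/incomplete_urls.csv", index=False)
duplicates.to_csv("data/interim/duplicate_repos.csv", index=False)
```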

Location of the report

Let's generate and save the Sankey figures in https://github.com/commercetest/nlnet/tree/main/reports/graphs/sankey-diagram-of-analysis. For now, we can overwrite any existing report in that location (which is what the pytest report does) and commit updates to the repo, as git will preserve the history of the updates. We may consider alternatives once the reports are working.
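A minimal sketch of how such a figure might be generated and saved to that location, assuming plotly is used; the node labels, counts, and output filename below are placeholders, not real results:

```python
from pathlib import Path
import plotly.graph_objects as go

# Illustrative labels and counts only; real values would come from the pipeline.
labels = [
    "NLnet repos",
    "Incomplete URLs",
    "Duplicate repos",
    "Cloned",
    "Filenames containing 'test'",
]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 0, 3],  # indices into labels
        target=[1, 2, 3, 4],
        value=[10, 5, 300, 150],
    ),
))

# Overwrite any previous report in the agreed location; git keeps the history.
out_dir = Path("reports/graphs/sankey-diagram-of-analysis")
out_dir.mkdir(parents=True, exist_ok=True)
fig.write_html(str(out_dir / "sankey.html"), include_plotlyjs="cdn")
```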

tnzmnjm commented 2 months ago
julianharty commented 2 months ago

[Image: sketch of the proposed column structure for the Sankey diagram]

Source: https://excalidraw.com/#json=2Tm8zQNYoMsjyprmWu54A,1xL5i112-bvprQ5RMm4WCw or https://excalidraw.com/?#json=WOhT64qVRYw4u3OXHKKIV,lKwcdN0p7xq6-DDW5fmnMg

The above figure provides some ideas on how we might structure the columns of the Sankey diagram. It illustrates how the test pass rate might be obtained for each project. It doesn't go into any detail about how we'd actually make this work, as that's a significant and distinct topic in itself.

So far we've managed to gather enough of the data to provide counts of filenames that include 'test' somewhere in the filename (level 2a in this figure). I believe we can also detect test runner scripts fairly easily (level 2b in this figure). There are likely to be gaps in what we detect automatically, and we may want to invest time in reducing those gaps to a practical minimum.
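A minimal sketch of the level 2a count, assuming each repo has been cloned locally; the function name and example path are hypothetical:

```python
from pathlib import Path

def count_test_filenames(repo_dir: str) -> int:
    """Count files in a cloned repo whose filename contains 'test' (level 2a)."""
    return sum(
        1
        for p in Path(repo_dir).rglob("*")
        if p.is_file() and "test" in p.name.lower()
    )

# Example usage against a hypothetical local clone:
# print(count_test_filenames("cloned_repos/some-nlnet-project"))
```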

julianharty commented 2 months ago

Some additional notes on Sankey diagrams that might be helpful if/when we refine the diagram:

tnzmnjm commented 2 months ago

Working on section 2b of the diagram --> detecting test runners

Detecting Test Runners in a Repository:
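One possible starting point, again assuming a local clone of each repo. The mapping of marker files to test runners below is an illustrative assumption and is far from exhaustive; several entries (e.g. `package.json`, `Makefile`) would need further inspection of their contents to confirm a test runner is actually configured:

```python
from pathlib import Path

# Assumed marker files and the test runners they suggest (illustrative only).
TEST_RUNNER_MARKERS = {
    "pytest.ini": "pytest",
    "tox.ini": "tox",
    "noxfile.py": "nox",
    "package.json": "npm/yarn test script (needs further inspection)",
    "Makefile": "make test target (needs further inspection)",
    "pom.xml": "Maven Surefire",
    "build.gradle": "Gradle test task",
    "Cargo.toml": "cargo test",
    "go.mod": "go test",
}

def detect_test_runners(repo_dir: str) -> set[str]:
    """Return the set of test runners suggested by marker files in a cloned repo."""
    found = set()
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.name in TEST_RUNNER_MARKERS:
            found.add(TEST_RUNNER_MARKERS[path.name])
    return found
```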