commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License
0 stars 0 forks source link

Document structure of the data file and make the code location independent #9

Closed julianharty closed 4 months ago

julianharty commented 5 months ago

Context

The project does not contain the data file, this is provided separately. It's important to document the structure and expected location of that file in the repo so that others can use it successfully. The script should also be able to find the correct folder regardless of where it's run from. I ran the code from the root of the project and that means the folder is at ./data rather than ../data.

Changes requested

python src/repo_request.py     
2024-03-10 08:55:27.981 | INFO     | __main__:main:125 - Could not find existing file at '../data/github_df.csv', creating a new file.
2024-03-10 08:55:27.981 | ERROR    | __main__:load_data:22 - Error Loading data from path ../data/df with exception [Errno 2] No such file or directory: '../data/df'
Traceback (most recent call last):
  File "/Users/julianharty/NLnet-projects/commercetest-nlnet/src/repo_request.py", line 168, in <module>
    github_df = main()
  File "/Users/julianharty/NLnet-projects/commercetest-nlnet/src/repo_request.py", line 131, in main
    check_and_clean_data(df)
  File "/Users/julianharty/NLnet-projects/commercetest-nlnet/src/repo_request.py", line 28, in check_and_clean_data
    null_counts = df.isnull().sum()
AttributeError: 'NoneType' object has no attribute 'isnull'
(commercetest-nlnet) julianharty@MacBook-Pro commercetest-nlnet % 
julianharty commented 5 months ago

There are a several approaches that have been proposed in Find the root of the git repository where the file lives. They include:

  1. examples that use GitPython - clean and simple if we rely on GitPython being installed and available. It should work well, i.e. be reliable, assuming that GitPython is mature. e.g. https://stackoverflow.com/a/48197188/340175
  2. examples that shell out to git - relies on git directly, again this should be reliable however it relies on having shell access. e.g. https://stackoverflow.com/a/53675112/340175
  3. examples that don't use git, instead they search for indications that a git repo is embedded in the current or parent folders. e.g. https://stackoverflow.com/a/67516092/340175
    1. There's also a generic version that could work for other source control options e.g. Mercurial. https://stackoverflow.com/a/43786287/340175 However, I'm not aware of any NLnet project that doesn't use git so that's not necessary for us at this stage.

To get something working I've decided to encapsulate the third of these, that uses pathlib. I've tested it and at least one of the ones that uses GitPython. Both worked, it doesn't rely on us requiring GitPython so reduces the dependencies slightly.

julianharty commented 4 months ago

The code also needs to be able to find other code modules we've created. https://stackoverflow.com/questions/7850908/what-exactly-should-be-set-in-pythonpath provides some possible approaches to addressing this dilemma e.g. using PYTHONPATH or using pip to install local packages. TBD what approach would work well for us and for this project.

tnzmnjm commented 4 months ago

Added the blocked below in the README:

After the nlnet repository has been cloned, two options are available for configuring the repository structure based on the intended use of the provided .tsv file:

Informative docstrings detailing the functionality and the supported command-line arguments have been provided at the beginning of each script. Additionally, the Scripts section in the README.md offers further guidance on using these arguments.

tnzmnjm commented 4 months ago
julianharty commented 4 months ago

Closed by https://github.com/commercetest/nlnet/commit/8370b5921ad9a801288b2c77b7c832a5baf86f9c