commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License

Bug in logic in local processing when data/ has been cleaned of interim files #58

Closed by julianharty 1 month ago

julianharty commented 2 months ago

Context

After various updates to the codebase to improve the processing, there is a code path that causes the local script to fail: when csv_file_path doesn't exist.

The error is reported as:

~/NLnet-Projects/commercetest-nlnet$ python src/github_repo_request_local.py --keep-clones --clone-dir /media/julian/A/
2024-04-26 21:40:27.471 | INFO     | __main__:<module>:162 - Excluded file extensions: .txt, .md, .h, .xml, .html, .json, .png, .jpg, .md
2024-04-26 21:40:27.474 | INFO     | __main__:<module>:167 - repo_root is: /home/julian/NLnet-Projects/commercetest-nlnet
2024-04-26 21:40:27.474 | INFO     | __main__:<module>:169 - updated_csv_path is: /home/julian/NLnet-Projects/commercetest-nlnet/data/updated_local_github_df_test_count.csv
2024-04-26 21:40:27.475 | ERROR    | __main__:<module>:185 - CSV file not found at /home/julian/NLnet-Projects/commercetest-nlnet/data/original_github_df.csv.
Traceback (most recent call last):
  File "/home/julian/NLnet-Projects/commercetest-nlnet/src/github_repo_request_local.py", line 187, in <module>
    if "last_commit_hash" not in df.columns:
NameError: name 'df' is not defined

The relevant logic, at commit hash https://github.com/commercetest/nlnet/commit/63d716accacc6801c635332a1c28ae88fef0efa2, is:

    else:
        csv_file_path = repo_root / input_file

        if csv_file_path.exists():
            df = pd.read_csv(csv_file_path)
            df["testfilecountlocal"] = -1  # Initialise if first run
        else:
            logger.error(f"CSV file not found at {csv_file_path}.")

    if "last_commit_hash" not in df.columns:
    ...

The final else branch in this code snippet logs an 'error' (which isn't necessarily an error from a user's perspective) and then execution continues. Because the DataFrame df was never created, the program exits with the runtime error shown in the traceback.
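The failure mode can be reproduced without pandas or any files. The following is a minimal sketch, not the project's code: load_table and demo are hypothetical names, and a plain list stands in for the DataFrame. Inside a function Python raises UnboundLocalError, a subclass of the NameError that the module-level script reports.

```python
def load_table(csv_exists: bool) -> list:
    """Mimic the control flow from the issue; a list stands in for the DataFrame."""
    if csv_exists:
        df = ["row"]  # stands in for pd.read_csv(csv_file_path)
    else:
        # The message is emitted, but nothing stops execution or assigns df.
        print("CSV file not found")
    return df  # reached on both branches


def demo() -> str:
    """Return the name of the exception raised when the CSV is missing."""
    try:
        load_table(csv_exists=False)
    except NameError as exc:  # UnboundLocalError is a subclass of NameError
        return type(exc).__name__
    return "no error"
```

Calling demo() shows that logging alone does not protect the code that later uses df; the missing-file branch has to either create df or stop.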

This script should be able to run when none of the intermediate/working files exist, assuming python utils/initial_data_preparation.py has been run (which it has been). Let's enhance the final else so that it creates a suitable DataFrame, presumably using data/original.csv that was created by the initial data preparation script.

tnzmnjm commented 2 months ago
tnzmnjm commented 1 month ago

Integrated data preparation automation into the src/github_repo_request_local.py script: it now checks for the existence of the input DataFrame before reading it. If the DataFrame doesn't exist, it executes the utils/initial_data_preparation.py script to perform data preparation.
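One way such automation could look is sketched below. This is an illustration under assumptions, not the code from the PR: ensure_input_csv is a hypothetical helper, and it assumes the preparation script can be run with no arguments and writes the expected CSV.

```python
import subprocess
import sys
from pathlib import Path


def ensure_input_csv(csv_file_path: Path, prep_script: Path) -> Path:
    """Run the preparation script if the input CSV is missing (hypothetical helper)."""
    if not csv_file_path.exists():
        # Use the current interpreter; check=True raises CalledProcessError
        # if the preparation script exits with a non-zero status.
        subprocess.run([sys.executable, str(prep_script)], check=True)
    if not csv_file_path.exists():
        # The script ran but did not produce the file we need.
        raise FileNotFoundError(f"{csv_file_path} was not created by {prep_script}")
    return csv_file_path
```

The second existence check keeps the caller from reading a file that the preparation step failed to produce.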

Related to PR #76

tnzmnjm commented 1 month ago

I have reverted the logic for the case where no CSV is found. The script now logs a message asking for utils/initial_data_preparation.py to be run first.
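A fail-fast version of that behaviour might look like the sketch below. It is an assumption about the shape of the change, not the committed code: require_input_csv is a hypothetical name, the exit code is chosen for illustration, and the standard library logging module stands in for the script's own logger.

```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def require_input_csv(csv_file_path: Path) -> Path:
    """Return the path if it exists; otherwise log guidance and stop (hypothetical helper)."""
    if not csv_file_path.exists():
        logger.error(
            "CSV file not found at %s. Run utils/initial_data_preparation.py first.",
            csv_file_path,
        )
        sys.exit(1)  # stop here so later code never touches an undefined df
    return csv_file_path
```

Exiting immediately after the log message means the later check on df.columns is only reached when the DataFrame has actually been loaded.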