Hi Julian,

Could you please review the scripts and let me know if any changes are required? Please see the changes below:
Fix the 'repourl' values that start with 'http' rather than 'https'.
Filter out rows where the URL doesn't have a repository name (83 rows) before cloning the repos.
Revert the code to save the repos to the hard disk.
Realised that repo URLs ending with '/' were not cloned even though their test files were still counted; removed the trailing '/' and they are cloned now (see the URL cleanup sketch below).
Decouple the conditions for skipping repositories to handle test file counting and commit hash fetching independently.
Modify the processing loop to always attempt fetching the last commit hash, even if test file counting was previously completed.
Add checks to clone repositories only if they do not already exist, so that interruptions in script execution do not prevent subsequent data capture.
Save the DataFrame consistently after every processed batch to prevent data loss (see the processing-loop sketch below).
Update the docstring at the beginning of the script to reflect these changes.
Change the utils/export_to_rdf.py script to process the whole dataframe rather than line by line.
Add the capability to save the result in the .ttl (Turtle) RDF format (see the RDF export sketch below).
Add a get_base_repo_url function to extract the base repository URL from any GitHub link; it parses the URL and returns only the parts that point to the repository root (see the last sketch below).
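
For reference, roughly how the URL cleanup looks; the input file name and the 'repourl' column name below are placeholders, so the actual script may differ:

```python
import pandas as pd

# Rough sketch of the URL cleanup; "repos.csv" and the 'repourl' column
# name are placeholders and may not match the script exactly.
df = pd.read_csv("repos.csv")

# Upgrade plain 'http://' links to 'https://'.
df["repourl"] = df["repourl"].str.replace(r"^http://", "https://", regex=True)

# Drop a trailing '/' so URLs like 'https://github.com/owner/repo/' clone correctly.
df["repourl"] = df["repourl"].str.rstrip("/")

# Remove rows whose URL has no repository name after the owner
# (e.g. 'https://github.com/owner'); these are the 83 filtered rows.
has_repo_name = df["repourl"].str.count("/") >= 4
df = df[has_repo_name].reset_index(drop=True)
```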
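
The clone / commit-hash changes amount to something like the sketch below; the function name and the column names are illustrative, not the exact code in the script:

```python
import os
import subprocess
import pandas as pd

def process_batch(df: pd.DataFrame, start: int, batch_size: int,
                  clone_dir: str, out_csv: str) -> None:
    """Process one batch of repositories (illustrative only)."""
    for idx in range(start, min(start + batch_size, len(df))):
        url = str(df.at[idx, "repourl"])
        repo_path = os.path.join(clone_dir, url.rstrip("/").split("/")[-1])

        # Clone only if the repo is not already on disk, so a restart after an
        # interruption skips finished clones instead of failing or re-cloning.
        if not os.path.isdir(repo_path):
            subprocess.run(["git", "clone", url, repo_path], check=False)

        # Always attempt to fetch the last commit hash, even when the test
        # files for this repo were already counted in a previous run.
        result = subprocess.run(["git", "-C", repo_path, "rev-parse", "HEAD"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            df.at[idx, "last_commit_hash"] = result.stdout.strip()

    # Save after every batch so an interruption loses at most one batch of work.
    df.to_csv(out_csv, index=False)
```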
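
The RDF export now takes the whole dataframe in one call and writes Turtle via rdflib, along the lines of this sketch (the namespace, predicate names and column names are placeholders):

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/repo/")  # placeholder namespace

def dataframe_to_ttl(df: pd.DataFrame, out_path: str) -> None:
    """Serialise the whole dataframe to a Turtle (.ttl) file in one call."""
    g = Graph()
    g.bind("ex", EX)
    for row in df.itertuples(index=False):
        repo = URIRef(row.repourl)
        g.add((repo, EX.testFileCount, Literal(row.test_file_count)))
        g.add((repo, EX.lastCommitHash, Literal(row.last_commit_hash)))
    g.serialize(destination=out_path, format="turtle")
```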
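
And get_base_repo_url does roughly the following; this is a simplified version, not the exact implementation:

```python
from urllib.parse import urlparse

def get_base_repo_url(url: str) -> str:
    """Return the 'https://github.com/<owner>/<repo>' root of any GitHub link."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return ""  # the link does not contain a repository name
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"https://{parsed.netloc}/{owner}/{repo}"

# e.g. get_base_repo_url("https://github.com/owner/repo/blob/main/tests/test_x.py")
# returns "https://github.com/owner/repo"
```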
Thanks