LLNL / scraper

Python library for getting metadata from source code hosting tools
MIT License

Fix GitHub repository cloning #63

Closed: mcdonnnj closed 2 years ago

mcdonnnj commented 2 years ago

This pull request fixes GitHub repository cloning so that it is functional again.

Per this blog post, and this blog post for background, GitHub has disabled the unencrypted Git protocol for cloning repositories. Because this project still uses git:// URLs to clone GitHub repositories, those clones now fail. The failure is also silent: unless SLOC/labor_hours are checked, the run completes with no error output.

The repository cloning bug is resolved by using the repository.clone_url attribute to get an HTTPS URL for cloning. Additionally, the return code of commands run with the scraper.util.execute() function is now checked, and an error message is logged if it is not 0. Another logging message was changed from logging.debug() to logging.error() to reflect that a failure there impacts core functionality.
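
For readers who want the shape of the two changes, here is a minimal sketch under stated assumptions. It is not the code from this pull request: FakeRepository, run_command() and clone_command() are hypothetical names invented for illustration, and the real scraper.util.execute() may differ in signature and behavior. The error message format is modeled on the log output shown further down.

```python
import logging
import subprocess
import sys
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class FakeRepository:
    """Hypothetical stand-in for the repository object from the GitHub client."""

    git_url: str  # assumed attribute name; a git:// URL, no longer served by GitHub
    clone_url: str  # HTTPS URL, the attribute this pull request switches to


def run_command(command):
    """Run a command and log an error when its return code is not 0.

    Illustrative only; not the actual scraper.util.execute() implementation.
    """
    process = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if process.returncode != 0:
        # Surface the failure instead of silently continuing with empty output.
        logger.error(
            "Error Executing: command=%s, returncode=%d",
            " ".join(command),
            process.returncode,
        )
    return process.stdout, process.returncode


def clone_command(repository, target_dir):
    """Build a shallow-clone command that uses the HTTPS clone_url."""
    return ["git", "clone", "--depth=1", repository.clone_url, target_dir]


if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)s - %(levelname)s: %(message)s")
    # A deliberately failing command, to show the error being logged.
    run_command([sys.executable, "-c", "import sys; sys.exit(128)"])
```

Returning the return code alongside the output lets a caller decide whether to skip the SLOC step entirely after a failed clone, rather than reporting zeros.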

Running against the main branch:

$ scraper --config scraper.json
2022-08-19 14:18:20,668 - INFO: Connected to: https://github.com
2022-08-19 14:18:21,038 - INFO: Processing: cisagov/pshtt
2022-08-19 14:20:34,240 - INFO: SLOC: 0
2022-08-19 14:20:34,240 - INFO: labor_hours: 0
2022-08-19 14:20:34,707 - INFO: Number of Projects: 1
2022-08-19 14:20:34,707 - INFO: Writing output to: code.json

Running against this branch without the repository URL fix:

$ scraper --config scraper.json
2022-08-19 14:21:06,586 - INFO: Connected to: https://github.com
2022-08-19 14:21:06,990 - INFO: Processing: cisagov/pshtt
2022-08-19 14:23:17,839 - ERROR: Error Executing: command=git clone --depth=1 git://github.com/cisagov/pshtt.git /tmp/tmpqsep80l1/clone-dir, returncode=128
2022-08-19 14:23:18,041 - ERROR: Error Decoding: url=git://github.com/cisagov/pshtt.git, out=b'\n1 error:\nUnable to read:  /tmp/tmpqsep80l1/clone-dir\n'
2022-08-19 14:23:18,042 - INFO: SLOC: 0
2022-08-19 14:23:18,042 - INFO: labor_hours: 0
2022-08-19 14:23:18,471 - INFO: Number of Projects: 1
2022-08-19 14:23:18,473 - INFO: Writing output to: code.json

Running against this branch:

$ scraper --config scraper.json
2022-08-19 14:18:00,587 - INFO: Connected to: https://github.com
2022-08-19 14:18:00,979 - INFO: Processing: cisagov/pshtt
2022-08-19 14:18:01,821 - INFO: SLOC: 2380
2022-08-19 14:18:01,822 - INFO: labor_hours: 1159
2022-08-19 14:18:02,208 - INFO: Number of Projects: 1
2022-08-19 14:18:02,208 - INFO: Writing output to: code.json

IanLee1521 commented 2 years ago

Thanks @mcdonnnj !