Open julianharty opened 9 months ago
As per our discussion, I will start working with the GitHub REST API to:
I will be limiting the domain to github.com.
repo_request.py
- script for finding the number of files that include "test" in their names within a GitHub repository.
github_df
'X-GitHub-Api-Version': '2022-11-28'
search_query = f'test in:path repo:{repo_path}'
payload = {'q': search_query, 'type': 'code'}
search_url = 'https://api.github.com/search/code'
response = requests.get(search_url, headers=headers, params=payload)
The output is:
Number of "test" files found: 2
File Name: test_unused_attr.c, Path: cmake/tests/test_unused_attr.c
File Name: test_format_attr.c, Path: cmake/tests/test_format_attr.c
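The request and output above can be sketched as two small helpers. This is a minimal sketch, not the actual contents of repo_request.py: the function names `build_search_params` and `summarise_results` are hypothetical, and the sample body only mirrors the two files listed above so that the example runs without network access.

```python
def build_search_params(repo_path):
    """Build query parameters for a code search limited to one repository."""
    # Same query as above: files whose path contains "test".
    return {"q": f"test in:path repo:{repo_path}"}


def summarise_results(body):
    """Return (total_count, [(name, path), ...]) from a decoded search response."""
    files = [(item["name"], item["path"]) for item in body.get("items", [])]
    return body.get("total_count", len(files)), files


# Sample body shaped like a search response, filled with the two files shown above.
sample_body = {
    "total_count": 2,
    "items": [
        {"name": "test_unused_attr.c", "path": "cmake/tests/test_unused_attr.c"},
        {"name": "test_format_attr.c", "path": "cmake/tests/test_format_attr.c"},
    ],
}

total, files = summarise_results(sample_body)
print(f'Number of "test" files found: {total}')
for name, path in files:
    print(f"File Name: {name}, Path: {path}")
```

In a real run, `sample_body` would be replaced by `response.json()` from the `requests.get` call shown earlier.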
Considerations:
repo_request.py
feature-count-test-files
, pushed the changes to the remote repo
search_query = f'test in:path -filename:.txt -filename:.md repo:{repo_path}'
)
Decisions need to be made regarding the edge cases below:
There are some repourl links that have only the username but no repo name. Example: https://github.com/nixcloud
→ The script result is 0. One solution might be to go through these cases and find the repo related to the NLnet project (for which we have the links). I also checked whether removing the ‘ / ’ might have caused this problem, but that is not the case.
Username change → some usernames have changed (warner/magic-wormhole
changed to magic-wormhole/magic-wormhole
) → the script result is 0. A solution could be to inspect the redirect message in the response and, if required, update the username.
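One possible way to handle both cases is to ask the API for the repository itself before searching. This is a rough sketch under stated assumptions: `resolve_repo_path` and the `fetch_json` callable are hypothetical names, and it assumes that a request for a renamed repository is redirected to its new location and that the returned repository object carries its current `full_name`.

```python
def resolve_repo_path(repo_path, fetch_json):
    """Return the repository's current 'owner/name', or None if it cannot be resolved.

    fetch_json is a hypothetical callable: URL -> (status_code, decoded JSON body).
    With requests it could be written as:
        r = requests.get(url, headers=headers)  # GET follows redirects by default
        return r.status_code, r.json()
    """
    parts = [p for p in repo_path.split("/") if p]
    if len(parts) != 2:
        return None  # owner-only link: needs manual follow-up
    status, body = fetch_json(f"https://api.github.com/repos/{parts[0]}/{parts[1]}")
    if status != 200:
        return None
    return body.get("full_name")


# Stand-in for the network call, using the rename mentioned above.
def fake_fetch_json(url):
    if url.endswith("/repos/warner/magic-wormhole"):
        return 200, {"full_name": "magic-wormhole/magic-wormhole"}
    return 404, {"message": "Not Found"}


print(resolve_repo_path("warner/magic-wormhole", fake_fetch_json))
print(resolve_repo_path("nixcloud", fake_fetch_json))
```

A `None` result flags the row for manual review instead of silently recording a count of 0.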
excluding specific extension file formats from the result:
Couldn’t find a reason why certain files were in the script result:
User used some tools for testing:
milestones/M3/M3.md
- Ran the LDP test Suite produced by W3C WG
Other considerations:
vulnerabilities/tests/test_data/gitlab/gem.yaml
test
manually, I found this message: This repository's code is being indexed right now. Try again in a few minutes.
for repo:turkmanovic/OpenEPT
. I’m not sure how long the indexing took, but when I checked after a couple of hours and ran the script again, we both found 0 results. We might have to put some measures in place to handle these scenarios, such as making sure we run our script again for that particular repo (maybe for all the repos that return 0 results).
get_test_file_count
to resolve the pagination problem. → After applying the function to the 569 rows, I can see that the returned total number of test files is 240, but I’m getting a message: Failed to search, status code: 403 Response text: {"message":"API rate limit exceeded for user ID ..}
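For the pagination part, search responses carry a `total_count` field alongside `items`, and the page size can be raised from the default of 30 to at most 100 with `per_page`, which reduces the number of requests needed per repository. The following is a minimal sketch, not the actual `get_test_file_count`: `collect_search_items` and `fetch_page` are hypothetical names, and the fake fetcher with 250 results is invented purely so the example runs offline.

```python
import math

MAX_SEARCH_RESULTS = 1000  # the search API exposes at most 1,000 results per query


def collect_search_items(fetch_page, per_page=100):
    """Walk the result pages and return (total_count, items).

    fetch_page is a hypothetical callable: (page, per_page) -> decoded JSON body,
    e.g. a wrapper around session.get(..., params={"page": page, "per_page": per_page}).
    """
    first = fetch_page(1, per_page)
    total = first["total_count"]
    items = list(first["items"])
    last_page = math.ceil(min(total, MAX_SEARCH_RESULTS) / per_page)
    for page in range(2, last_page + 1):
        items.extend(fetch_page(page, per_page)["items"])
    return total, items


# Offline stand-in for the API: pretends a repository has 250 matching files.
def fake_fetch_page(page, per_page, total=250):
    start = (page - 1) * per_page
    stop = min(start + per_page, total)
    return {
        "total_count": total,
        "items": [{"name": f"test_{i}.py"} for i in range(start, stop)],
    }


total, items = collect_search_items(fake_fetch_page)
print(total, len(items))
```

If only the number of files is needed, reading `total_count` from the first page avoids the extra requests altogether, which also helps with the rate limit.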
Requests-Ratelimiter
module to resolve this problem
print
commands with loguru
src
and data
repo_request.py
to the src
directory and the csv files in the data
X-RateLimit-Limit:
was not showing any results, modified the logger.info
's parameter to be one string
Requests-Ratelimiter
package (LimiterSession(per_second=5))
I still got 30 results
session = LimiterSession(per_second=0.2)
search_url = f"https://api.github.com/search/code?q=test+in:path+-filename:.txt+-filename:.md+-filename:.html+-filename:.xml+-filename:.json+repo:{repo_path}&page={page}"
LimiterSession(per_second=0.1)
make_github_request(search_url=search_url, session=session, headers=headers)
from the get_test_file_count(repo_path, headers)
LimiterSession(per_minute=10)
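A fixed request rate can be complemented by reading the rate-limit headers on a rejected response to decide how long to pause. The sketch below uses only the standard library and is separate from the Requests-Ratelimiter package. `seconds_until_retry` is a hypothetical helper, and the 60-second fallback is an assumption for the case where the response gives no hint.

```python
import math
import time


def seconds_until_retry(status_code, headers, now=None):
    """Return how many seconds to wait before retrying, or 0 if no wait is indicated."""
    if status_code not in (403, 429):
        return 0
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise them for a plain dict.
    lowered = {key.lower(): value for key, value in headers.items()}
    if "retry-after" in lowered:
        return int(lowered["retry-after"])
    if lowered.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in lowered:
        # x-ratelimit-reset is a UTC epoch timestamp in seconds.
        return max(0, math.ceil(int(lowered["x-ratelimit-reset"]) - now))
    return 60  # assumed fallback when the response carries no hint


print(seconds_until_retry(403, {"Retry-After": "30"}))
print(seconds_until_retry(200, {}))
```

A caller would sleep for the returned number of seconds and then repeat the same request, so that a 403 pauses the run rather than stopping it.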
https://nlnet.nl/project/AccessibleSecurity,https://github.com/osresearch/heads/issues/540,-1
index_of_github = parts.index('github.com')
--> repo_path = '/'.join(parts[index_of_github + 1: index_of_github + 3])
Analysing repo http://www.github.com/asicsforthemasses extract_reponame_and_username:45 - ['http:', '', 'www.github.com', 'asicsforthemasses']
/(?<=github.com/)([a-z-_.0-9]+)/?([a-z-_]+)
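The `parts.index('github.com')` lookup raises a `ValueError` when the host is written as `www.github.com`, as in the log line above. As an alternative to the regular expression, the URL can be parsed with the standard library. This is a minimal sketch: `extract_repo_path` is a hypothetical name, and returning `None` for owner-only links is an assumption about how such rows should be flagged.

```python
from urllib.parse import urlparse


def extract_repo_path(url):
    """Return 'owner/repo', or None when the URL is not a GitHub repository link."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) < 2:
        return None  # owner only, no repository name
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


for url in (
    "https://github.com/osresearch/heads/issues/540",
    "http://www.github.com/asicsforthemasses",
    "https://github.com/nixcloud",
):
    print(url, "->", extract_repo_path(url))
```

Anything after the second path segment (such as `/issues/540`) is ignored, so issue and file links resolve to their repository.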
['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'projectref', 'nlnetpage', 'repourl', 'testfilecount']
. This happened because my script stopped and I started it again whenever I faced problems, and each time, pandas
added an index column. I'm removing them from the df and saving this dataframe in '../data/github_df_test_count.csv'
Context
Approximately 60% of the projects sponsored by NLnet are currently hosted on github.com. GitHub also provides mature query mechanisms, so it is likely to be a useful early iteration that provides insight and feedback on our proposed objective of assessing the testing and automated tests performed by the project teams for these projects.
Further info
The wiki on this repo provides heuristics and some notes on querying GitHub using URL query parameters https://github.com/commercetest/nlnet/wiki