Open RichardLitt opened 1 year ago
I'm currently searching the root of each repo for files called CITATION.* and other similar formats (full list here) (example). There's a Python library that can read and convert them: https://github.com/citation-file-format/cffconvert
Each repository I've scanned has a list of interesting files in the metadata field, including citation. Example: https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandialabs%2FpvOps
{
"uuid": "289032705",
"full_name": "sandialabs/pvOps",
"owner": "sandialabs",
"description": "A set of documented functions for supporting operations research of photovoltaic energy systems. ",
"archived": false,
"fork": false,
"pushed_at": "2023-10-05T21:45:42.000Z",
"size": 37105,
"stargazers_count": 11,
"open_issues_count": 21,
"forks_count": 9,
"subscribers_count": 3,
"default_branch": "master",
"last_synced_at": "2023-10-06T16:27:47.550Z",
"etag": null,
"topics": [],
"latest_commit_sha": null,
"homepage": "https://pvops.readthedocs.io/en/latest/",
"language": "Jupyter Notebook",
"has_issues": true,
"has_wiki": null,
"has_pages": null,
"mirror_url": null,
"source_name": null,
"license": "other",
"status": null,
"scm": "git",
"pull_requests_enabled": true,
"icon_url": "https://github.com/sandialabs.png",
"metadata": {
"files": {
"readme": "README.md",
"changelog": null,
"contributing": null,
"funding": null,
"license": "LICENSE",
"code_of_conduct": "CODE_OF_CONDUCT.md",
"threat_model": null,
"audit": null,
"citation": "citation.CIF",
"codeowners": null,
"security": null,
"support": null,
"governance": null
}
},
"created_at": "2020-08-20T14:48:48.000Z",
"updated_at": "2023-10-05T21:03:28.000Z",
"dependencies_parsed_at": "2023-09-28T23:56:15.789Z",
"dependency_job_id": null,
"html_url": "https://github.com/sandialabs/pvOps",
"commit_stats": {
"total_commits": 388,
"total_committers": 11,
"mean_commits": 35.27272727272727,
"dds": 0.6469072164948453,
"last_synced_commit": "fe6e6579239ab161908fc5b0d3819b720c161f3c"
},
"previous_names": [],
"tags_count": 11,
"repository_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandialabs%2FpvOps",
"tags_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandialabs%2FpvOps/tags",
"releases_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandialabs%2FpvOps/releases",
"manifests_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandialabs%2FpvOps/manifests",
"owner_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sandialabs",
"download_url": "https://codeload.github.com/sandialabs/pvOps/tar.gz/refs/heads/master",
"host": {
"name": "GitHub",
"url": "https://github.com",
"kind": "github",
"repositories_count": 163723659,
"owners_count": 8650178,
"icon_url": "https://github.com/github.png",
"host_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub",
"repositories_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories",
"repository_names_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names",
"owners_url": "https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"
}
}
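For anyone consuming this response, here is a minimal sketch of pulling the citation file name out of it. The helper name `citation_path` is made up for illustration, and the sample dict is just a trimmed copy of the response above; only the `metadata.files.citation` key path comes from the real payload.

```python
from typing import Optional


def citation_path(repo: dict) -> Optional[str]:
    """Return the detected citation file name, or None if there isn't one.

    `repo` is assumed to be the parsed JSON body of a repository lookup.
    The `or {}` guards cover a missing or null `metadata` / `files` value.
    """
    files = (repo.get("metadata") or {}).get("files") or {}
    return files.get("citation")


# Trimmed copy of the sandialabs/pvOps response shown above.
sample = {
    "full_name": "sandialabs/pvOps",
    "metadata": {
        "files": {
            "readme": "README.md",
            "changelog": None,
            "citation": "citation.CIF",
        }
    },
}

print(citation_path(sample))  # prints: citation.CIF
```

A null value (as with `changelog` here) means no file of that kind was detected, so callers should treat `None` as "no citation file" rather than as an error.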
This is great. When you say "I am currently searching", can you elaborate on what you mean?
I guess I mostly mean "detecting" rather than "searching".
For each repository that is discovered and analysed (currently at 170 million), I look for certain kinds of files that have specific meanings in open source software, like the readme, license, changelog, and code of conduct files.
That means every package in the packages service that references a repository should have the name/path of the citation file present in its metadata.
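As a rough illustration of the detection step described above (not the service's actual code), matching root-level file names against a set of patterns might look like the sketch below. The `PATTERNS` table and the `detect_files` helper are assumptions made up for this example; the real list of recognised names is the one linked earlier.

```python
import re
from typing import Dict, Iterable, Optional

# Hypothetical patterns, case-insensitive, any (or no) extension.
# These are illustrative guesses, not the service's real rules.
PATTERNS = {
    "readme": re.compile(r"^readme(\..+)?$", re.I),
    "license": re.compile(r"^(licen[sc]e|copying)(\..+)?$", re.I),
    "changelog": re.compile(r"^(changelog|changes|history)(\..+)?$", re.I),
    "code_of_conduct": re.compile(r"^code[-_]of[-_]conduct(\..+)?$", re.I),
    "citation": re.compile(r"^citation(\..+)?$", re.I),
}


def detect_files(root_entries: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each file kind to the first matching root-level name, else None."""
    found: Dict[str, Optional[str]] = {kind: None for kind in PATTERNS}
    for name in sorted(root_entries):  # sorted so the result is deterministic
        for kind, pattern in PATTERNS.items():
            if found[kind] is None and pattern.match(name):
                found[kind] = name
    return found


root = ["README.md", "LICENSE", "CODE_OF_CONDUCT.md", "citation.CIF", "setup.py"]
result = detect_files(root)
print(result["citation"])
```

Matching case-insensitively on the base name is what lets an unusual name like `citation.CIF` be picked up alongside the more common `CITATION.cff`.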
Right now, I don't believe the script is using citation.cff files at all, although it does check for them. It ought to do something with them.