tnzmnjm closed 2 months ago
Working on the gitlab.com domain --> In GitLab there can be subgroups before the name of the repo, e.g. https://gitlab.com/technostructures/kazarma/kazarma
For these scenarios I will need to change the function get_base_repo_url to return the correct output.
atm get_base_repo_url('https://gitlab.com/technostructures/kazarma/kazarma') returns 'https://gitlab.com/technostructures/kazarma', which drops the repo name.
For codeberg there are about 8 repourls with an owner but no specific repo, although when I open the repourl I can see lots of repos there --> https://codeberg.org/interpeer or https://codeberg.org/forgejo
Repos on the code hosting platform [invent.kde.org](http://invent.kde.org/) do not need any changes in the code. They all have owner/repo.
framagit.org platform: only one repourl doesn't have a specific repo: https://framagit.org/incommon.cc. When I checked the website, it has a "Subgroups and projects" section. All of them seem to be related to the nlnet project.
For git.savannah.gnu.org it's interesting, as the majority of repourls end with .git, which atm is being removed from the repourl to get the base repourl. This needs to be fixed: get_base_repo_url('https://git.savannah.gnu.org/cgit/mes.git') returns https://git.savannah.gnu.org/cgit/mes, which is not correct.
Amended the function to handle these cases:
--> gitlab --> get_base_repo_url('https://gitlab.com/technostructures/kazarma/kazarma') returns the correct url: 'https://gitlab.com/technostructures/kazarma/kazarma'
--> codeberg --> get_base_repo_url('https://codeberg.org/interpeer') returns None
--> git.savannah.gnu.org --> is now returning the correct repourl: get_base_repo_url('https://git.savannah.gnu.org/cgit/mes.git') --> 'https://git.savannah.gnu.org/cgit/mes.git'
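A minimal sketch of what the amended function might look like, to make the three cases above concrete. This is not the actual implementation from the repo: the `extra_subgroup_hosts` list, the "host starts with `gitlab.`" rule and the cut at GitLab's `-` path segment are assumptions made for illustration; only `get_base_repo_url` and `hosts_with_mandatory_git_suffix` are names from these notes.

```python
from urllib.parse import urlparse

# Hosts from the notes where the trailing ".git" is part of the repo URL.
hosts_with_mandatory_git_suffix = ["git.savannah.gnu.org", "git.torproject.org"]

# Hypothetical helper list: GitLab instances whose host name
# does not start with "gitlab." (framagit.org runs GitLab).
extra_subgroup_hosts = ["framagit.org"]


def get_base_repo_url(repo_url):
    """Return the base repo URL, or None when the URL names no repo."""
    parsed = urlparse(repo_url)
    host = parsed.netloc.lower()
    segments = [part for part in parsed.path.split("/") if part]

    if host.startswith("gitlab.") or host in extra_subgroup_hosts:
        # Assumption: GitLab allows nested subgroups, so keep every segment
        # up to the "-" separator that precedes non-repo pages.
        if "-" in segments:
            segments = segments[: segments.index("-")]
    else:
        # Default owner/repo layout: keep the first two segments only.
        segments = segments[:2]

    # Owner-only URLs such as https://codeberg.org/interpeer name no repo.
    if len(segments) < 2:
        return None

    # Strip ".git" unless the host needs it to stay in place.
    if host not in hosts_with_mandatory_git_suffix and segments[-1].endswith(".git"):
        segments[-1] = segments[-1][: -len(".git")]

    return f"{parsed.scheme}://{host}/{'/'.join(segments)}"
```

With this sketch, the GitLab URL with a subgroup is returned unchanged, the Codeberg owner-only URL gives None, and the Savannah URL keeps its .git suffix.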
continuing with other hosting platforms
git.sr.ht
--> do not need any changes in the code. They all have owner/repo
pagure.io
--> do not need any changes in the code. They all have owner/repo
notabug.org
--> do not need any changes in the code. They all have owner/repo
git.irde.st
--> do not need any changes in the code. They all have owner/repo - Same repo for 3 nlnet pages
devel.ag-projects.com
--> do not need any changes in the code. They all have owner/repo
code.briarproject.org
--> do not need any changes in the code. They all have owner/repo
source.symbolic.software
--> do not need any changes in the code. They all have owner/repo
savannah.gnu.org
--> do not need any changes in the code. They all have owner/repo
gitlab.gnome.org
--> do not need any changes in the code. They all have owner/repo
gitlab.shinice.net
--> do not need any changes in the code. They all have owner/repo
git.torproject.org
--> added "git.torproject.org" to hosts_with_mandatory_git_suffix:
hosts_with_mandatory_git_suffix = ["git.savannah.gnu.org", "git.torproject.org"]
gitlab.torproject.org
--> it has subgroups and the repo name comes last: https://gitlab.torproject.org/tpo/network-health/sbws
and sbws should not be removed --> amended the code accordingly
gitlab.lip6.fr
--> do not need any changes in the code. They all have owner/repo
https://git.marginalia.nu
--> has been moved to GitHub: https://github.com/MarginaliaSearch/MarginaliaSearch
*** http://cgi.repo.hu/cgi-bin/minisvn.cgi?cmd=browse&repo=sch-rnd&path=trunk
--> Not sure how we can clone this
git.taler.net
--> added git.taler.net to hosts_with_mandatory_git_suffix
gitlab on its own has 41 repos, but I found gitlab.shinice.net: 2 and gitlab.gnome.org: 2, which can be merged into the gitlab df if required
Expanded test coverage for get_base_repo_url with multi-platform scenarios:
- GitLab and Framagit, validating the correct parsing of complex repository structures.
- Codeberg and Framagit, checking the function's response to URLs that do not specify a repository.
- .git suffix in the repository URL, ensuring the function does not incorrectly strip essential parts of the URL.
hyperglitch.com
--> does not need any changes in the code. They all have owner/repo
gti.telent.net
--> does not need any changes in the code. They all have owner/repo
code.mro.name
--> does not need any changes in the code. They all have owner/repo
git.deuxfleurs.fr
--> does not need any changes in the code. They all have owner/repo
hg.sr.ht
--> does not need any changes in the code. They all have owner/repo
hydrillabugs.koszko.org
--> the repourl is https://hydrillabugs.koszko.org/projects/haketilo/repository/ which gives a 403 error: "Forbidden. You don't have permission to access this resource." Added this domain to 'direct_path_platforms'
tahoe-lafs.org
--> the repourl goes to the README. https://lumosql.org/src/lumosql/doc/trunk/README.md
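Going back to the expanded test coverage for get_base_repo_url noted above, a table-driven check is one way such scenarios could be written. This is only a sketch: `find_failures` and `CASES` are hypothetical names, and the expected values are the ones already recorded in these notes (GitLab subgroup URL kept whole, Codeberg owner-only URL giving None, Savannah keeping its .git suffix).

```python
def find_failures(fn, cases):
    """Return (url, expected, actual) for every case where fn(url) != expected."""
    failures = []
    for url, expected in cases:
        actual = fn(url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


# (input repourl, expected base repourl) pairs taken from the notes.
CASES = [
    (
        "https://gitlab.com/technostructures/kazarma/kazarma",
        "https://gitlab.com/technostructures/kazarma/kazarma",
    ),
    ("https://codeberg.org/interpeer", None),
    (
        "https://git.savannah.gnu.org/cgit/mes.git",
        "https://git.savannah.gnu.org/cgit/mes.git",
    ),
]
```

Calling `find_failures(get_base_repo_url, CASES)` should give an empty list when the function behaves as described; any tuple in the result shows which platform regressed.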
git.gnunet.org
--> There are several repourls:
https://git.gnunet.org/gnunet.git
https://git.gnunet.org/
https://git.gnunet.org/messenger-gtk.git/
https://git.gnunet.org/gnunet.git/log/?h=dev/thejackimonster/messenger
Not sure how it would get cloned.
git.disroot.org
--> There are 2 repourls; one of them doesn't point to a specific repo but to the list of repos: https://git.disroot.org/Lacre
source.mntmn.com
--> does not need any changes in the code. They all have owner/repo
hosted.weblate.org
--> the repourl doesn't take us to the repo. Checking the nlnet website, the repo is https://invent.kde.org/network/kaidan
igor2.repo.hu
--> repourl goes to the whole project, not a specific repo
bugs.otr.im
--> does not need any changes in the code. They all have owner/repo
repology.org
--> does not need any changes in the code. They all have owner/repo
www-soc.lip6.fr
--> When opening the repourl it redirects from https://www-soc.lip6.fr/equipe-cian/logiciels/coriolis/ to https://largo.lip6.fr/equipe-cian/logiciels/coriolis/. On the nlnet webpage the repourl is different: https://gitlab.lip6.fr/vlsi-eda/coriolis. This correct one is in another line in the original df, line 186.
git.law
--> does not need any changes in the code. They all have owner/repo
leap.se
--> repourl https://leap.se/en/source gives "404 Not Found. The requested URL was not found on this server."
2019-04-097, https://nlnet.nl/project/bitmask, https://0xacab.org/kali/bitmask-vpn --> Only this one is the repo
2019-04-097, https://nlnet.nl/project/bitmask, https://www.transifex.com/otf/bitmask-android/ --> 404 Not Found
2019-04-097, https://nlnet.nl/project/bitmask, https://leap.se/en/source --> 404 Not Found
salsa.debian.org
--> does not need any changes in the code. They all have owner/repo
lab.nexedi.com
--> does not need any changes in the code. They all have owner/repo
scm.cwi.nl
--> does not need any changes in the code. They all have owner/repo
gitlab.uni.lu
--> does not need any changes in the code. They all have owner/repo
redmine.replicant.us
--> the second one takes us to Replicant's source code. Not sure how it will get cloned. --> Adding this to direct_path_platforms
2019-02-115,https://nlnet.nl/project/Replicant-graphics,https://redmine.replicant.us/projects/replicant/wiki/Tasks_funding#Graphics-acceleration
2019-02-115,https://nlnet.nl/project/Replicant-graphics,https://git.replicant.us/replicant
gerrit.osmocom.org
--> takes us to the repo. Not sure how it will get cloned. Added this domain to direct_path_platforms
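These notes do not say how direct_path_platforms is consumed. One plausible reading, sketched below purely as an assumption, is that URLs on those hosts skip owner/repo parsing and are passed through unchanged. `resolve_base_url` and its `parse_owner_repo` argument are hypothetical names; the three hosts are the ones the notes say were added to the list.

```python
from urllib.parse import urlparse

# Hosts the notes mention adding to direct_path_platforms.
direct_path_platforms = [
    "hydrillabugs.koszko.org",
    "redmine.replicant.us",
    "gerrit.osmocom.org",
]


def resolve_base_url(repo_url, parse_owner_repo):
    """Pass direct-path hosts through untouched; parse everything else.

    parse_owner_repo is whatever function does the usual owner/repo
    parsing (for example get_base_repo_url).
    """
    host = urlparse(repo_url).netloc.lower()
    if host in direct_path_platforms:
        # Assumed behaviour: keep the repourl exactly as listed in the df.
        return repo_url
    return parse_owner_repo(repo_url)
```

Whether such URLs can actually be cloned is still open, as noted above; the sketch only shows how the list could gate the parsing step.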
git.europalab.com
--> does not need any changes in the code. They all have owner/repo
gitlab.coko.foundation
--> 2 projectrefs, 1 repourl
code.tvl.fyi
--> different projectref, same repourl --> does not need any changes in the code. They all have owner/repo
projectref ... repourl
361 2021-04-061 ... https://code.tvl.fyi/tree/tvix
616 2023-08-262 ... https://code.tvl.fyi/tree/tvix
source.mnt.re
--> 2 different nlnet pages, same repo --> does not need any changes in the code. They all have owner/repo
projectref ... repourl
271 2020-06-050 ... https://source.mnt.re/reform/reform
577 2023-02-044 ... https://source.mnt.re/reform/reform
code.podlibre.org
--> 2 different nlnet pages, same repourl --> does not need any changes in the code. They all have owner/repo
projectref ... repourl
229 2020-02-089 ... https://code.podlibre.org/podlibre/castopod/
607 2023-04-128 ... https://code.podlibre.org/podlibre/castopod/
git.fairkom.net
--> does not need any changes in the code. They all have owner/repo
git.zx2c4.com
--> There are 4 rows; only one goes to the repo, which has branches. Not sure how it would clone
projectref ... repourl
76 2019-02-167 ... https://git.zx2c4.com/wireguard-linux/tree/dri --> path not found
124 2019-04-121 ... https://git.zx2c4.com/wireguard-windows --> This goes to the repo
125 2019-04-121 ... https://git.zx2c4.com/wireguard-windows/about/
160 2019-08-026 ... https://git.zx2c4.com/wireguard-rs/utils/initial_data_preparation.py
Also added 2 columns to original_df: repodomain and base_repo_url. Saved this dataframe as original_massive_df.
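As a rough illustration of the two new columns, the sketch below derives repodomain and base_repo_url for each row. To keep it runnable without pandas it uses plain dicts in place of the DataFrame; with pandas the same idea would typically be `df["repourl"].apply(...)`. `add_repo_columns` and `base_url_fn` are hypothetical names; `repourl`, `projectref`, `repodomain` and `base_repo_url` are the column names used in these notes.

```python
from urllib.parse import urlparse


def add_repo_columns(rows, base_url_fn):
    """Return copies of the rows with 'repodomain' and 'base_repo_url' added.

    rows: list of dicts, each with a 'repourl' key (stand-in for original_df).
    base_url_fn: function mapping a repourl to its base URL,
                 for example get_base_repo_url.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row["repodomain"] = urlparse(row["repourl"]).netloc.lower()
        new_row["base_repo_url"] = base_url_fn(row["repourl"])
        enriched.append(new_row)
    return enriched
```

The repodomain value is what later lets repos be grouped per hosting platform.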
Enhanced repository cloning and error logging functionality:
Introduced several improvements to the script responsible for cloning GitHub repositories and managing errors during the process.
Key Changes:
Directory Structure by Domain: Modified the cloning path to create subdirectories based on the 'repodomain' column from the input DataFrame.
Enhanced Error Handling:
data/error_log.txt
). DataFrame Updates:
Details:
This commit is related to the issues #61 and #56
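A hedged sketch of how the per-domain directory layout and error logging described in this commit message could fit together. It is not the script from the commit: `clone_destination`, `clone_repo`, the `data/repos` base directory and the tab-separated log format are assumptions; only the 'repodomain' column and the data/error_log.txt path come from the notes. The `run` parameter exists so the git call can be swapped out when testing.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def clone_destination(base_dir, repodomain, base_repo_url):
    """Build <base_dir>/<repodomain>/<path segments joined by '_'>."""
    segments = [part for part in urlparse(base_repo_url).path.split("/") if part]
    return Path(base_dir) / repodomain / "_".join(segments)


def clone_repo(base_repo_url, repodomain, base_dir="data/repos",
               error_log="data/error_log.txt", run=subprocess.run):
    """Clone into a per-domain subdirectory; log failures and return False."""
    dest = clone_destination(base_dir, repodomain, base_repo_url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    try:
        run(["git", "clone", base_repo_url, str(dest)],
            check=True, capture_output=True, text=True)
        return True
    except (subprocess.CalledProcessError, OSError) as exc:
        # Append one line per failed clone so a long run keeps going.
        log_path = Path(error_log)
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(f"{base_repo_url}\t{exc}\n")
        return False
```

Joining all path segments for the directory name avoids collisions between repos that share a name under different owners or subgroups on the same domain.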
Ensure Cross-Platform Compatibility: Verify and adjust the current logic used for parsing owner/repository information from GitHub repositories to ensure compatibility with other repository hosting platforms.