LLNL / scraper

Python library for getting metadata from source code hosting tools

Support scraping Subversion #35

Open leebrian opened 5 years ago

leebrian commented 5 years ago

Being able to scrape Subversion projects would be helpful, and it isn't yet supported. It's a pretty low priority for my agency, but you requested that we add issues for examples of repos that aren't yet supported.

IanLee1521 commented 5 years ago

Hmm. Yeah, pretty low for me too. I'm trying to think how we might do this for arbitrary SVN repos (i.e., where we'd get the metadata itself).

Are there specific SVN hosting tools that we would specifically want / need to target?

I wonder if we can get a list of all of the repositoryURLs from code.gov (cc/ @RicardoAReyes) to try to find the hosting platforms to target...? Guess that is more justification for #29 ;)

leebrian commented 5 years ago

I think we have about 100-200 projects or so, but I haven't counted yet since no one is really asking internally. And since they aren't scraped properly, we can't determine whether they're excludable, so it's a vicious cycle: people can't find them.

I'm not sure what hosting tools to target. I was reading through the SVN book's API chapter, and it seems like we could crawl using the svn client to check out every directory and then go through it to find history, comments, and maybe enough metadata. I haven't looked at it since then because it seemed like a decent amount of boring work digging into SVN history files and such.

I tried checking all the repos, but https://api.code.gov/repos?size=10000 only returned 1000 of the reported 6565 repos. None of those thousand had Subversion; they were all vcs=git.

gmkarl commented 4 years ago

Hi, I found the help-wanted tag for this issue on code.gov.

You can see more than 1000 repos at once by passing '&from=[start]' to the code.gov API. I used the node.js API module to check the vcs field for all of them: 2496 don't have a vcs field, 200 have an empty string, 1 has 'zip', and the other 3863 are all some form of 'git'. That's all of them.

Here's a list of all the repository urls by repository ID: repository_ids_and_urls.txt
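For reference, paging through the API looks roughly like the Python sketch below (untested here; the api_key requirement and the 'repos' response key are assumptions on my part, since I actually used the node.js module):

```python
import itertools
from collections import Counter

import requests

API_BASE = "https://api.code.gov/repos"
API_KEY = "YOUR_API_KEY"  # placeholder; assumes api.code.gov wants a key
PAGE_SIZE = 500

def iter_repos():
    """Yield every repo record, paging with size/from."""
    for start in itertools.count(step=PAGE_SIZE):
        resp = requests.get(
            API_BASE,
            params={"api_key": API_KEY, "size": PAGE_SIZE, "from": start},
        )
        resp.raise_for_status()
        page = resp.json().get("repos", [])
        if not page:
            return
        yield from page

# Missing and empty-string vcs values are tallied separately, as above.
vcs_counts = Counter(repo.get("vcs", "<missing>") for repo in iter_repos())
print(vcs_counts.most_common())
```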

gmkarl commented 4 years ago

I tried doing a simple 'svn co ${url}' on each repositoryURL (so no authentication was performed). It only worked for two projects: https://code.gov/projects/doe_office_scientific_technical_information_osti_1_kepler and https://code.gov/projects/doe_office_scientific_technical_information_osti_1_zeptoos . What did I miss?
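The probe was essentially this (a sketch; the urls list and the timeout value are placeholders for illustration):

```python
import subprocess
import tempfile

def svn_checkout_works(url, timeout=120):
    """Return True if an anonymous 'svn co' of the URL succeeds."""
    try:
        with tempfile.TemporaryDirectory() as dest:
            result = subprocess.run(
                ["svn", "checkout", "--non-interactive", url, dest],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Hypothetical input: the repositoryURL values pulled from the API.
urls = ["https://example.gov/svn/some-project"]
reachable = [u for u in urls if svn_checkout_works(u)]
```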

leebrian commented 4 years ago

What kind of metadata can you extract from those checkouts? Are you able to populate the code.json elements?

gmkarl commented 4 years ago

They're just source code repositories, containing branch and tag names, source files, and detailed change history, and that's it. Human intervention would be needed for many fields, but you could auto-populate things like vcs=svn and maybe offer guesses for things like releases, license, e-mail, or description based on repo content. I'm guessing that even a bare-bones code.json file would be helpful here.
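As a sketch of that kind of auto-population, assuming the conventional trunk/branches/tags layout (the function name and the releases-from-tags guess are mine, not part of scraper):

```python
import subprocess
import xml.etree.ElementTree as ET

def guess_code_json_fields(checkout_path):
    """Derive a minimal code.json-style record from an SVN checkout."""
    info = subprocess.run(
        ["svn", "info", "--xml", checkout_path],
        capture_output=True, text=True, check=True,
    )
    entry = ET.fromstring(info.stdout).find("entry")
    repo_root = entry.findtext("repository/root")

    # In the conventional layout, tag names are a rough stand-in for
    # releases; repos without a /tags directory just yield an empty list.
    tags = subprocess.run(
        ["svn", "list", repo_root + "/tags"],
        capture_output=True, text=True,
    )
    releases = [t.rstrip("/") for t in tags.stdout.splitlines()]

    return {
        "repositoryURL": repo_root,
        "vcs": "svn",
        "version": releases[-1] if releases else None,
    }
```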