aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

cocoapods.py generates some JSON-related URLs that lead to a 404 #3715

Open johnmhoran opened 7 months ago

johnmhoran commented 7 months ago

In connection with a purl2url issue in packageurl-python, I've been exploring the URL-related code in packagedcode's cocoapods.py. With the four PURL spec examples for cocoapods,

pkg:cocoapods/AFNetworking@4.0.1
pkg:cocoapods/MapsIndoors@3.24.0
pkg:cocoapods/ShareKit@2.0#Twitter
pkg:cocoapods/GoogleUtilities@7.5.2#NSData+zlib

I got the following results looking for potentially useful JSON files. Using the pattern

f'https://raw.githubusercontent.com/CocoaPods/Specs/blob/master/Specs/{hashed_path}/{name}/{version}/{name}.podspec.json'

from the api_data_url variable in get_urls(), we get the following URLs, each of which leads to a 404: Not Found page:

A few lines above that pattern in get_urls() is a pattern for the specs_json_cdn_url variable:

f'https://cdn.cocoapods.org/Specs/{hashed_path}/{name}/{version}/{name}.podspec.json' 

For the same four cocoapods PURLs, this pattern generates valid URLs to cdn.cocoapods.org JSON files:

BTW, some of the URL data in these four .json files is not (or no longer) valid, e.g., the "homepage" URL for ShareKit (http://getsharekit.com/) leads to what might be a Turkish-language page -- https://smartem.org/. (The "source"/"git" URL (https://github.com/ShareKit/ShareKit.git) is valid and reflects that the last commit was made in December 2017.)

pombredanne commented 7 months ago

Remove /blob from the URLs:

pombredanne commented 7 months ago

So use instead: f'https://raw.githubusercontent.com/CocoaPods/Specs/master/Specs/{hashed_path}/{name}/{version}/{name}.podspec.json'
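For illustration, here is a minimal sketch of building both URLs without the /blob segment. It is not the actual cocoapods.py code: the function names and constants are hypothetical, and the hashed_path() scheme (first three hex digits of the MD5 of the pod name, joined by slashes) is an assumption about how the CocoaPods Specs repo is sharded.

```python
import hashlib

# Base URLs taken from the two patterns discussed in this issue.
RAW_BASE = "https://raw.githubusercontent.com/CocoaPods/Specs/master/Specs"
CDN_BASE = "https://cdn.cocoapods.org/Specs"


def hashed_path(name):
    """Hypothetical helper. Assumed sharding scheme: the first three
    hex digits of md5(name), separated by slashes."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "/".join(digest[:3])


def podspec_urls(name, version):
    """Hypothetical helper returning both podspec JSON URLs for a pod.

    The caller is expected to pass the bare pod name, without any
    PURL subpath such as "#Twitter".
    """
    suffix = f"{hashed_path(name)}/{name}/{version}/{name}.podspec.json"
    return {
        # Note: no "/blob" segment on raw.githubusercontent.com.
        "api_data_url": f"{RAW_BASE}/{suffix}",
        "specs_json_cdn_url": f"{CDN_BASE}/{suffix}",
    }


if __name__ == "__main__":
    for key, url in podspec_urls("AFNetworking", "4.0.1").items():
        print(key, url)
```

Both URLs share the same path suffix, so only the base differs between the raw GitHub and CDN variants.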

pombredanne commented 7 months ago

> BTW, some of the URL data in these four .json files is not (or no longer) valid,

This would be another issue entirely ... We should have separate ways to crawl and tag invalid or dead URLs, and this would likely be implemented in the PurlDB, as some improver.
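As a rough idea of what such tagging could look like, here is a standard-library sketch. It is not PurlDB code: http_status() and tag_urls() are hypothetical names, and the "ok" / "dead" / "unreachable" labels are made up for illustration.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if the
    host cannot be reached at all."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as HTTPError; keep the code.
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def tag_urls(urls, get_status=http_status):
    """Map each URL to a hypothetical tag based on its status code.

    get_status is injectable so the tagging logic can be exercised
    without network access.
    """
    tags = {}
    for url in urls:
        status = get_status(url)
        if status is None:
            tags[url] = "unreachable"
        elif status >= 400:
            tags[url] = "dead"
        else:
            tags[url] = "ok"
    return tags
```

One caveat: urlopen follows redirects, so a homepage that now redirects to an unrelated site would still be tagged "ok". Catching cases like that would need a comparison of the final URL against the original, which this sketch does not attempt.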