Does the projects endpoint need to be case sensitive

spencerschrock commented 12 months ago

Scorecard uses the /projects.json?url= endpoint when checking a project's best practices badge status. We've have an open issue (https://github.com/ossf/scorecard/issues/3466) where some projects call scorecard with a different capitalization than their official repo name.

For example, github.com/kubearmor/KubeArmor was effectively running scorecard with:

scorecard --repo github.com/kubearmor/kubearmor

Which makes an http call to /projects.json?url=https://github.com/kubearmor/kubearmor, which provides an empty response. Compared to the expected call of /projects.json?url=https://github.com/kubearmor/KubeArmor which is a hit.

GitHub has an API call which returns the "official" capitalization of a repo, so Scorecard can likely satisfy this requirement when we make the request, but opening an issue in case this was unintentional.

I know the code in question is here, but I'm not much of a Ruby guy. There's also some comments about indices and efficiency, so feel free to close this if it's not feasible. https://github.com/coreinfrastructure/best-practices-badge/blob/91b34744bb57aa7f4639e6192a9ca21597ca01b1/app/models/project.rb#L123-L132

david-a-wheeler commented 12 months ago

Hmmm.

In the general case the URL pathname can be case-sensitive, so it's best to treat it that way. Git is also normally case-sensitive (depending on the underlying filesystem). Some specific systems have pathnames that are case-insensitive, but there's no obvious way to determine which is which. We support arbitrary repos, not just GitHub and GitLab.

It's true that the domain name is not case sensitive when it's ASCII, per IETF RFC 4343. E.g., "I" and "i" are considered the same (apologies to those who speak Turkish).

Handling this in the general case is hard. Here's one idea:

Do a case-sensitive search. If that works, use it.
If that fails, but a case-insensitive search finds a result, use that.

What do you think?

TonyLHansen commented 12 months ago

wouldn't URI.parse allow you to grab the path separately from the fqdn?

spencerschrock commented 12 months ago

Handling this in the general case is hard. Here's one idea:

Do a case-sensitive search. If that works, use it.

If that fails, but a case-insensitive search finds a result, use that.

To clarify, is this your proposal for scorecard or for the best practices API?

david-a-wheeler commented 11 months ago

wouldn't URI.parse allow you to grab the path separately from the fqdn?

Yes, it's definitely possible. The problem is "what to do with the information". Whether or not the path is case-sensitive depends on the details of the specific system being queried. It can even change over time for a given system being queried. I think "case-sensitive first, then case-insensitive" covers all cases and is simpler to implement.

To clarify, is this your proposal for scorecard or for the best practices API?

I'm thinking of this as a proposal for the best practices badge, as this is an issue against the best practice badge.

This might make sense to do this in Scorecard as well, but I think that should be a different issue in that case.

spencerschrock commented 11 months ago

It can even change over time for a given system being queried. I think "case-sensitive first, then case-insensitive" covers all cases and is simpler to implement.

I think the only case this doesn't cover is a false match. Consider a host where path is case sensitive, and there are two projects, but only one is in the best practices dataset:

foo.com/x/y (not in the dataset)
foo.com/X/Y (in the dataset)

A request for foo.com/X/Y would successfully case match and return the intended project. A request for foo.com/x/y would miss the exact match and return the wrong project during the case insensitivity.

david-a-wheeler commented 8 months ago

@spencerschrock - you're right, this approach does risk a false match. I think the risk is low, but it does give pause. I can't think of another approach though, so I think we end up with two possibilities:

Close this unchanged.
Make the change proposed (case-sensitive then case-insensitive).

Anyone have a third way?

TonyLHansen commented 8 months ago

I say: accept the risk and go with case-sensitive then case-insensitive.

coreinfrastructure / best-practices-badge

Does the projects endpoint need to be case sensitive #2063