Open sschuberth opened 3 years ago
Sorry, I don't know much about casing. @MichaelTsengLZ @jeffmcaffer or @jeffmendoza may be able to help here.
I've proposed a PR at https://github.com/clearlydefined/curated-data/pull/7838.
Thanks @sschuberth for finding this. I vaguely recall doing a deep dive on case preservation and sensitivity and feel like there's a doc or issue around that talks about our handling strategy. Seem to recall that Dan did this work back in the day. Can't find it right now. On the surface it would appear that this curation is the right thing to do. I am however concerned about the root cause and whether this this change will either perpetuate or exacerbate the problem (the latter seems unlikely).
It is entirely possible that the actual package changed their casing at some point in time. That seems especially possible in the NuGet world (seen it before). In that case (hah! pun) we'd have to fall back on the aforementioned casing analysis. My vague recollection is that we ended up at case preserving but case insensitive. The basic assumption being that we did not find any providers that were case sensitive (There were some old cases in npm land I think but they were old and appeared to addressed).
/cc @nellshamrell
https://github.com/clearlydefined/service/issues/157 talks about this somewhat at the implementation level. Still feel like there was an analysis of all the supported (and some prime potential) package managers with some summarization / discussion.
At least for NuGet, given that the API requires the package ID to be lower case, maybe it makes sense to align on all lower case file names?
Or actually, does the casing of the file name ever matter? I guess not, as the name
from the coordinates
in the YAML should be taken for any lookups. Which means we could always (and for any package manager) align on lower case file names.
I'm hesitant to make any concrete/general call on this without consulting the prior research. Casing is fraught with oddities. For this particular PR it should be possible to validate that with the changes merged, the curations are still applied to the relevant component/version.
Stepping back a bit, @sschuberth, how did this curation come about with different casing? Is this coming from ORT tooling using lowercase? or some other workflow?
how did this curation come about with different casing?
I have no idea. To the best of my knowledge none of these curations were initially created by ORT or by myself in person. My brief checks just showed that the initial commit each was done by @clearlydefinedbot.
yeah, that likely means that the PR was entered via the website and we are unable to do proper attribution (there was a technical challenge IIRC)
Anyway, I believe that NuGet is case preserving but not case sensitive. The metadata for DiscUtils.Core has mixed case (see below) while the protocol/service, as you point out, lowercases everything. So the user view of the package is mixed case.
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
<metadata>
<id>DiscUtils.Core</id>
<version>0.15.1-ci0007</version>
Reaching way back I believe we our goal was to present users with the most authentic naming so are attempting to preserve the casing. So while the protocol is all lower, the package itself and the user view is mixed.
That still leaves us with the challenge of what casing to use in the underlying implementation given that Git is case-sensitive (your point here). I would have said that if the system is case IN-sensitive, we should toLower() the coordinates for storage. Again, reaching back, I recall there may have been a few places where we look at the storage layer and reverse engineer coordinates. That's not a great and we may have stopped doing that.
HAH! I found the doc https://github.com/clearlydefined/clearlydefined/blob/533afead4c95e3ace9dbd727e2a166d89d229af8/docs/providers.md Unfortunately that doesn't quite help in as much it captures observed behavior, not quite the reasoning why.
Path forward.
service
or clearlydefined
(near the doc) repo issue. Might get lost hereI would have said that if the system is case IN-sensitive, we should toLower() the coordinates for storage. Again, reaching back, I recall there may have been a few places where we look at the storage layer and reverse engineer coordinates. That's not a great and we may have stopped doing that.
So, what's missing for me in the path forward is:
The PR cited here IMO should NOT be merged
Then please feel free to close my PR (with a rationale pointing back to here) so that it does not accidentally get merged.
@nellshamrell - unclear if this Issue can be closed?
It's easy to verify the problem persists. Just try to clone this repo on Windows and you'll still get
Updating files: 100% (6359/6359), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'curations/git/github/Azure/azure-sdk-for-go.yaml'
'curations/git/github/azure/azure-sdk-for-go.yaml'
'curations/nuget/nuget/-/7Zip4PowerShell.yaml'
'curations/nuget/nuget/-/7Zip4Powershell.yaml'
'curations/nuget/nuget/-/CodeProject.ObjectPool.yaml'
'curations/nuget/nuget/-/codeproject.objectpool.yaml'
'curations/nuget/nuget/-/DiscUtils.yaml'
'curations/nuget/nuget/-/Discutils.yaml'
'curations/nuget/nuget/-/Elmah.MVC.yaml'
'curations/nuget/nuget/-/Elmah.Mvc.yaml'
'curations/nuget/nuget/-/EMGU.CV.yaml'
'curations/nuget/nuget/-/Emgu.CV.yaml'
'curations/nuget/nuget/-/EnvDTE.yaml'
'curations/nuget/nuget/-/envdte.yaml'
'curations/nuget/nuget/-/EnvDTE100.yaml'
'curations/nuget/nuget/-/envdte100.yaml'
'curations/nuget/nuget/-/EnvDTE80.yaml'
'curations/nuget/nuget/-/envdte80.yaml'
'curations/nuget/nuget/-/EnvDTE90.yaml'
'curations/nuget/nuget/-/envdte90.yaml'
'curations/nuget/nuget/-/EnvDTE90a.yaml'
'curations/nuget/nuget/-/envdte90a.yaml'
'curations/nuget/nuget/-/GitLink.yaml'
'curations/nuget/nuget/-/gitlink.yaml'
'curations/nuget/nuget/-/Jint.yaml'
'curations/nuget/nuget/-/jint.yaml'
'curations/nuget/nuget/-/LINQKit.yaml'
'curations/nuget/nuget/-/LinqKit.yaml'
'curations/nuget/nuget/-/Microsoft.AspNetCore.Authentication.JwtBearer.yaml'
'curations/nuget/nuget/-/microsoft.aspnetcore.authentication.jwtbearer.yaml'
'curations/nuget/nuget/-/MongoDB.LibMongocrypt.yaml'
'curations/nuget/nuget/-/MongoDB.Libmongocrypt.yaml'
'curations/nuget/nuget/-/MSBuildTasks.yaml'
'curations/nuget/nuget/-/MsBuildTasks.yaml'
'curations/nuget/nuget/-/PDFsharp-MigraDoc-GDI.yaml'
'curations/nuget/nuget/-/PDFsharp-MigraDoc-gdi.yaml'
'curations/nuget/nuget/-/React.Core.yaml'
'curations/nuget/nuget/-/react.core.yaml'
'curations/nuget/nuget/-/Semver.yaml'
'curations/nuget/nuget/-/semver.yaml'
'curations/nuget/nuget/-/StructureMap.yaml'
'curations/nuget/nuget/-/structuremap.yaml'
'curations/nuget/nuget/-/System.Data.HashFunction.SpookyHash.yaml'
'curations/nuget/nuget/-/system.data.hashfunction.spookyhash.yaml'
'curations/nuget/nuget/-/System.Windows.Interactivity.WPF.yaml'
'curations/nuget/nuget/-/system.windows.interactivity.wpf.yaml'
'curations/nuget/nuget/-/TinyMCE.JQuery.yaml'
'curations/nuget/nuget/-/TinyMCE.jQuery.yaml'
'curations/nuget/nuget/-/ValueInjecter.yaml'
'curations/nuget/nuget/-/valueinjecter.yaml'
'curations/nuget/nuget/-/VsWhere.yaml'
'curations/nuget/nuget/-/vswhere.yaml'
'curations/nuget/nuget/-/Xamarin.GooglePlayServices.Gcm.yaml'
'curations/nuget/nuget/-/xamarin.googleplayservices.gcm.yaml'
'curations/pypi/pypi/-/Pillow.yaml'
'curations/pypi/pypi/-/pillow.yaml'
'curations/pypi/pypi/-/Resource.yaml'
'curations/pypi/pypi/-/resource.yaml'
'curations/pypi/pypi/-/wxPython.yaml'
'curations/pypi/pypi/-/wxpython.yaml'
@ariel11 let's keep it open. I don't know when I will have time to address this, but it's good to keep it on our backlog.
From what I do I use lower case for my files one it's faster to write and to the mainframe will except lower case for filing. And it keeps it from working harder. You start doing fancy stuff then there become problems.as long as there is great security for the files
When cloning this repo on Windows I get:
I believe the contents of these pairs of files should be merged, and the capitalization should be aligned to the spelling used in the respective registry. Would you agree with that rationale @capfei @fossygirl?