require minimal edit distance for new package names

StefanKarpinski commented 5 years ago

We should require manual review of new packages whose names are less than some edit distance from an existing package name: https://discourse.julialang.org/t/announcement-automatic-merging-for-the-general-registry/29961/10.

DilumAluthge commented 5 years ago

Agree. Should not be too difficult. Step one: get a list of all of the package names in the general registry. This is pretty fast. Step two: for each package in the general registry, compute the edit distance with the name of the proposed package. If at any point the edit distance is less than our threshold, we can short circuit and return a decision of no auto merge.

We need to make a few decisions. Decision 1: which edit distance algorithm to use? Hamming? Levenshtein? Decision 2: what cut off to use?

DilumAluthge commented 5 years ago

Decision three: should this be only for new packages? Or should we do this for both new packages and new versions of existing packages?

DilumAluthge commented 5 years ago

Maybe we can keep a white list of packages that we know are legitimate and have similar names? For example, we know that PGFPlots and PGFPlotsX are both legitimate (not malicious) packages. And then if you are not in the white list then we will require manual review even for new versions of existing packages.

DilumAluthge commented 5 years ago

It might also be a good idea to take the current version of the general registry and run our edit distance algorithm pairwise on all packages currently in the registry. That way we can get a sense of what the current situation is.

StefanKarpinski commented 5 years ago

https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance because transpositions are an easy, common mistake that shouldn't count as two edits.
A cutoff of 1 may be sufficient given that D-L allows transpositions, but 2 would be safe. You can short-circuit the computation of D-L to stop when a safe edit distance is reached, but I kind of doubt we need that kind of performance here—a few thousand words is not very many.
Only new packages, although we may want to review the set of existing names for potentially malicious collisions that are already there (I highly doubt it since we've got actual humans reviewing merges so far).

StefanKarpinski commented 5 years ago

Also, note that I'm not suggesting that we disallow names that are too close, I'm simply saying that if the edit distance is too small, we require manual approval, so this is an automation guideline like all the others, not a hard requirement. The main concern is preventing people from squatting typos.

DilumAluthge commented 5 years ago

100% agree with all of the above

DilumAluthge commented 5 years ago

Here is some code that runs the edit distance on the current registry. cd to the directory containing the registry, and then run the code. You need to install the StringDistances package.

using Pkg
using StringDistances

function get_all_package_names(registry_dir::AbstractString)
    packages = [x["name"] for x in values(Pkg.TOML.parsefile(joinpath(registry_dir, "Registry.toml"))["packages"])]
    sort!(packages)
    unique!(packages)
    return packages
end

@inline function damerau_levenshtein(a::AbstractString, b::AbstractString)
    a_clean = lowercase(strip(a))::String
    b_clean = lowercase(strip(b))::String
    return evaluate(DamerauLevenshtein(), a_clean, b_clean)
end

pkgs = get_all_package_names(pwd())
possible_collisions = Vector{Tuple{String, String, Int}}(undef, 0)
for i = 1:length(pkgs)
    for j = (i+1):length(pkgs)
        a = pkgs[i]
        b = pkgs[j]
        distance = damerau_levenshtein(a, b)
        if distance <= 2
            result = (a, b, distance)
            println(result)
            push!(possible_collisions, result)
        end
    end
end

DilumAluthge commented 5 years ago

Actually, that version will return a lot of false positives, just because short package names (length < 4) will have low edit distances to other short package names.

Here's a version of the code that excludes package names of length < 4.

pkgs = get_all_package_names(pwd())
possible_collisions = Vector{Tuple{String, String, Int}}(undef, 0)
for i = 1:length(pkgs)
    for j = (i+1):length(pkgs)
        a = pkgs[i]
        b = pkgs[j]
        if length(a) > 4 && length(b) > 4
            distance = damerau_levenshtein(a, b)
            if distance <= 2
                result = (a, b, distance)
                println(result)
                push!(possible_collisions, result)
            end
        end
    end
end

ericphanson commented 4 years ago

I've been trying this code out locally. First, I think JLL packages should be excluded from this or treated separately; they already receive manual review at Yggdrasil, so I think it's an unnecessary hurdle for them here (and also they don't get to choose their own names usually).

For the matching criterion: it seems like there are two related choices to make: normalization, and allowed distances (perhaps as a function of length).

I think normalization as @DilumAluthge did (lowercase and stripping) makes a lot of sense, since it seems easy to forget to capitalize. For example, it's easy to type Jump or JumP instead of JuMP. Even just disallowing distance-0 collisions after normalization seems like an easy win; two packages shouldn't differ only by capitalization, since that is just confusing.

However, there is a tradeoff between distance and normalization. If we don't normalize but disallow distance-2 collisions, then both Jump and JumP are disallowed anyway (and JumP is ruled out by disallowing just distance-1 collisions). I think normalization + disallowing distance-2 collisions is probably overkill, and causes a lot of false positives. For example, there are 137 distance-2 collisions (using Dilum's normalization) of pairs of packages (a,b) such that length(a)+length(b) > 10 (restricting to longer names to prevent trivial overlaps):

("AIControl", "DFControl")
("AMQPClient", "SMTPClient")
("AWSS3", "AWSSDK")
("AWSS3", "AWSSQS")
("AWSSDK", "AWSSQS")
("Actors", "Attrs")
("AdvancedHMC", "AdvancedMH")
("Agents", "Events")
("Alpaca", "Arpack")
("Arpack", "PROPACK")
("Arpack", "UnPack")
("Atmosphere", "ISAtmosphere")
("BEAST", "Beauty")
("BIDSTools", "BioTools")
("BKTrees", "D3Trees")
("BKTrees", "VPTrees")
("BeaData", "BlsData")
("BigArrays", "DimArrays")
("BioTools", "IRTools")
("BitFlags", "BitFloats")
("CFTime", "CPUTime")
("CMPFit", "CMPlot")
("CMPlot", "PyPlot")
("COESA", "Caesar")
("CSVFiles", "FCSFiles")
("CVortex", "VerTeX")
("Caching", "Packing")
("Caesar", "Catsay")
("CatViews", "FFTViews")
("Catlab", "Catsay")
("ChrBase", "StrBase")
("Clang", "Dolang")
("Clustering", "ClusteringGA")
("Clustering", "DPClustering")
("CodecLz4", "CodecXz")
("CodecXz", "Codecs")
("Codecs", "NodeJS")
("Coulter", "Counters")
("CuArrays", "GPUArrays")
("D3Trees", "VPTrees")
("DBInterface", "ODEInterface")
("DFTforge", "GitForge")
("DIVAnd", "Diana")
("Dagger", "Swagger")
("DataArrays", "MetaArrays")
("DataIO", "FastaIO")
("Destruct", "ToStruct")
("DiffEqBase", "DiffEqBayes")
("DimArrays", "SymArrays")
("Distributions", "DistributionsAD")
("EchoviewEcs", "EchoviewEvr")
("EcoBase", "YaoBase")
("Einsum", "OMEinsum")
("EvoTrees", "VPTrees")
("ExportAll", "ImportAll")
("FEMBase", "FEMBasis")
("FEMBase", "HMMBase")
("FEniCS", "HELICS")
("FITSIO", "FileIO")
("FTPClient", "HTTPClient")
("FTPClient", "SMTPClient")
("GLMakie", "Makie")
("GPUArrays", "GeoArrays")
("GetGene", "TetGen")
("GracePlot", "GraphPlot")
("GraphIO", "Graphics")
("GraphIO", "Graphs")
("Graphics", "Graphs")
("HMMBase", "MLBase")
("HTTPClient", "SMTPClient")
("IJulia", "JuLIP")
("IJulia", "OMJulia")
("IJulia", "SimJulia")
("IRTools", "IterTools")
("Images", "Pages")
("InPlace", "Inflate")
("JLBoost", "XGBoost")
("JSCall", "PyCall")
("JSCall", "RCall")
("Joseki", "Mosek")
("JuMPeR", "Juniper")
("JuliaDB", "JuliaZH")
("JuliaDB", "julia")
("JuliaZH", "julia")
("LRSLib", "SASLib")
("LasIO", "Plasmo")
("Lasso", "Plasmo")
("LibPQ", "LibPSF")
("LinearMaps", "LinearMapsAA")
("MATLAB", "MatLang")
("MDDatasets", "NCDatasets")
("MDDatasets", "RDatasets")
("MINPACK", "MsgPack")
("MINPACK", "UnPack")
("MLDatasets", "NCDatasets")
("MLDatasets", "NLIDatasets")
("MLDatasets", "RDatasets")
("MLJModels", "NLPModels")
("Match", "Rematch")
("MeshArrays", "MetaArrays")
("MeshIO", "Meshing")
("Mocking", "Packing")
("Modia", "Modia3D")
("MortarContact2D", "MortarContact2DAD")
("NCDatasets", "NLIDatasets")
("NCDatasets", "RDatasets")
("NFLTables", "Nullables")
("NaiveGAflux", "NaiveNASflux")
("OMJulia", "SimJulia")
("OMJulia", "julia")
("OPFSampler", "PDSampler")
("Packing", "Tracking")
("Pandas", "Pandoc")
("Pidfile", "ZipFile")
("Plotly", "PlotlyJS")
("Plotly", "Plots")
("PosDefManifold", "PosDefManifoldML")
("Primes", "Vimes")
("PyCall", "RCall")
("RData", "RDates")
("Rematch", "Rmath")
("ResumableFunctions", "ReusableFunctions")
("RobotOS", "Roots")
("ScHoLP", "ZChop")
("Slack", "Slacker")
("Slacker", "Tracker")
("StanBase", "StatsBase")
("StanBase", "StrBase")
("StanModels", "StatsModels")
("Tensors", "Xtensor")
("Traceur", "Tracker")
("UAParser", "URIParser")
("Unitful", "UnitfulMR")
("Unitful", "UnitfulUS")
("UnitfulMR", "UnitfulUS")
("WinRPM", "WinReg")
("YaoBase", "YaoQASM")

A lot of these seem pretty different, and I think it would be annoying to need to manually merge e.g. RobotOS because it's similar to Roots. (And those are in addition to distance-1 collisions, collisions from packages with short names, etc).

I think a good balance is to normalize and only disallow distance-1 collisions. With this choice, we have 76 false positives (out of ~3 million pairs):

("ACME", "ADCME")
("AMD", "Amb")
("BAT", "MAT")
("BDF", "EDF")
("BDF", "JDF")
("BSON", "JSON")
("Bits", "LITS")
("CRC", "Cbc")
("CSV", "uCSV")
("Catlab", "MATLAB")
("CoDa", "Conda")
("Conda", "Onda")
("Cubature", "HCubature")
("Debugger", "Rebugger")
("Dolo", "YOLO")
("EDF", "JDF")
("EMIRT", "MIRT")
("Fire", "Jfire")
("GAP", "GCP")
("GLM", "Glo")
("GLM", "Gym")
("GLMakie", "WGLMakie")
("GMT", "Git")
("GSL", "HSL")
("GSL", "LSL")
("GSL", "VSL")
("Glo", "Glob")
("Gtk", "ITK")
("H3", "Z3")
("HAML", "YAML")
("HSL", "LSL")
("HSL", "VSL")
("Hive", "Jive")
("IJulia", "julia")
("JDBC", "ODBC")
("JLD", "JLD2")
("JSON", "JSON2")
("JSON", "JSON3")
("JSON2", "JSON3")
("JWAS", "JWTs")
("JuLIP", "Tulip")
("JuLIP", "julia")
("LSL", "VSL")
("LasIO", "Lasso")
("LasIO", "LazIO")
("LiBr", "Libz")
("MCMCChain", "MCMCChains")
("MDDatasets", "MLDatasets")
("MIDI", "Mimi")
("MLBase", "MLJBase")
("MPIReco", "MRIReco")
("Mads", "Mods")
("Match", "Matcha")
("Media", "Modia")
("NMF", "NMFk")
("NMFk", "NTFk")
("NRRD", "Nord")
("NTFk", "NTNk")
("Ogg", "Org")
("PGFPlots", "PGFPlotsX")
("QDates", "RDates")
("RSCG", "Rsvg")
("SCS", "WCS")
("SMC", "SMM")
("SMM", "SOM")
("Sass", "Soss")
("StanMCMCChain", "StanMCMCChains")
("StanSample", "StanSamples")
("StatPlots", "StatsPlots")
("StrFs", "Strs")
("Tar", "Taro")
("Tar", "Tau")
("TeXTable", "TexTables")
("TreeView", "TreeViews")
("XSim", "Xsum")
("YAJL", "YAML")

23 of these pairs are with both packages being 3 characters long; we could do something special there to relax the criterion, but probably it's not worth it (since we shouldn't incentivize using up that namespace).

The real problem though, of course, is false negatives, which we can't just check against the existing registry. For example, pretend TexTables doesn't exist, and we wish to typo-squat TeXTable. Then if you try just adding an s, it will fall under our distance-1 rule. But if you also swap the lowercase "L" for a capital "i", then it will be distance-2, and still look pretty close to the original. (Yes, you probably wouldn't type that in by mistake, but you might copy it from a malicious website or github page).

This motivates ruling out distance-2 collisions as well. But perhaps we could just try to also normalize out these differences, by e.g. replacing all "L"s with "i"s, e.g.

(s -> replace(s,  "l" => "i")) ∘ lowercase ∘ strip

This change adds just 8 false positives to the 76 above:

 ("Bio", "Glo")    
 ("Clp", "SCIP")   
 ("GLM", "Git")    
 ("GLTF", "Git")   
 ("Git", "Glo")    
 ("MIDI", "Mill")  
 ("Mill", "Mimi")  
 ("Phylo", "PlyIO")

There's a table on https://www.ismp.org/resources/misidentification-alphanumeric-symbols with other misidentifiable symbols, but lowercase "L" and uppercase "i" look to be by far the worst.

TL;DR:

The criterion I think might strike a good balance between false positives and false negatives is:

Ignore JLL packages
Normalize package names before comparison via (s -> replace(s, "l" => "i")) ∘ lowercase ∘ strip (though I'm not sure the strip is needed)
Disallow automerging of packages yielding distance 1 (or less) collisions in Damerau-Levenshtein distance with another package

Any adversarial examples against this criterion would be great :).

DilumAluthge commented 4 years ago

I think it would be annoying to need to manually merge e.g. RobotOS because it's similar to Roots.

It's worth noting that this would only apply to the first time RobotOS is merged as a new package. So the first time RobotOS is registered, the new package registration for RobotOS would need to be manually merged. But for all subsequent registrations of RobotOS, because they would now be new version registrations, those registrations would be automerged.

DilumAluthge commented 4 years ago

Ignore JLL packages

Agree.

Normalize package names before comparison via (s -> replace(s, "l" => "i")) ∘ lowercase ∘ strip (though I'm not sure the strip is needed)

Agree 100%. This is a great idea. There's no harm in including the strip.

Disallow automerging of packages yielding distance 1 (or less) collisions in Damerau-Levenshtein distance with another package

Personally, I would be fine using distance 2 (or less) as the criterion. I'd rather have false positives than false negatives. And since we are only enforcing this criterion at new package registration time, I don't think the manual merge burden will be too much.

StefanKarpinski commented 4 years ago

Some thoughts... I wonder if it might not be fruitful to not consider all differences equally significant. One upper versus lowercase difference may be an attack, but when most of the letters are a difference case, that's pretty clearly different. Some letters are very easily confusable, whereas others are not: if the difference is between X and S, that's pretty obvious. If the difference is between I and l, that's something to be worried about. We may want to be a bit more nuanced about which changes are worth worrying about.

ericphanson commented 4 years ago

That makes sense-- building on that, I think it’s good to start with an idealized distance which describes what we actually want, and then try to find some way to approximately compute it (/ estimate it), which we can feasibly do. The following is my attempt to have a start at it.

Idealized distance: Let p be the probability someone would type package X when they meant to type package Y. Then let the distance between X and Y be 1-p (or some other strictly monotone decreasing function of p, like -log(p)).

To estimate p, we need to know the ways people type packages:

Copy from somewhere (e.g. malicious website)
Type manually

In the first case, what matters is how close the two packages’ written names look. In the second case, what matters is how people mistype.

I suspect the second case is more common, but if two package names are visually confusing that seems generally bad anyway, so I think it’s reasonable to weight the two equally (if we even have a way to do so in the end).

Next, we want to estimate each of

The probability the user sees the name of package Y and thinks it looks like the name of package X.
The probability the user enters the name of package Y when meaning to enter the name of package X by making typos.

In the second case, we can look at empirical typo frequencies— some of the answers to https://stackoverflow.com/questions/3419400/real-world-typo-statistics look like they might be helpful (and maybe there’s good data elsewhere).

For the first case, the paper Inverse discrimination time as a perceptual distance for alphabetic characters: Visual Cognition: Vol 11, No 7 on p. 908 gives a table of inverse “letter discrimination times” to define a metric based on how long it took experimental participants to discriminate between two letters. We could use this as a proxy metric for visual confusion. They only tested lowercase letters in size 12 Arial, however. Also, looking just at individual letters is definitely missing something (e.g. rn kind of looks like m), but does have the advantage of simplicity (only 26x26 pairs). But maybe there is some extension of this or related work that could help (I didn’t find anything yet though).

I think the typo case is generally easier to treat than the visual-confusion one, since it seems like for typos, each substitution or deletion is independent (though the probabilities depend on the letters in question), whereas for visual similarity, bigrams etc are likely important too.

We could estimate a distance for each (typo-similarity and visual similarity), which we hope is proportional to some strictly decreasing function of the probability of making these errors, and then e.g. add them, to come up with an overall similarity distance between the two package names.

StefanKarpinski commented 4 years ago

I wonder if it would be over the top to use GNU Unifont to compute some kind of visual distance between characters. The fact that Unifont is a bitmap font is actually useful for this application (whereas it makes the font not ideal for most modern uses). The simplest metric would be some kind of pixel distance metric. Slightly more fancy would be allowing the two characters to slide around in the plane to minimize the pixel difference, which would handle cases where, e.g. one character is a pixel left of another one but otherwise they look the same.

StefanKarpinski commented 4 years ago

Another font family with broad coverage: https://en.wikipedia.org/wiki/Noto_fonts. Note that we don't actually need all of this since package names must be valid Julia identifiers.

ericphanson commented 4 years ago

I wonder if it would be over the top to use GNU Unifont to compute some kind of visual distance between characters. The fact that Unifont is a bitmap font is actually useful for this application (whereas it makes the font not ideal for most modern uses). The simplest metric would be some kind of pixel distance metric. Slightly more fancy would be allowing the two characters to slide around in the plane to minimize the pixel difference, which would handle cases where, e.g. one character is a pixel left of another one but otherwise they look the same.

Just to say that I’ve been trying this out locally, using an “optimal transport” based metric, and it seems to work pretty well. For example, both “LasIO” v “LazIO” and “WGLMakie” and “GLMakie” are 1-apart in DL distance, but the visual optimal transport metric puts LasIO and LazIO very close while WGLMakie and GLMakie rather far, which checks out to me.

It still needs some tuning and refinement, and I think choosing a particular cutoff might be tricky, but I’ll try to put up some code soon to discuss. I’m thinking of putting it in a package called something like “VisualStringDistances”, which can be transferred to this org or such later.

ericphanson commented 4 years ago

I ended up going on a bit of a longer journey with this than I had planned, but step 1 of 2 is done: an implementation of "unbalanced Sinkhorn divergences" in UnbalancedOptimalTransport.jl. It's a pretty simple algorithm translated from a paper, and I tried to document and test it well. I ended up separating it out from the rest (described below), because it seemed self-contained and independently useful, and is dependency-free.

Step 2 is VisualStringDistances.jl, which loads the Unifont data, extracts the bitmap and represents it sparsely, then uses UnbalancedOptimalTransport for the Sinkhorn divergences to measure distances between strings. This one is still a work-in-progress: few tests or docs, code can probably be cleaned up, and there's several parameters to choose to figure out what makes a good distance (e.g. how much to penalize mass creation / destruction).

ericphanson commented 4 years ago

As an update on this, I gave a JuliaCon talk on VisualStringDistances.jl to describe the technique and have played around with it a bit; in the docs here I discuss a bit the application to package names.

My current thinking is that a weighted edit distance makes a lot of sense:

it can still take into account perceptual distances between the characters
it’s fast: we just need to generate a table between characters once and then use that in the weighted edit distance (I don’t think there are any packages with non-ascii names so we could just add that as a check too and only need weights between ascii letter/number characters).
it’s easy to describe and explain for when names get flagged
we could incorporate mirror symmetry and other human perceptual biases that the optimal transport based distance can’t pick up on.

I think the weighted bit is key, because it provides much more fine-grained information, and we might even be able to set the tolerance so low that any current packages would not have been flagged but tricky I vs l ones would be.

We can also use a typo-based weighted edit distance too (ie which typos are common on QWERTY keyboards).

Therefore I think the next step on this quest might be https://github.com/matthieugomez/StringDistances.jl/issues/25. We could use that for both a typo-based distance and a perceptual one by using different weights.

If we just want to get something done first though, we could use visual distance; I think it’s a decent option, and the speed issues can be handled by gating the check behind a looser edit-distance based check. (Remembering as @StefanKarpinski said on discord that perfect is the enemy of good!).

JuliaRegistries / RegistryCI.jl

require minimal edit distance for new package names #10