denoland / deno_registry2

The backend for the deno.land/x service
https://deno.land/x
MIT License
92 stars, 14 forks

Moderation filters #58

Open lucacasonato opened 4 years ago

lucacasonato commented 4 years ago

We should automatically moderate the names of modules people are uploading. I think we can start with these three steps (ordered by priority):

  1. Add a list of reserved module names that cannot be registered automatically. Easiest would be a JSON file with an array of disallowed names. (@lucacasonato)
  2. Check any new module name against a list of 'bad' words. We need to find a list to use (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt is not good, as it blocks everyday words like color, queer, or africa). (up for grabs)
  3. Disallow any module names that have a Levenshtein distance of less than 3 to any other existing module name, bad word, or reserved module name. (up for grabs)
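
A minimal sketch of how steps 1 and 2 might look, assuming both lists are plain arrays of lowercase strings. The names `RESERVED_NAMES`, `BAD_WORDS`, and `checkModuleName`, the list entries, and the choice of substring matching for step 2 are all hypothetical and only for illustration:

```typescript
// Hypothetical data: in the proposal the reserved names would come from a
// JSON file and the word list from a curated source still to be chosen.
const RESERVED_NAMES: readonly string[] = ["std", "x", "deno"];
const BAD_WORDS: readonly string[] = ["badword"]; // placeholder entry

type NameCheck =
  | { ok: true }
  | { ok: false; reason: "reserved" | "bad_word" };

function checkModuleName(name: string): NameCheck {
  const normalized = name.toLowerCase();

  // Step 1: exact match against the reserved list.
  if (RESERVED_NAMES.includes(normalized)) {
    return { ok: false, reason: "reserved" };
  }

  // Step 2: substring match against the word list (an assumption; exact
  // matching would be stricter about what counts as a hit).
  for (const word of BAD_WORDS) {
    if (normalized.includes(word)) {
      return { ok: false, reason: "bad_word" };
    }
  }

  return { ok: true };
}

console.log(checkModuleName("std"));
console.log(checkModuleName("my_badword_lib"));
console.log(checkModuleName("oak"));
```

With substring matching, an over-broad list rejects any name that merely contains a listed word, which is why the choice of list matters.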

nayeemrmn commented 4 years ago

3. Disallow any module names that have a Levenshtein distance of less than 3 to any other existing module name, bad word, or reserved module name. (up for grabs)

Unless I'm misunderstanding, I'm not sure how this can possibly work. E.g. eslint and tslint, or any two dictionary words that happen to be a letter apart (https://listography.com/spamtastic/words/that_are_one_letter_apart), let alone 2. What do npm or cargo do about this?

lucacasonato commented 4 years ago

Unless I'm misunderstanding I'm not sure how this can possibly work. E.g. eslint and tslint, or any two dictionary words that happen to be a letter apart https://listography.com/spamtastic/words/that_are_one_letter_apart let alone 3.

I am not locked into the exact distance (if 1 gives the desired results, we can do that). What we are trying to prevent is someone registering oak2, oakk, or 0ak, so that if you mistype a name or are not yet familiar with Deno modules, you do not accidentally install the wrong module (which might be malicious). I also don't want someone to publish color and someone else to publish colour. Things like that are confusing.

Yeah, this means that some module names are not available, but I think that cost is worth it.
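
To make the distance discussion concrete, here is a sketch of the standard dynamic-programming Levenshtein distance applied to the names mentioned in this thread. The helper names `levenshtein` and `findTooClose` are hypothetical and not taken from the registry code:

```typescript
// Classic two-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns every existing name closer to the candidate than minDistance.
function findTooClose(
  candidate: string,
  existing: readonly string[],
  minDistance: number,
): string[] {
  return existing.filter((name) => levenshtein(candidate, name) < minDistance);
}

const pairs: [string, string][] = [
  ["oak", "oak2"],
  ["oak", "oakk"],
  ["oak", "0ak"],
  ["color", "colour"],
  ["eslint", "tslint"],
];
for (const [a, b] of pairs) {
  console.log(`${a} / ${b}: ${levenshtein(a, b)}`);
}
```

Every pair listed is at distance 1, so a threshold that rejects oakk next to oak also rejects tslint next to eslint. The metric alone cannot tell a typosquat from an unrelated legitimate name.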

What do npm or cargo do about this?

AFAIK npm does nothing about this (see https://www.npmjs.com/package/exxpress or https://www.npmjs.com/package/expres). For cargo I do not know.

nayeemrmn commented 4 years ago

I don't think it's worth it. With Deno, where you're more likely to get the correct URL by copy-pasting it from somewhere, the mistyping problem should be especially rare, and we can chalk the rest of it up to personal responsibility. There just isn't a nice rhyme or reason to which words are close together in distance: weird names can rule out ubiquitous names just by being there first. And as I said, it's far too common for ordinary words to be a letter apart.

There are better solutions. Have a dictionary for things like color and colour to make them specially mutually exclusive. Allow reputable modules to "claim" similar names (as they would buy similar domains). Use down-scoring based on name similarity to something well-known.
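
The first of these ideas, a dictionary that makes spelling variants mutually exclusive, could be sketched roughly as follows. The table contents and the names `VARIANT_GROUPS`, `canonicalName`, and `conflictsWithExisting` are hypothetical:

```typescript
// Hypothetical variant table: each entry maps alternative spellings to one
// canonical name, so only one spelling per group can be registered.
const VARIANT_GROUPS: readonly { canonical: string; variants: readonly string[] }[] = [
  { canonical: "color", variants: ["colour"] },
  { canonical: "gray", variants: ["grey"] },
];

// Flatten the table into a lookup from any spelling to its canonical form.
const CANONICAL = new Map<string, string>();
for (const group of VARIANT_GROUPS) {
  CANONICAL.set(group.canonical, group.canonical);
  for (const variant of group.variants) {
    CANONICAL.set(variant, group.canonical);
  }
}

function canonicalName(name: string): string {
  const lower = name.toLowerCase();
  return CANONICAL.get(lower) ?? lower;
}

// Returns the existing name that the candidate collides with, if any.
function conflictsWithExisting(
  candidate: string,
  existing: Iterable<string>,
): string | undefined {
  const target = canonicalName(candidate);
  for (const name of existing) {
    if (canonicalName(name) === target) return name;
  }
  return undefined;
}

console.log(conflictsWithExisting("colour", ["color", "oak"]));
console.log(conflictsWithExisting("tslint", ["eslint", "oak"]));
```

Unlike a blanket distance threshold, this blocks colour when color exists but leaves tslint and eslint free to coexist. It only matches whole names, so a variant embedded in a longer name is not caught.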

Soremwar commented 4 years ago

Perhaps this is a common issue with npm, where you mistype a letter and get the wrong module, but the URL system does demand more attention from the user when choosing a library.

Oak2 might be a completely valid name to submit in my opinion.

wperron commented 4 years ago

I can get started on the bad words filter 👌

wperron commented 4 years ago

Found a couple of lists that we could use for the comparison:

@lucacasonato what do you think?

ry commented 4 years ago

@wperron Thanks! Any of those work, I'm sure...

lucacasonato commented 4 years ago

Can't we just combine all three into one?

@wperron Do you think we should store it with the source code, or as a table in the database that we check against?

wperron commented 4 years ago

I don't want to have the list just disappear from under us, so my plan was to copy the list into the project. Tbh, I don't know if creating a collection in Mongo just to store a couple of swear words is really worth it, plus putting it in the repository gives it a lot of visibility, we can link to the file in the README for example.

As for combining all three of them, yes of course we can 😛

lucacasonato commented 4 years ago

plus putting it in the repository gives it a lot of visibility

We might not want that. Getting around it is a lot easier then :-). A database collection makes it a lot harder to find which words are included.

wperron commented 4 years ago

@lucacasonato do you have a list of reserved module names ready to go? I could include that check in #81 while I'm at it

lucacasonato commented 4 years ago

@wperron Reserved module names are now handled as unlisted modules without uploaded versions. Easier because we can store them in the DB that way.

TomokiMiyauci commented 3 years ago

Hi!

I'm making a package name validation library. https://github.com/TomokiMiyauci/is-valid-package-name/tree/beta/deno_land

Deno seems to check module names against the contents of badwords.txt in S3 during validation. Is there a way for me to see the contents of badwords.txt?

wperron commented 3 years ago

Deno seems to check module names against the contents of badwords.txt in S3 during validation. Is there a way for me to see the contents of badwords.txt?

Not currently. badwords.txt is stored in a private S3 bucket with public access blocked: https://github.com/denoland/deno_registry2/blob/main/terraform/main.tf#L110-L134

TomokiMiyauci commented 3 years ago

Yeah, I checked.

putting it in the repository gives it a lot of visibility, we can link to the file in the README for example.

Do you plan to release the file?

wperron commented 3 years ago

Not at the moment, see Luca's answer above

TomokiMiyauci commented 3 years ago

@wperron Thank you for answering