freedomofpress / securethenews

An automated scanner and web dashboard for tracking TLS deployment across news organizations
https://securethe.news
GNU Affero General Public License v3.0
102 stars 25 forks source link

Unicode slugs #48

Closed garrettr closed 7 years ago

garrettr commented 7 years ago

Fixes an issue that we encountered with tass.ru,ТАСС from news_sites.csv. Since ТАСС is all non-ASCII characters, Django's built-in slugify generated an empty slug for it. This issue came to light due to #47, manifesting as a 500 error, because Site.get_absolute_url for ТАСС tried to call reverse with an empty slug, which obviously doesn't work.

This PR:

  1. Uses Mozilla's unicode-slugify to generate Unicode-aware slugs.
  2. Validates that slug != '' before saving Sites to prevent this bug from popping up again in the future.

What do the Unicode URLs look like?

At the moment, the slugs will contain Unicode characters. Note that URLs are restricted to ASCII characters according to the spec; however, it is common practice to percent-encode Unicode characters for use in URLs, and all modern browsers handle this correctly. Here's what it looks like:

screen shot 2016-11-18 at 1 47 45 pm

If you copy-paste the URL, you will see it is actually percent-encoded "behind the scenes": http://localhost:8000/sites/%D1%82%D0%B0%D1%81%D1%81.

Potential alternative

The other alternative is to transliterate non-ASCII characters to ASCII in the manner of unidecode. This can be easily achieved with django-unicode by passing only_ascii=True to slugify.

I am not sure which solution is best, but I do think we should make a final decision about this before the site is launched - changing this after launch will cause URLs to break, and I'd prefer to avoid dealing with setting up redirects for these pages if I can.

conorsch commented 7 years ago

Going to rebase this on latest master and take it for a spin on the staging machine.

conorsch commented 7 years ago

Rebased on top of latest master.

conorsch commented 7 years ago

Errored out trying to pull in the new dependency:

TASK [freedomofpress.django-stack : Setup virtualenv from requirements file] ***
task path: /home/conor/freedomofpress/infrastructure/roles-external/freedomofpress.django-stack/tasks/django_service.yml:30
Friday 18 November 2016  15:24:08 -0800 (0:00:17.116)       0:01:31.764 ******* 
fatal: [securethe.news]: FAILED! => {"changed": false, "cmd": "/home/gcorn/securethenews/bin/pip install -r /var/www/django/securethenews/requirements.txt", "failed": true, "msg": "stdout: Requirement already satisfied (use --upgrade to upgrade): Django>=1.10,<1.11 in /home/gcorn/securethenews/lib/python3.4/site-packages (from -r /var/www/django/securethenews/requirements.txt (line 1))\nRequirement already satisfied (use --upgrade to upgrade): gunicorn==19.6.0 in /home/gcorn/securethenews/lib/python3.4/site-packages (from -r /var/www/django/securethenews/requirements.txt (line 2))\nRequirement already satisfied (use --upgrade to upgrade): Pillow==3.4.2 in /home/gcorn/securethenews/lib/python3.4/site-packages (from -r /var/www/django/securethenews/requirements.txt (line 3))\nRequirement already satisfied (use --upgrade to upgrade): psycopg2==2.6.2 in /home/gcorn/securethenews/lib/python3.4/site-packages (from -r /var/www/django/securethenews/requirements.txt (line 4))\nDownloading/unpacking unicode-slugify==0.1.3 (from -r /var/www/django/securethenews/requirements.txt (line 5))\n  Downloading unicode-slugify-0.1.3.tar.gz\n  Running setup.py (path:/tmp/pip-build-8jv3iwpg/unicode-slugify/setup.py) egg_info for package unicode-slugify\n    Traceback (most recent call last):\n      File \"<string>\", line 17, in <module>\n      File \"/tmp/pip-build-8jv3iwpg/unicode-slugify/setup.py\", line 7, in <module>\n        long_description=open('README.md').read(),\n      File \"/home/gcorn/securethenews/lib/python3.4/encodings/ascii.py\", line 26, in decode\n        return codecs.ascii_decode(input, self.errors)[0]\n    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 361: ordinal not in range(128)\n    Complete output from command python setup.py egg_info:\n    Traceback (most recent call last):\n\n  File \"<string>\", line 17, in <module>\n\n  File \"/tmp/pip-build-8jv3iwpg/unicode-slugify/setup.py\", line 7, in <module>\n\n    long_description=open('README.md').read(),\n\n  File \"/home/gcorn/securethenews/lib/python3.4/encodings/ascii.py\", line 26, in decode\n\n    return codecs.ascii_decode(input, self.errors)[0]\n\nUnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 361: ordinal not in range(128)\n\n----------------------------------------\nCleaning up...\nCommand python setup.py egg_info failed with error code 1 in /tmp/pip-build-8jv3iwpg/unicode-slugify\nStoring debug log for failure in /home/gcorn/.pip/pip.log\n"}

Looks like the setup.py for the module tries to parse the README for a docstring and that's throwing run-of-mill Python string handling errors. Looks like this reverted commit is relevant (it's the latest in the upstream repo): https://github.com/mozilla/unicode-slugify/commit/02c20066e4af8fc5c929a19301f0d8e4bbba629b

garrettr commented 7 years ago

@conorsch Yo dawg I heard you like UnicodeDecodeErrors so I... breaks down in tears

garrettr commented 7 years ago

Interestingly, I did not encounter this error when I ran pip3 install unicode-slugify in my dev VM. There is an open PR for this on the upstream repo (https://github.com/mozilla/unicode-slugify/pull/22), I pinged the maintainer so we'll see if they respond.

In the meantime, maybe there's a potential workaround based on https://github.com/mozilla/unicode-slugify/issues/16#issuecomment-125408979?

conorsch commented 7 years ago

Hmm, I was able to run pip install -r requirements.txt as gcorn inside /var/www/django/securethenews/ interactively on the server, after sourcing the virtualenv, so looks like this may an issue with the role. Will poke around more.

garrettr commented 7 years ago

Re: the "use unicode vs. transliterate unicode" question, it occurs to me that a benefit of transliterating the URLs to ASCII-only is that it makes them easier to type on standard keyboards. I'm not sure if this is actually a benefit or a misapprehension due to my American/native-English bias.

conorsch commented 7 years ago

The fact that URLs support a richer character set than ASCII is reason enough to try to support it. After all, in the context of STN, absolutely no one is going to type out an entire URL to browse to the site—they're going to click on it, possibly share it via email. Let's see what we can do to get it Working Right™, before we fall back into the safety of ASCII-only, with our hands over our ears.

garrettr commented 7 years ago

Ok, Harris pointed out that as of recent versions of Django, django.utils.text.slugify takes an allow_unicode argument, which passes through Unicode correctly. This achieves the same outcome that we had originally achieved with unicode-slugify, but avoids adding an additional dependency (which is causing deployment problems due to https://github.com/mozilla/unicode-slugify/pull/22, and our read of the activity on that repo is that we should not block on them fixing it).

I have tested this locally and confirmed that it is ready to merge.

Two things to remember about this change:

  1. Always derive URLs from Django so the Unicode slugs are handled correctly. This means: use reverse, use {% url %}, and use get_absolute_url. Don't construct URLs by concatenating strings - eventually it will lead to sorrow.
  2. Slugs are generated when models are saved. If you already have a database and you apply these code changes, your slugs will not automatically be fixed. You can fix them by re-saving all the sites using the new code, since Unicode characters were always preserved in Site.name. In addition, all future loads of sites (using ./manage.py loadsites, introduced in aeffb81), will generate unicode-aware slugs.
conorsch commented 7 years ago

@garrettr Great summary. You should post your resolution on the upstream unicode-slugify repo (https://github.com/mozilla/unicode-slugify/pull/22), to aid future searchers in the same vein.