CriticMarkup / CriticMarkup-toolkit

Various tools to use CriticMarkup in your daily workflow

Doesn't work with cyrillic texts #33

Open egalion opened 10 years ago

egalion commented 10 years ago

The current version doesn't work with Cyrillic texts. It fails with a Unicode error.

More specifically:

Unexpected Error:  <type 'exceptions.UnicodeDecodeError'>
Traceback (most recent call last):
  File "criticParser_CLI.py", line 348, in <module>
    h = markdown.markdown(h, extensions=['extra', 'codehilite', 'meta'])
  File "/usr/lib/python2.7/dist-packages/markdown/__init__.py", line 396, in markdown
    return md.convert(text)
  File "/usr/lib/python2.7/dist-packages/markdown/__init__.py", line 266, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128). -- Note: Markdown only accepts unicode input!
Using the Markdown2 module for processing
/path-to-program/CriticMarkup-toolkit/CLI/1.html
Unexpected Error:  <type 'exceptions.UnicodeEncodeError'>
Traceback (most recent call last):
  File "criticParser_CLI.py", line 371, in <module>
    filesource.write(h)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3667-3670: ordinal not in range(128)

I found a workaround after some googling. It may not be very elegant, but it does the job. It applies to the command line tool criticParser_CLI.py. I am not a programmer, so maybe there is a better way to do it.

First, this section

#!/usr/bin/env python

import codecs
import sys
import os
import re
import argparse
import subprocess

should become

#!/usr/bin/env python

import codecs
import sys

reload(sys)
sys.setdefaultencoding('utf8')

import os
import re
import argparse
import subprocess

Then this section

jq = '''<!DOCTYPE html>
<html>
<head><script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<title>Critic Markup Output</title>'''

head = '''<!DOCTYPE html>
<html>
<head>
<title>Critic Markup Output</title>'''

should become

jq = '''<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<title>Critic Markup Output</title>'''

head = '''<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Critic Markup Output</title>'''

teoric commented 9 years ago

This is not only a problem with Cyrillic text but with any text that contains non-ASCII characters, i.e. anything beyond plain English or classical Latin. It would be enough to replace open(args.source, "r") with codecs.open(args.source, "r", encoding="UTF-8"), or even to add an encoding parameter. This is a little less hacky than sys.setdefaultencoding('utf8').
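
For illustration, here is a minimal sketch of that idea, not the actual code of criticParser_CLI.py. The helper names read_text and write_text, the build_parser function, and the --encoding option are all hypothetical; the only thing taken from the script is that the input path is available as args.source. Since the traceback above also shows a UnicodeEncodeError on filesource.write(h), the sketch opens the output file with an explicit encoding as well.

```python
import argparse
import codecs


def build_parser():
    # Hypothetical command-line interface: the real script's options may differ.
    parser = argparse.ArgumentParser(description="Encoding-aware file handling (sketch)")
    parser.add_argument("source", help="path to the CriticMarkup source file")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding used for both the input and the output file")
    return parser


def read_text(path, encoding="utf-8"):
    # codecs.open decodes while reading, so the caller gets unicode text
    # instead of raw bytes that the 'ascii' codec would later choke on.
    with codecs.open(path, "r", encoding=encoding) as source_file:
        return source_file.read()


def write_text(path, text, encoding="utf-8"):
    # Encode explicitly on the way out, mirroring the read side.
    with codecs.open(path, "w", encoding=encoding) as out_file:
        out_file.write(text)
```

With helpers like these, the script would call read_text(args.source, args.encoding) instead of a bare open(...).read(), and write_text(...) instead of filesource.write(h). On Python 3 the built-in open accepts the same encoding argument, so codecs.open is only needed on Python 2.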