Smileyt / python-markdown2

Automatically exported from code.google.com/p/python-markdown2

problems with bidi text? encodings etc. #3

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

[Trent, discussing markdown(1).py]
> Are there any test cases for that in the python-markdown repo? I'd
> appreciate a specific example. I'm relatively comfortable with unicode
> -- but haven't had any real experience with bidi text.

[Yuri]
> Yes, in tests/misc.  See japanese.txt, russian.txt, bidi.txt.

Take a look at those tests. Get the test suite to use those tests as well?

Original issue reported on code.google.com by tre...@gmail.com on 3 Nov 2007 at 6:04

GoogleCodeExporter commented 8 years ago
I've just encountered a problem with processing unicode strings but I'm not sure
whether to file a new issue or leave it here.

This example with Russian characters: 

    markdown(u'## Тест')

breaks with this traceback:

    Traceback (most recent call last):
      File "un.py", line 4, in <module>
        u = markdown(u'## Тест')
      File "markdown2.py", line 113, in markdown
        link_patterns=link_patterns).convert(text)
      File "markdown2.py", line 198, in convert
        text = self._run_block_gamut(text)
      File "markdown2.py", line 418, in _run_block_gamut
        text = self._hash_html_blocks(text)
      File "markdown2.py", line 300, in _hash_html_blocks
        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)
      File "markdown2.py", line 1204, in result
        return function(*args + rest, **combined)
      File "markdown2.py", line 264, in _hash_html_block_sub
        key = '!'+md5.md5(g1).hexdigest()+'!' # see _escape_hash() above
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-7: ordinal not in range(128)

Judging by the last line, `g1` is a unicode string that md5 tries to encode into
a byte string (using the ASCII codec, which is Python's default) and fails. To
avoid this, unicode strings should be explicitly encoded to UTF-8 wherever a byte
string is required. Maybe the best option would be to encode the entire unicode
input to UTF-8, process it, and then decode it back to unicode on output.
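
A minimal sketch of the explicit-encoding idea, written for Python 3, where `hashlib` replaces the old `md5` module and refuses text strings outright rather than attempting an implicit ASCII encode. The helper name `hash_key` is hypothetical and is not part of markdown2.py:

```python
import hashlib


def hash_key(text):
    """Return a '!<md5 hex>!' placeholder key for a string.

    Hypothetical helper, not markdown2.py's code: it only shows the
    text being encoded to UTF-8 bytes before it reaches md5.
    """
    if isinstance(text, str):
        text = text.encode("utf-8")  # explicit, never the default codec
    return "!" + hashlib.md5(text).hexdigest() + "!"


key = hash_key("## Тест")
print(len(key))  # 32 hex digits plus the two '!' delimiters
```

Encoding at the call site keeps the rest of the processing free to stay in one string type.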

Original comment by isagal...@gmail.com on 4 Nov 2007 at 10:38

GoogleCodeExporter commented 8 years ago
You can leave this on the same issue. I'll try to deal with unicode problems soon.

Original comment by tre...@gmail.com on 4 Nov 2007 at 5:02

GoogleCodeExporter commented 8 years ago
test case added in r80:

    cd test && ./test.py issue3

Original comment by tre...@gmail.com on 4 Nov 2007 at 10:15

GoogleCodeExporter commented 8 years ago
Fixed "Man..."'s issue (from comment 1) in r81. Leaving open to look at
markdown.py's unicode tests.

Original comment by tre...@gmail.com on 6 Nov 2007 at 5:24

GoogleCodeExporter commented 8 years ago
My gut tells me the right answer is to convert to unicode and do everything
there, rather than working in encoded text.

Original comment by tre...@gmail.com on 6 Nov 2007 at 7:14

GoogleCodeExporter commented 8 years ago
Revision 88: changed to converting to unicode and doing the work there. I believe
this is a better long-term solution for unicode issues in markdown2.py. I have
also added a TODO to try to find some more unicode edge-case problems.

Still leaving open to look at the markdown.py unicode test cases.
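
The "convert to unicode and do the work there" approach can be sketched roughly as follows, in Python 3 terms. This is an illustration under assumptions, not the code from revision 88: `to_text` and `convert` are hypothetical names, and the heading handling is a toy stand-in for real Markdown processing.

```python
def to_text(source, encoding="utf-8"):
    """Decode byte input once, at the boundary; pass text through as-is."""
    if isinstance(source, bytes):
        return source.decode(encoding)
    return source


def convert(source, encoding="utf-8"):
    """Hypothetical stand-in for a converter: bytes or text in, text out."""
    text = to_text(source, encoding)
    # Everything below sees only text, so no implicit codec is involved.
    if text.startswith("## "):
        return "<h2>" + text[3:].strip() + "</h2>\n"
    return "<p>" + text.strip() + "</p>\n"


html = convert("## Тест".encode("utf-8"))
# Encode only when writing the result out, e.g. to a file or a socket.
payload = html.encode("utf-8")
```

With this shape, byte strings exist only at the input and output edges, which is where an encoding is actually known.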

Original comment by tre...@gmail.com on 7 Nov 2007 at 10:43

GoogleCodeExporter commented 8 years ago
Damn, it appears that this tracker doesn't send you mail if you have only
commented on an issue, so I missed the whole conversation :-)

But anyway, I just updated to revision 93 and unicode input seems to work fine.
And unicode output is great. Thanks!

Original comment by isagal...@gmail.com on 10 Nov 2007 at 10:16

GoogleCodeExporter commented 8 years ago
This passes:

    markdown(u'## заголовок')

but this:

    markdown(u'## заголовок\n\n    :::python\n    # комментарий', extras=['code-color'])

breaks with this traceback:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "markdown2.py", line 128, in markdown
        link_patterns=link_patterns).convert(text)
      File "markdown2.py", line 228, in convert
        text = self._run_block_gamut(text)
      File "markdown2.py", line 558, in _run_block_gamut
        text = self._do_code_blocks(text)
      File "markdown2.py", line 1090, in _do_code_blocks
        return code_block_re.sub(self._code_block_sub, text)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 55: ordinal not in range(128)

After I comment out the "issue3 hack", all of these examples work properly:

    --- markdown2.py.orig   2007-12-03 11:28:55.000000000 +0300
    +++ markdown2.py        2008-01-22 15:06:02.000000000 +0300
    @@ -1067,7 +1067,7 @@
                     colored = self._color_with_pygments(codeblock, lexer)
                     # HACK for issue3: drop this when/if use unicode for all
                     # processing.
    -                colored = colored.encode("utf-8")
    +                #colored = colored.encode("utf-8")
                     return "\n\n%s\n\n" % colored

             codeblock = self._encode_code(codeblock)

markdown2 version 0.1a.dev_r97
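
One plausible reading of the traceback above, offered as a sketch rather than a diagnosis of markdown2.py itself: once the document is processed as unicode, a substitution callback that returns UTF-8 encoded bytes mixes the two string types. Python 2 attempts an implicit ASCII decode and fails on the first non-ASCII byte; Python 3, used below, rejects the mix with a `TypeError`. The pattern and the `color` helper are made up for illustration.

```python
import re

# Hypothetical pattern: any line indented by four spaces counts as code.
code_line_re = re.compile(r"^ {4}(.*)$", re.M)


def color(match, encode_result):
    """Stand-in for a highlighter; optionally re-encodes like the hack."""
    colored = "<code>" + match.group(1) + "</code>"
    if encode_result:
        return colored.encode("utf-8")  # bytes leak into text processing
    return colored


text = "## заголовок\n\n    # комментарий"

# Text in, text out: the substitution works.
ok = code_line_re.sub(lambda m: color(m, False), text)

# Bytes returned into a text substitution: Python 3 raises TypeError.
try:
    code_line_re.sub(lambda m: color(m, True), text)
    mixed_failed = False
except TypeError:
    mixed_failed = True
```

Dropping the re-encoding step, as the diff does, keeps the callback's return value in the same type as the text being substituted into.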

Original comment by vsevolod.balashov on 22 Jan 2008 at 11:59

GoogleCodeExporter commented 8 years ago
@vsevolod.balashov

Thank you very much. Test case added in r115, fixed in r116.

(Note: still leaving open to look at incorporating test cases mentioned in the
opening comment.)

Original comment by tre...@gmail.com on 23 Jan 2008 at 4:45