HTMLization from Markdown truncates URLs

btbonval commented 9 years ago

Following from https://github.com/FinalsClub/karmaworld/issues/396#issuecomment-75189633 and https://github.com/FinalsClub/karmaworld/issues/396#issuecomment-75189910

In markdown, an image might look like this (including line break):

![](https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64
 -j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa
 2NoT_197hGQCBEXJtkDVVPe008M)

 * * *

 FICTION

When converted to HTML, everything after the line break is truncated, leading to an invalid URL:

<p><img src="https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64"></p> 
 <hr>
 <p>FICTION  </p>

btbonval commented 9 years ago

markdown to HTML conversion happens at foremost here: https://github.com/FinalsClub/karmaworld/blob/note-editing/karmaworld/apps/notes/migrations/0020_markdown_to_html.py#L18

btbonval commented 9 years ago

It looks like markdown.markdown is causing the problem. It treats the linebreak as a space, which it then converts to the image title.

>>> testcase = """
... ![](https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64
...  -j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa
...  2NoT_197hGQCBEXJtkDVVPe008M)
... 
...  * * *
... 
...  FICTION
... """
>>> import markdown
>>> testcase1 = markdown.markdown(testcase)
>>> print testcase1
<p><img alt="" src="https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64" title="-j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa 2NoT_197hGQCBEXJtkDVVPe008M" /></p>
<hr />
<p>FICTION</p>

btbonval commented 9 years ago

Thankfully this problem is finite. I can search through all markdown for image tags on staging and production databases, review each case and to see how many use spacing or line breaks in the middle of the URL, and decide how to proceed from that. Might be easier said than done.

btbonval commented 9 years ago

31 notes have markdown URLs on the staging database.

=# SELECT count(note_id) FROM notes_notemarkdown WHERE markdown LIKE '%![](%';
 count 
-------
    31
(1 row)

=# SELECT count(note_id) FROM notes_notemarkdown WHERE markdown ~ '.*!\[.*\]\(.+\).*';
 count 
-------
    31
(1 row)

btbonval commented 9 years ago

Most of the notes start with ![](https://xx.googleusercontent.com/string.

Some notes start with ![](data:image/png;base64,string.

In both cases, string has a line break at an arbitrary line width of about 62 characters.

Three notes start with other content. Two had line-broken googleusercontent img URLs and no other img URLs scattered in the content. The third had strictly base64 encoded images scattered in the content.

Let's ensure production has a similar situation.

btbonval commented 9 years ago

Production is slightly different, but fewer cases total.

=> SELECT count(note_id) FROM notes_notemarkdown WHERE markdown LIKE '%![](%';
 count 
-------
    18
(1 row)

=> SELECT count(note_id) FROM notes_notemarkdown WHERE markdown ~ '.*!\[.*\]\(.+\).*';
 count 
-------
    24
(1 row)

btbonval commented 9 years ago

I don't see arbitrary line breaks at 62 characters.

I do see at least one case where the URL has a line break after the .com/ bit.

Nearly all cases have embedded images, rather than an image at the front of the markdown, which will take a little longer for me to parse by hand.

btbonval commented 9 years ago

Line breaks in the URLs happen at 80 characters on production. Some notes even have another 80 characters of white space before the line break, and then continue the URL on the new line. A few have line breaks in the title part ![titlepart](urlpart), where a title is actually supplied (along with the url having line breaks). Occasioanlly no spaces at all between one ![]() and the next ![](), don't know if that will cause problems.

Most URLs are googleusercontent URLs. A few with docs.google.com image links.

Found one example where the entire image URL is embedded with no line breaks. Just the one example, though. https://www.karmanotes.org/note/Brown/soc1315-macro-organizational-theory-organizations-in-social-context-728/2-open-systems-and-organizational-fieldsdocx

Need to test if markdown.markdown is fine with no spaces before ![]().

At any rate, the line breaks appear consistent and I see no utilization of whitespace to separate URL from title, which is the mistake that markdown.markdown keeps making.

btbonval commented 9 years ago

Colliding ![]() into each other without spacing appears fine.

>>> import markdown
>>> testtext = """
... some text
... text running into an image![](https://www.google.com/images/srpr/logo11w.png)
... 
... an image running into an image
... 
... ![](https://www.google.com/images/srpr/logo11w.png)![](https://www.google.com/images/srpr/logo11w.png)![](https://www.google.com/images/srpr/logo11w.png)
... """
>>> markdown.markdown(testtext)
u'<p>some text\ntext running into an image<img alt="" src="https://www.google.com/images/srpr/logo11w.png" /></p>\n<p>an image running into an image</p>\n<p><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /></p>'

prettified html:

<p>some text
text running into an image<img alt="" src="https://www.google.com/images/srpr/logo11w.png" /></p>
<p>an image running into an image</p>
<p><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /><img alt="" src="https://www.google.com/images/srpr/logo11w.png" /></p>

btbonval commented 9 years ago

Nothing obvious in markdown.markdown or markdown.Markdown doctext for relevant options that might handle this situation. So we should filter the markdown before sending to markdown.markdown to clean up dangling URLs.

Since the markdown is a single line string with \ns scattered around, re should suffice to find img URL links across line breaks using !\[.*\]\([^)]+\).

>>> print negative
![](something-cheese-this-way-comes)
>>> print testtext

![](big-long-string-that-is-not           
terminated-until-here)

>>> testtext
'\n![](big-long-string-that-is-not           \nterminated-until-here)\n'
>>> expr = re.compile('!\[.*\]\([^)]+\)')
>>> expr.findall(negative)
['![](something-cheese-this-way-comes)']
>>> expr.findall(testtext)
['![](big-long-string-that-is-not           \nterminated-until-here)']

btbonval commented 9 years ago

>>> whitespace = re.compile('[\s]+')
>>> whitespace.sub('', expr.findall(testtext)[0])
'big-long-string-that-is-notterminated-until-here'
>>> whitespace.sub('', expr.findall(negative)[0])
'something-cheese-this-way-comes'
>>> testtext = """
... ![](this-is-a-long-piece-of-te
... xt-that-will-never-end-unti
... l-the-end-of-time)
... """
>>> whitespace.sub('', expr.findall(testtext)[0])
'this-is-a-long-piece-of-text-that-will-never-end-until-the-end-of-time'

btbonval commented 9 years ago

This seems to work, with a little help from http://stackoverflow.com/questions/19205979/python-regular-expression-replace-withing-a-group

>>> expr = re.compile('!\[.*\]\(([^)]+)\)')
>>> whitespace = re.compile('[\s]+')
>>> def remove_whitespace(match):
...     return whitespace.sub('', match.group(0))
>>> testtext
'\nthis is some text\n\n![stuff](this-is-a-broken-\nline-of-URL-that-spans-acr\noss-multiple-lines)![more](\nyet-another-example-of-an\n-annoying-line-break)\n\n![thing](coolio)\n'
>>> print testtext

this is some text

![stuff](this-is-a-broken-
line-of-URL-that-spans-acr
oss-multiple-lines)![more](
yet-another-example-of-an
-annoying-line-break)

![thing](coolio)

>>> result = expr.sub(remove_whitespace, testtext)
>>> print result

this is some text

![stuff](this-is-a-broken-line-of-URL-that-spans-across-multiple-lines)![more](yet-another-example-of-an-annoying-line-break)

![thing](coolio)

>>>

btbonval commented 9 years ago

whoa, before I implement this, maybe I should remove the ! and apply it to all links, just in case?

I should probably find an explicit test case and ensure there is a problem first.

btbonval commented 9 years ago

staging:

=# SELECT count(*) FROM notes_notemarkdown WHERE markdown ~ '[^!]\[.+\]\(.+\)';
 count 
-------
     5
(1 row)

production:

=> SELECT count(*) FROM notes_notemarkdown WHERE markdown ~ '[^!]\[.+\]\(.+\)';
 count 
-------
    13
(1 row)

Should look some of these up, make sure there are no issues with them or that this operation will work as cleanly on them as images.

btbonval commented 9 years ago

Found an example link on this page: http://karmanotes-beta.herokuapp.com/note/Brown/management-of-industrial-and-nonprofit-organizations/koemei-us-college-education-report-12-19-342640 youtube-video-insight is broken into about four lines.

It renders fine, but will it blend? Appears not, includes \n into the HREF links.

>>> testtext = """
... stuff in a [this is a link with spaces](http://line.breaks
... /will/defeat/the/mot
... herland/until/we/ri
... se/against)
... 
... and [other links](too)
... """
>>> print markdown.markdown(testtext)
<p>stuff in a <a href="http://line.breaks
/will/defeat/the/mot
herland/until/we/ri
se/against">this is a link with spaces</a></p>
<p>and <a href="too">other links</a></p>

btbonval commented 9 years ago

removing ! from the expr regex, I notice that whitespaces are being removed not just from the URL matches but also from the link text. Only the URL is grouped.

>>> print expr.sub(remove_whitespace, testtext)

stuff in a [thisisalinkwithspaces](http://line.breaks/will/defeat/the/motherland/until/we/rise/against)

and [otherlinks](too)

>>>

btbonval commented 9 years ago

Treat both bunches of text as groups. In the replacement, only modify the URL. Piece together the whole tag. Prepending of ! won't change any of the behavior, so it should apply equally to URLs and images.

expr = re.compile('\[(.*)\]\(([^)]+)\)')
def remove_whitespace_from_url(match):
    parms = list(match.groups())
    parms[1] = whitespace.sub('', parms[1])
    return "[{0}]({1})".format(*parms)

btbonval commented 9 years ago

images with anchors links ... I don't care that much to write a full on recursive markdown processor. They will probably break, but I think they are minimal.

e.g. [ ![](img src) ](img a href)

btbonval commented 9 years ago

just to test images, links, and linked images. linked images fail but only moderately.

>>> testtext = """
... this is some text
... 
... this [is a link](http://what.the.heck/
... am/I/doing/to
... /myself)
... in poor form.
... 
... this is an image
... 
... ![](http://good.news/everyone
... /ive/invented/a/time/
... machine)
... 
... this is a linked image
... [ ![](http://good.news/everyone
... /ive/invented/a/time/
... machine)](http://what.the.heck/
... am/i/doing/to
... /myself)
... """
>>> print expr.sub(remove_whitespace_from_url, testtext)

this is some text

this [is a link](http://what.the.heck/am/I/doing/to/myself)
in poor form.

this is an image

![](http://good.news/everyone/ive/invented/a/time/machine)

this is a linked image
[ ![](http://good.news/everyone/ive/invented/a/time/machine)](http://what.the.heck/ 
am/i/doing/to
/myself)

>>> print markdown.markdown(expr.sub(remove_whitespace_from_url, testtext))
<p>this is some text</p>
<p>this <a href="http://what.the.heck/am/I/doing/to/myself">is a link</a>
in poor form.</p>
<p>this is an image</p>
<p><img alt="" src="http://good.news/everyone/ive/invented/a/time/machine" /></p>   
<p>this is a linked image
<a href="http://what.the.heck/
am/i/doing/to
/myself"> <img alt="" src="http://good.news/everyone/ive/invented/a/time/machine" /></a></p>
>>>

btbonval commented 9 years ago

in a linked image, the image src is completely well handled, so an image will appear. The link will have \ns in it, but testing in firefox with a quick little html file, it looks like those are ignored and the link works fine.

btbonval commented 9 years ago

Reverting data migrations in my local database.

dbo9ndp1gkos53=# DELETE FROM south_migrationhistory WHERE id IN (113,114);
DELETE 2
dbo9ndp1gkos53=# UPDATE notes_notemarkdown SET html=null;
UPDATE 956
dbo9ndp1gkos53=# DELETE FROM notes_notemarkdown WHERE LENGTH(COALESCE(markdown,'')) = 0 AND LENGTH(COALESCE(html,'')) = 0;
DELETE 57

btbonval commented 9 years ago

one particular note has \xa0 randomly tossed in, which is messing with re and causing errors. It seems to be used as spacing. I think replacing this with   might be a decent idea.

Boom! That's exactly what is meant to do. http://www.codetable.net/hex/a0

btbonval commented 9 years ago

The full error is UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). I wonder if I can switch this to using utf-8. I'm just not sure what "this" is. I assume re, but I'm not sure.

btbonval commented 9 years ago

The problem was in remove_whitespace_from_url. Python2 != Python3. Have to explicitly make the string unicode.

(Pdb) print 'Hi{0}'.format(u'\xa0')
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
(Pdb) print u'Hi{0}'.format(u'\xa0')
Hi

btbonval commented 9 years ago

Image works on this page /note/harvard/reason-and-faith-in-the-west/imagejpg-4-23-317946.

links are working on this page /note/harvard/metaphysical-poetry/april-6doc-3-14-9376, though it looks like markdown code is not being converted to <pre> tags. That's beyond my pay grade for this ticket.

Images work on this page /note/harvard/cs-50-114/1st-lecture, but links are not clickable. Might be a result of pdf2htmlEX being involved.

I can't find any examples in the staging database of a linked image. I thought I had earlier, but my SQLfoo is failing for now. They might or might not work.

Honestly, I think this looks good as is. A huge improvement.

FinalsClub / karmaworld

HTMLization from Markdown truncates URLs #404