FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0
7 stars 6 forks source link

divs/spans break WYSIWYG in FireFox #406

Open btbonval opened 9 years ago

btbonval commented 9 years ago

Particular test case: /note/harvard/reason-and-faith-in-the-west/resumepdf-5-27-43979

That note allows me to open the editor, but it's all raw HTML in the editor. If I try to embolden some plain text, it just removes the editor window, refreshes the page, and nothing changed.

No errors on console besides TypeError: m.getComputedStyle(...) is null ... wysihtml5x-toolbar.min.js:5, but this error is relatively common on pages that work just fine. No errors on the server console.

This problem was originally noticed in #402 but I conflated a few different problems together in that ticket. I thought they might have all been resolved, but it appears not to be the case.

btbonval commented 9 years ago

Looking at the raw code for that link, it's a mess of empty divs and spans that are arbitrarily scattered throughout the text.

btbonval commented 9 years ago

Found another example in the database which has divs and cannot be edited. /note/Brown/management-of-industrial-and-nonprofit-organizations/finalsclub-hccg-reportpdf-5-20-137857

=# SELECT slug FROM notes_notemarkdown INNER JOIN notes_note ON notes_notemarkdown.note_id = notes_note.id WHERE notes_note.category = 'LECTURE_NOTES' AND notes_notemarkdown.html LIKE '%<div>%';
                 slug
---------------------------------------
 resumepdf-5-24-675181
 finalsclub-hccg-reportpdf-5-20-137857
 resumepdf-5-17-387450
 resumepdf-5-24-233477
 resumepdf-5-27-43979
(5 rows)

this might be unrelated to pdf2htmlex, since I don't see any particular markers about /* blahblah pdf2htmlEX */ which I see in some other HTML in the database.

btbonval commented 9 years ago

The upstream static file is full of random divs and spans. https://karma-beta.s3.amazonaws.com/html/finalsclub-hccg-reportpdf-5-20-137857.html

Given the new sanitizer and upload process, it will be worth trying to upload the PDF as a new note and see if the problem persists. If not, it might be good to download the upstream PDFs and run them through the usual upload process if that is feasible.

If that involves pushing to gdrive and pulling from gdrive and running through pdf2htmlex and sanitizing and sanitizing, that might not be worthwhile as a data migration.

At that point, it might be worthwhile to post process notes that contain span/div in the database after the data migration is completed.

btbonval commented 9 years ago

queue appears to have finished, but I can't upload any notes. Nothing seems to be happening with celery worker or beat when I upload a new note.

btbonval commented 9 years ago

alright, closed a bunch of my zombie AMQP connections and added the BROKER_POOL_LIMIT=0 as per #392 in my local environment and on the staging system (although beta's change will not take effect until it is restarted). That seems to have helped, some text notes I uploaded got processed finally. Still nothing from the PDF uploads.

btbonval commented 9 years ago

ah! but the plain text file that I just uploaded has divs and spans in it! Absolutely confirmed this is not due to pdf2htmlEX.

btbonval commented 9 years ago

Yeah, this uploaded text file with spans cannot be edited either. The WYSIWYG seems non functional, showing straight HTML.

It might be time to follow the work flow and see where those span tags came in.

btbonval commented 9 years ago

I can delete content in the straight HTML editor mode. I deleted all spans, but left <p>blah</p> in place. I saved the file and edited it again. The <p>s were no longer explicitly visible, but were rendered as they are supposed to be. I was able to edit the contents of the file and add bolding.

So the WYSIWYG is back to functional once I remove the <span> tags.

btbonval commented 9 years ago

I added the <span>...</span> tag around something and saved it. It escaped the span with &lt; and &gt;, which is good and expected.

I went to edit the note again, and the editor is back to plain/text mode, no WYSIWYG, and the buttons do not work.

btbonval commented 9 years ago

Removed <span>...</span> and editor is back to WYSIWYG.

Added <a href="place">text</a>;. Again, the tags are escaped. Went back to WYSIWYG. It's still WYSI, but <a href=" and </a> are gone from the WYG, while place">text; remains. Saving it, those escaped tags are totally gone.

This editor is not idempotent or else we are interfering with the results somehow.

btbonval commented 9 years ago

It's google docs that's adding a ton of spans and divs everywhere.

pdb in convert_raw_document (secret: run celery worker with --pool solo) let me interrogate some of the details.

[2015-02-24 01:58:38,277: WARNING/MainProcess] (Pdb) content_dict['html']
[2015-02-24 01:58:42,963: WARNING/MainProcess] '<html><head><title>PS3EYE4.txt</title><meta content="text/html; charset=UTF-8" http-equiv="content-type"><style type="text/css">ol{margin:0;padding:0}.c0{vertical-align:baseline;color:#000000;
...
h5{widows:2;padding-top:11pt;line-height:1.15;orphans:2;text-align:left;color:#000000;font-size:11pt;font-family:"Arial";font-weight:bold;padding-bottom:2pt;page-break-after:avoid}h6{widows:2;padding-top:10pt;line-height:1.15;orphans:2;text-align:left;color:#000000;font-size:10pt;font-family:"Arial";font-weight:bold;padding-bottom:2pt;page-break-after:avoid}</style></head><body class="c3"><p class="c1"><span class="c0">00: 640x480@15</span></p><p class="c1"><span class="c0">01: 640x480@30</span></p><p class="c1"><span class="c0">02: 640x480@40</span></p><p class="c1"><span class="c0">03: 640x480@50</span></p><p class="c1"><span class="c0">04: 640x480@60</span></p><p class="c1"><span class="c0">10: 320x240@30</span></p><p class="c1"><span class="c0">11: 320x240@40</span></p><p class="c1"><span class="c0">12: 320x240@50</span></p><p class="c1"><span class="c0">13: 320x240@60</span></p><p class="c1"><span class="c0">14: 320x240@75</span></p><p class="c1"><span class="c0">15: 320x240@100 (added 15/02/09 V0.3)</span></p><p class="c1"><span class="c0">16: 320x240@125 (added 15/02/09 V0.3)</span></p><p class="c1 c2"><span class="c0"></span></p><p class="c1"><span class="c0">sudo modprobe gspca_ov534 videomode=04</span></p><p class="c1 c2"><span class="c0"></span></p></body></html>'

This is just a simple text file, not even RTF, but Google Docs (which yields the content_dict above) is inserting a span inside each p.

btbonval commented 9 years ago

Alright I need to let this cook on the back burner.

Do we want to strip divs and spans? Will that mess everything up?

I swear I've searched the junk out of the WYSIHTML issues and found nothing about problems with span and div, in fact, spans are used for color according to one issue. Perhaps this is an upstream issue, or an integration issue?

btbonval commented 9 years ago

Also the <TITLE>title</TITLE> is getting concatenated on the front of the document. I'm pretty sure this is a direct result of the sanitizer stripping all tags around the title, but leaving the title text in place.

I kind of like this happy accident, but it should be wrapped with header formatting or something.

btbonval commented 9 years ago

Interestingly, WYSIHTML notes in its Readme that it supports

Graceful Degradation: Users with other browsers will see the textarea and are still able to write plain HTML by themselves.

This sounds almost exactly like the situation when a span or div is present, although that should have nothing to do with browser version.

btbonval commented 9 years ago

Toying with the example found at http://www.wysihtml.com/ . There's a button to switch to raw HTML. When using colors, they have spans with properties. I decided to remove properties from all of the spans so it was <span>text</span>. Switching back to WYSI mode, all the color was removed but everything was as it should be. I was able to add bolding without any issue.

Somehow our implementation of WYSIHTML is flawed, because it cannot handle spans in this way.

btbonval commented 9 years ago

There are different rulesets. This might somehow be a difference. We're using /parser_rules/advanced_and_extended.js.

It isn't clear what the front page http://www.wysihtml.com/ example is using.

http://voog.github.io/wysihtml/examples/simple.html is clearly using advanced_and_extended.js (from looking at network traffic). Switching to HTML view, I'm able to add <span>text</span> and switch back to WYSI mode, no problems.

I notice it says

Uses a custom rule set that allows the following elements: strong, b, em, i, a, span

I'm curious if that refers to the parser_rules (which are provided with the repo, so I don't imagine they are custom), or if there is some kind of switch that controls acceptable elements.

There's a span definition in the parser we're using. https://github.com/FinalsClub/karmaworld/blob/note-editing/karmaworld/apps/wysihtml5/static/wysihtml5/wysihtml-0.4.17/parser_rules/advanced_and_extended.js#L429-L443

I don't see anything wrong there.

btbonval commented 9 years ago

I injected a WYSI/HTML toggle button. I cannot use it to to HTML, as that action triggers a whole page refresh.

This is oddly familiar from when I see HTML but try to use buttons. Normally, in WYSIHTML's proper HTML view, the buttons are disabled. When I see the accidental HTML view due to a span, the buttons are enabled but using them causes a refresh (as if, perhaps, it is trying to switch from HTML to WYSI mode?).

Might be a red herring, but I'm getting the strong impression that this entire page reload thing is interfering with my troubleshooting.

btbonval commented 9 years ago

When I click Save, I get a 302 POST back to the page in the network traffic. That's expected of a submit button. The POST causes a page refresh since the URL has technically changed.

When I'm seeing HTML view, highlight something, and click B to embolden it, sure enough I see 302 POST back to the page in the network traffic. That's the cause of the page refresh. Why that causes a POST is ... what I have to deduce.

Then again it might be going down the way wrong path. The spans shouldn't cause HTML view in the first place.

btbonval commented 9 years ago

I noticed that textareaClone is undefined here when HTML view happens and the WYSI is broken: https://github.com/FinalsClub/karmaworld/blob/note-editing/karmaworld/templates/notes/note_base.html#L81

Unfortunately it seems to be undefined when WYSI is working. ugh. grasping at straws. Although it does beg the question why have it?

btbonval commented 9 years ago

I'm getting very odd behavior trying to drop breakpoints in the note_base code. Some breakpoints drop down a few lines when I click them. Any breakpoint added prior to line 81 highlighted above doesn't stop execution at all.

I think I'm going to need to switch to Chrome to poke at this better.

btbonval commented 9 years ago

Google Chrome 40.0.2214.111 on Mint Linux lets me debug all the points the way I would expect.

Yet, the same version does not break on spans. The WYSIWYG is working just fine in Chrome! Kind of. So here's the weird thing: <span> as typed into the WYSIWYG is escaped in the actual note display, but is treated as actual HTML in the editor when going back into it. Going back and forth enough times, the spans disappeared entirely.

FireFox 35.0.1 on Mint Linux does not show the WYSIWYG properly when spans are present. Yet the main website works fine on the same FireFox browser.

btbonval commented 9 years ago

Okay, strange interaction.

In Google, I enclose text with <span> in the WYSI. I reload the page in FireFox, where the spans are visible as escaped in the note. I edit the note, editor works just fine and the span is completely ignored and missing from the WYSI (as desired)! I save the note without changes in FireFox, and the escaped <span> is gone from the note view. Reload the page and click edit, and the <span> is gone from the editor as well. Totally gone.

In FireFox, I enclose text with <span> in the WYSI. I reload the page in FireFox, where the spans are visible as escaped in the note. I edit the note, editor shows HTML and is broken. I save the note (even though it is showing HTML and stuff), and the escaped <span> is gone from the note view. Reload the page and click the editor again, but the <span> is still there breaking the WYSI.

In Chrome, this whole process works start to finish.

The last of the quadfecta:

In FireFox, I enclose text with in the WYSI. I reload the page in Chrome, where the spans are visible as escaped in the note. I edit the note, editor works just fine and the span is completely ignored and missing from the WYSI (as desired)! I save the note without changes in Chrome, and the escaped is gone from the note view. Reload the page and click the editor again, but the is still there breaking the WYSI.

How can entering <span> in one browser or the other make such a huge difference?!

btbonval commented 9 years ago

The first test case in the original comment is also editable in Chrome without any issue.

btbonval commented 9 years ago

Maybe the database will reveal a difference?

No spans. break this stuff<br>...

FireFox added span. &lt;span&gt;break this stuff&lt;/span&gt;<br>...

Chrome added span. &lt;span&gt;break this stuff&lt;/span&gt;<br>...

So why is FireFox okay if Chrome adds the span but not FireFox?

btbonval commented 9 years ago

Maybe there's something being cached somehow in FireFox? If Chrome adds the span, FireFox would not have the cache. If FireFox adds it, it would have the cache.

Close all private windows. Start a new private window with a fresh cache. Login. Edit the note to add spans, save.

Close all private windows. Start a new private window with a fresh cache. Login. Edit the note. Broken WYSI, flat HTML showing.

Alright so it isn't caching.

btbonval commented 9 years ago

I am running Chrome on a separate computer, using a 10.x.x.x network address to access the site. Meanwhile I'm running FireFox on the same computer, using a 127.0.0.1 network address to access the site.

Let's try FireFox on my separate computer using 10.x.x.x. Separate computer is also running FF 35.0.1 on Linux Mint, so that is consistent. New private window. Login to site running at 10.x.x.x. Visiting page with <span> in place. Edit this note. Broken WYSI with flat HTML.

Not related to networking.

btbonval commented 9 years ago

FF 35.0.1 on Windows 8. WYSI is broken due to SPAN. Not limited to Linux FF 35.0.1.

Chrome 40.0.2214.94 on Windows 8. WYSI works as expected even with SPAN.

IE 11.0.9600.17498 on Windows 8. WYSI works as expected even with SPAN.

FireFox just updated to 36.0, so let's give that a go.

FF 36.0 on Windows 8. WYSI is broken due to SPAN.

Alright, so FF, either version 35.0.1 or 36.0 on either Windows or Linux, does not use the WYSIHTML properly if there is a span in the text on our site. However, the wysihtml.com example works just fine with divs and spans in FF.

This is going to be a very specific bug to try tracking down at this point. Need to check in with @AndrewMagliozzi to see if this worth spending any more time on.