FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0
7 stars 6 forks source link

testing data migrations of editable notes feature #396

Closed btbonval closed 9 years ago

btbonval commented 9 years ago

data migration is written in commit 67834daea5a8a9dcd07573a226faad1ee01aee54 but as yet untested.

needs to be tested against beta's database.

btbonval commented 9 years ago

of 131 notes that lack markdown, 57 were found statically hosted, downloaded, and saved into the database. The other 74 failed to be found or converted.

It's worth checking some of these failures to see if they can be found on the beta S3 but are not loaded due to encoding errors in the name, or if they really are missing.

btbonval commented 9 years ago

Many slugs reported missing do not appear to be in the note html upload. Such is the staging system, a wash of random things that come and go and incoherence.

The slug chinese-bb-class-notes was reported missing in static, but I do see chinese-bb-class-notes-2-4-####### in static. The database shows both the former and the latter as separate entries. One is right after the other. It looks like the former never uploaded but the latter did. The markdown appears to be some blank lines, so the HTML was converted from that. Checking the hosted content on S3, this looks like one of the old errored pages that didn't properly render. There is still more content there (about 12 meaningless characters) which are not in the markdown and thus not converted to HTML.

If there is static content, should that override the markdown?

btbonval commented 9 years ago

For confirmation, I looked for how many rows in note_markdown that had non-zero/non-null HTML, but zero/null markdown. There were 57. That lines up with the number of static files that were downloaded.

I also looked at the length of the HTML, and it was ... nothing I would not expect. It'll be hard to confirm this without merging into the largely complete edit branch.

btbonval commented 9 years ago

The HTML coming from the static hosted notes should be sanitized using @yourcelf 's sanitization code prior to being inserted into the database. Need to update the data migration for that.

btbonval commented 9 years ago

This will be a good comparison for downloading from static. http://127.0.0.1:8080/note/harvard/cs-50-114/1st-lecture

Unfortunately #397 is preventing me from loading the URL at the moment.

yourcelf commented 9 years ago

A quick note: if the NoteMarkdown custom save method is used in the migration, sanitization will happen automatically. But if the migration is using South's pseudo-orm, that save method won't be called, and its logic will have to be duplicated in the migration.

btbonval commented 9 years ago

@yourcelf South's ORM is in full effect. I might be able to import the models and call the correct functions in clever and tricky ways. So long as they don't touch the database, everything should work as expected.

yourcelf commented 9 years ago

Rather than get too fancy with importing I'd just copy/paste code in this case. The relevant parts is rather simple:

    if self.note.is_editable():
        self.html = sanitizer.sanitize_html_to_editable(self.html)
    else:
        self.html = sanitizer.sanitize_html_preserve_formatting(self.html)

(will need to copy out definition of is_editable too).

Django does so much magic in loading the models that I would be leery of importing them within South without consequences. Migrations are also, at least in theory, supposed to be a historical record of changes -- and if we change the save model in the future, it'd be ideal for old migrations to still do things the old way. This is one of the few cases where I think duplication is better than DRY.

btbonval commented 9 years ago

I was meaning to call the code statically, but if the code is that simple, I agree copy/paste is better.

btbonval commented 9 years ago

*statically as in unbound methods being shoehorned into other objects as bound methods. very hackery.

btbonval commented 9 years ago

Testing against http://localhost:8080/note/harvard/cs-50-114/1st-lecture using the latest commit off the note-editing branch (4736b4b0994b07623ed5562090ddec9c8b3e2f4e).

There is html is in the note_markdown.html column.

# SELECT length(html) FROM notes_notemarkdown where note_id=1001; 
 length  
---------
 2409215
(1 row)

On beta's system, the link looks like this: http://karmanotes-beta.herokuapp.com/note/harvard/cs-50-114/1st-lecture

On my test system, the note appears blank. Browser complains that TypeError: dstFrame is null in the console, as part of note-detail.js line 53. Looking at that code, it is responsible for rendering the note. The console's version of note-detail is very different from the one in the code repo.

Ah, I bet I'm using old static content! I need to collectstatic!

btbonval commented 9 years ago

There seems to be a disconnect between where collectstatic sends data and where Django is serving it from, because I'm still seeing old static js code even in a fresh private browser window with the console enabled.

btbonval commented 9 years ago

I'm using beta's info, which is good because I'm basically testing what deployment will look like on the staging system.

Link serving the static file: http://d2ry5v6uki5dz0.cloudfront.net/static/js/note-detail.js

I'll have to check two things:

yourcelf commented 9 years ago

One thing to try: ask django to clear its cache. This ought to do that:

from django.core.cache import cache
cache.clear()
btbonval commented 9 years ago

I thought maybe it was just a delay from S3 to cloudfront, but the files are still not updated. Trying to clear the cache.

$ foreman run python manage.py shell
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from django.core.cache import cache
>>> cache.clear()
>>>

and I'll try another collectstatic.

btbonval commented 9 years ago

It probably wasn't the django cache, since that doesn't get in between cloudfront.net requests and the browser.

I'm looking at S3 and I see note-detail.js is modified Feb 15, while the other files are older. So it looks like collectstatic is pushing files to S3 successfully.

Time to learn about how CloudFront CDN works.

btbonval commented 9 years ago

CloudFront for beta has the correct S3 bucket as its origin, but it was last modified April 2014. I don't see any big button to pull from the origin. Problem is nailed down at least. Definitely need to learn this for updating prod later.

btbonval commented 9 years ago

Ah, the issue is with how caching is performed across the CDN. The assumption is that each file is unique and independent ... and permanent. In order to upload "new" files, you're supposed to upload a file with a unique name. Versioning file names is highly recommended.

According to this SO, there is an invalidation method. http://stackoverflow.com/questions/1086240/how-can-i-update-files-on-amazons-cdn-cloudfront

Need to POST an invalidation request to invalidate existing files. Unsure if they are then automagically re-discovered, or if there is a second step. http://docs.aws.amazon.com/AmazonCloudFront/latest/APIReference/CreateInvalidation.html

yourcelf commented 9 years ago

(The reason the cache might be implicated: though it doesn't get between the browser and CDN, compress uses the cache to decide which file to retrieve from the CDN. Cache clearing might be needed if the symptom is that the CDN receives the new files (named by the correct new md5 hash), but the page still renders requesting the old files (named by the old md5 hash))

btbonval commented 9 years ago

Watching the network transfer logs, the file name is "note-detail.js". In fact, every single file appears to be downloaded individually. In fact, on the s3 bucket, where collectstatic sends everything, there are only regular files and no compressed files.

I don't see any effect of compress in action (hence my ticket about it in #397)

yourcelf commented 9 years ago

Got it. Looks like compress is running with COMPRESS_ENABLED == False on staging then. I was looking at production for my responses there.

This could be either because DEBUG == False on staging, or because COMPRESS_ENABLED is explicitly set to False there.

btbonval commented 9 years ago

Dev boxes, which are designed to be very, very different from staging and production, do not use prod.py.

However, staging, production, and my VM are all running prod.py to mirror production as close as possible. So that variable is definitely enabled :(

btbonval commented 9 years ago

Just for confirmation, beta has DJANGO_SETTINGS_MODULE set to karmaworld.settings.prod. My VM also has that set inside .env which is read by foreman.

btbonval commented 9 years ago

Moving compress conversation back to ticket #397

I could try my hand at file invalidation in the meantime, but it will be a painstaking and manual process that cannot be automated in the future.

btbonval commented 9 years ago

While testing /note/harvard/cs-50-114/1st-lecture with note-editing branch but after the data migration put data in the database, I see invalid HTML. It is probably because the HTML has not been filtered or processed, but rather directly downloaded and crammed into the database.

...
  <div class="row">
    <div class="small-12 small-centered columns medium-12 large-12 body_copy">
      <div id="note-content-wrapper" class="note-text">

          <div class='note-html' id='note-html'>
            <!DOCTYPE html>
<!-- Created by pdf2htmlEX (https://github.com/coolwanglu/pdf2htmlex) -->
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pdf2htmlEX">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<style type="text/css">
/*! 
 * Base CSS for pdf2htmlEX
 * Copyright 2012,2013 Lu Wang <coolwanglu@gmail.com> 
 * https://github.com/coolwanglu/pdf2htmlEX/blob/master/share/LICENSE
 */#sidebar{position:absolute;top:0;left:0;bottom:0;width:250px;padding:0;margin:0;overflow:auto}#page-container{position:absolute;top:0;left:0;margin:0;padding:0;border:0}@media screen{#sidebar.opened+#page-container{left:250px}#page-container{bottom:0;right:0;overflow:auto}.loading-indicator{display:none}.loading-indicator.active{display:block;position:absolute;width:64px;height:64px;top:50%;left:50%;margin-top:-32px;margin-left:-32px}.loading-indicator img{position:absolute;top:0;left:0;bottom:0;right:0}}@media print{@page{margin:0}html{margin:0}body{margin:0;-webkit-print-color-adjust:exact}#sidebar{display:none}#page-container{width:auto;height:auto;overflow:visible;background-color:transparent}.d{display:none}}.pf{position:relative;background-color:white;overflow:hidden;margin:0;border:0}.pc{position:absolute;border:0;padding:0;margin:0;top:0;left:0;width:100%;height:100%;overflow:hidden;display:block;transform-origin:0 0;-ms-transform-origin:0 0;-webkit-transform-origin:0 0}.pc.opened{display:block}.bf{position:absolute;border:0;margin:0;top:0;bottom:0;width:100%;height:100%;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;user-select:none}.bi{position:absolute;border:0;margin:0;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;user-select:none}@media print{.pf{margin:0;box-shadow:none;page-break-after:always;page-break-inside:avoid}@-moz-document url-prefix(){.pf{overflow:visible;border:1px solid #fff}.pc{overflow:visible}}}.c{position:absolute;border:0;padding:0;margin:0;overflow:hidden;display:block}.t{position:absolute;white-space:pre;font-size:1px;transform-origin:0 100%;-ms-transform-origin:0 100%;-webkit-transform-origin:0 100%;unicode-bidi:bidi-override;-moz-font-feature-settings:"liga" 0}.@CSS_LINK_CN span{position:relative;vertical-align:baseline;display:inline-block;unicode-bidi:bidi-override}._{color:transparent;z-index:-1}::selection{background:rgba(127,255,255,0.4)}::-moz-selection{background:rgba(127,255,255,0.4)}.pi{display:none}.d{position:absolute;transform-origin:0 100%;-ms-transform-origin:0 100%;-webkit-transform-origin:0 100%}</style>
<style type="text/css">
...
btbonval commented 9 years ago

@yourcelf I wasn't sure when to do the PDF stuff, before or after the preserve formatting, or if preserve formatting does not run on PDFs at all.

Here's the path I took (constants are copy/pasted out of Note).

            # clean the html in a consistent way with note uploads as of the
            # time of this migration.
            if note.mimetype in PDF_MIMETYPES:
                # handle embedded images from pdf2htmlEX
                html = sanitizer.data_uris_to_s3(html)
            if note.category in EDITABLE_CATEGORIES:
                # make HTML editable
                html = sanitizer.sanitize_html_to_editable(html)
            else:
                # clean up HTML without concern for editing
                html = sanitizer.sanitize_html_preserve_formatting(html)
yourcelf commented 9 years ago

The end goal is to have every note model have an associated NoteMarkdown [sic] which contains its html (and no markdown).

The main pipelines are currently:

  1. Newly uploaded file: goes through task processing in gdrive.py.
  2. Edited note: the NoteMarkdown.save method is called which runs sanitizers depending on the note type. Looking over the code, it would make sense to move the sanitizer.data_uris_to_s3 into the save method so that it runs on user-supplied HTML as well as on user-supplied PDFs.

We're talking here about a 3rd, new pipeline, which is "migrate all old notes with static-hosted HTML to inlineable database-hosted HTML." So the decision pipeline for me would go something like this:

  1. Get the HTML. Does the note have an associated static html file? If yes, download it. If no, generate or retrieve HTML from pdf2htmlEX or google drive or whatever. (Basically, do what gdrive.convert_raw_document does). Create or retrieve the NoteMarkdown model for the note, and stuff the HTML into it.
  2. Sanitize. If we move the data_uris_to_s3 part into NoteMarkdown.save, sanitizing is just a matter of calling save on that model (or if it's happening in a migration, duplicating the logic in the save method).
btbonval commented 9 years ago

Interesting stats from the migration of the staging static hosted notes:

Migrated 57 notes and failed to migrate 74 notes.
Of good notes, 5 are editable and 52 are not.
Of not editable notes, 57 had run through pdf2htmlEX.

Either something is wrong with if note.mimetype in PDF_MIMETYPES, or all 57 successful notes had some kind of mimetype that indicated they were run through pdf2htmlEX. A number of failed notes do have pdf in the title (5 in the first 7 listed):

Failed list:
progderechos_y_libertades201213pdf
anonymous-apppdf
-4-1-948651
1st-week-notes
geneve_1564pdf-12-21-727610
equivalent-rule-2-of-4pdf-12-19-928568
final-finalsclub-foundation-annual-report-2013pdf
...

I'm curious what the breakdown of mimetypes is for the failed notes. I should probably find out.

btbonval commented 9 years ago

For some reason I assumed PDF notes would not be categorized as (editable) lecture notes, but in fact, 5 were. I'm not sure what to do about this situation. The output from pdf2htmlEX is not friendly for editing.

@yourcelf , @AndrewMagliozzi Any opinion on this? Should we change the category of "Lecture Notes" which were PDFs to something non-editable like study guide or other? Should we create a new category like "Lecture Notes (readonly)" ?

yourcelf commented 9 years ago

sanitizer.sanitize_html_to_editable should work adequately for pdf2htmlEX output. I'd change your code above to use note.is_editable() to branch sanitization type, rather than the mimetype.

btbonval commented 9 years ago

@yourcelf oh, so we can make pdf2htmlEX output editable? I misunderstood from our conversation about it. That simplifies the issue considerably!

If I understand correctly, data_uris_to_s3 is only relevant when a note has been processed with pdf2htmlEX, because otherwise we don't expect embedded base64 binary data. You seem to suggest adding it to Note.save(), which means it'd basically run on every note. I take it that means it can't hurt to run that filter on every note, regardless of its original mimetype?

Does the order of filtering matter? I'm still not clear if data_uris_to_s3 can be run before or after other filters, or if there are some caveats or gotchas. In theory, filters should be independent, so order should not matter.

I do branch on note.is_editable(). It's an independent if from the mimetype check. I should put a line space to make that more clear.

yourcelf commented 9 years ago

Yes, all of sanitizer.sanitize_html_to_editable, sanitizer.data_uris_to_s3, and sanitizer.sanitize_html_preserve_formatting should be idempotent; rerunning them on their own outputs should produce the same output. Running the data_uris_to_s3 won't alter anything that lacks data URIs.

The advantage of putting it in NoteMarkdown.save is that in the future we might be able to support "uploading" images via data-uris in the rich text editor, and we'll properly handle any data URIs thrown in by other document types -- not sure if Google Drive adds them to its HTML conversions of word docs etc., but I wouldn't be surprised if it did. Regardless, it won't hurt anything to run data_uris_to_s3 on them anyway.

Regarding ordering: if I remember right, both flavors of sanitize_html_... strip and discard data URIs. So if we want to keep them, data_uris_to_s3 needs to go first, and further sanitization second.

btbonval commented 9 years ago

@yourcelf Ah, good to know about idempotence and ordering. In that case, I think my code is right with a small change. I'll run data_uris_to_s3 regardless of mimetype, run it before the other filters (which it coincidentally already was), and continue switching on my copy pasta version of is_editable().

Great!

btbonval commented 9 years ago

Testing /note/harvard/cs-50-114/1st-lecture.

It's there. No javascript errors. Doesn't even look too bad.

Yikes, ton of CORS errors blocking static elements along these lines:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://karma-beta.s3.amazonaws.com/fonts/0a9c3323-5f9b-46df-baf3-56e21ea76a8d.woff. This can be fixed by moving the resource to the same domain or enabling CORS.

Need to find an editable migrated note (of the 5).

btbonval commented 9 years ago

Of the 5 editable notes on the staging system which were migrated from static notes, 4 are the same document. So two unique documents.

They are there. They did not translate very well, but that is to be expected. All the content does appear in place, and there are some images.

No console errors about CORS, but there is this error TypeError: m.getComputedStyle(...) is null ... wysihtml5x-toolbar.min.js:5.

The WYSIWYG editor is accessible, but it is showing raw HTML. There is extraneous spacing and empty tags. Removing some of this extraneous stuff works fine. Trying to add a bold tag to some plain text causes no particular errors on the console, but seems to reload the page rather than changing the text. This is true of both texts.

btbonval commented 9 years ago

Here's a quick view of the production database stats. Of 250 editable notes (category: Lecture Notes), only 6 are application/pdf (or anything else that pdf2htmlex would handle). I'm very curious about the 5 jpeg notes and how they would translate into an editable form. In large part, I think we don't have to worry about getting pdf2htmlex output editing perfectly.

=> SELECT mimetype, count(*) FROM notes_note WHERE category = 'LECTURE_NOTES' GROUP BY mimetype;
                                mimetype                                 | count 
-------------------------------------------------------------------------+-------
 text/plain                                                              |     4
 text/html                                                               |     7
 application/pdf                                                         |     6
 image/jpeg                                                              |     5
 application/vnd.openxmlformats-officedocument.wordprocessingml.document |   228
(5 rows)
btbonval commented 9 years ago

Same call on staging just to see how the distributions align. Looks like each type seen on production exists somewhere on staging. It'll be worth checking how these all react to being downloaded from static to consider whether the original filepicker or gdrive file should be downloaded instead, or if it works well enough as-is.

=# SELECT mimetype, count(*) FROM notes_note WHERE category = 'LECTURE_NOTES' GROUP BY mimetype;
                                mimetype                                 | count 
-------------------------------------------------------------------------+-------
 text/plain                                                              |     3
 application/pdf                                                         |     5
 image/jpeg                                                              |     3
 application/vnd.openxmlformats-officedocument.wordprocessingml.document |     3
 application/msword                                                      |    15
 image/png                                                               |     2
(6 rows)
btbonval commented 9 years ago
btbonval commented 9 years ago

Managed to pick editable versions of everything but image/jpeg. /note/adrian-dominican-montessori-teacher-education-institute/course-131/say-cheese was editable. All edit tests worked as desired.

Both image/jpeg examples show no images, they're trying to load a broken googleusercontent.com link. One of the image/png didn't even have a filename in the URL, it was just the domain. This is worth investigating to see if it affects production or if it is a result of these migrations.

uhm so I guess this ticket has migrated into testing the note editing data migrations, not just the static notes.

btbonval commented 9 years ago

Reverting the static files and code for the staging system so I can test the images to see how it handles them.

http://karmanotes-beta.herokuapp.com/note/harvard/reason-and-faith-in-the-west/imagejpg-4-23-317946 shows an image (and apparently displays some poor OCR). The correct URL for the image should be https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64-j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa2NoT_197hGQCBEXJtkDVVPe008M

but it is extremely truncated in the new code base version, leading to a bad file link.

btbonval commented 9 years ago

In the markdown version in the database, the URL looks like this:

![](https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64
 -j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa
 2NoT_197hGQCBEXJtkDVVPe008M)

 * * *

 FICTION

there is a line break in the URL which seems to align with the broken image link.

btbonval commented 9 years ago

Ah the line break doesn't bother markdown, but the conversion from markdown to HTML drops everything after the line break.

 <p><img src="https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64"></p> 
 <hr>
 <p>FICTION  </p> 
btbonval commented 9 years ago

Forked line break issue to #404 .

Data migrations are not yet ready for staging or production.

btbonval commented 9 years ago

Retesting all links above.

So everything looks good. Except the pdf link. That spawns #406 .

Since there is a true overlap between uploaded PDFs and Lecture Notes (which are editable), we need to ensure pdf2htmlex output can be edited with wysihtml. Right now something awkward is happening.

btbonval commented 9 years ago

Alright, it looks like the only stopper to finishing this ticket off was actually bug related to FF and our implementation of WYSIHTML, and nothing at all to do with migrations.

I think the data migrations look good so far. I'm ready to push this all to the staging system beta for testing by live users. However, there might be problems with users running FF when editing notes.

btbonval commented 9 years ago

woops no reason to keep this ticket open. data migrations look good. any further problems found by live users or whatever will be new tickets.