Upload HTML directly to S3 bucket, do not dump in database #273

The S3 static buckets for beta and prod now have an 'html' directory in the base.

When we have HTML, we want to write it to S3 as a file (possibly named by hash) instead of storing it in the database. The database would then hold only the relative static path, in something like Note.static_relpath.
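A minimal sketch of the "name the file by hash" idea, assuming SHA-1 over the HTML bytes is good enough for a file name. The function name static_relpath_from_hash is made up for illustration; nothing like it exists in the codebase yet.

```python
import hashlib


def static_relpath_from_hash(html_bytes):
    """Return '<sha1 hex digest>.html' for the given HTML bytes.

    Hypothetical helper: the result is what would be stored in something
    like Note.static_relpath, relative to the bucket's html/ directory.
    """
    digest = hashlib.sha1(html_bytes).hexdigest()
    return '{0}.html'.format(digest)


relpath = static_relpath_from_hash(b'<html><body>HI FRIENDS!</body></html>')
```

Identical HTML always maps to the same name, so re-uploading an unchanged note would overwrite the same object rather than pile up copies.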

The Note detail template would then use an IFRAME pointing at {{ STATIC_URL }}html/{{Note.static_relpath}}.

We will also need to post-process the HTML already in Production and on the VM and push that out to S3.

This will be a one time deal rather than a recurring thing, so a quick script that doesn't need to hang around should suffice.

Did a quick search to see how out of fashion IFRAMEs are and found this question about IFRAMEs and SEO. Since SEO is a fairly recent concern, there is a good comment in here:!topic/webmasters/Y6DyIR7wLXg

Make sure there is an anchor link to IFRAME content on the page with the IFRAME. That sounds like good practice anyway, in case someone turns off IFRAMEs because they're so 1995.

sanitize_html parses HTML in-place on the model, i.e. it loads self.html and saves self.html. We probably want to change this into a filter.
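Roughly what "change this into a filter" could look like. This is a sketch only: the script-stripping regex is a stand-in transformation, not the project's real sanitizing rules, and the Note class here is a bare placeholder for the Django model.

```python
import re

# Stand-in transformation (assumption): drop <script> blocks.
_SCRIPT_RE = re.compile(r'<script\b.*?</script\s*>', re.IGNORECASE | re.DOTALL)


def filter_html(html):
    """Return a cleaned copy of html; the input string is left untouched."""
    return _SCRIPT_RE.sub('', html)


class Note(object):
    """Placeholder for the real model, holding only the html attribute."""

    def __init__(self, html):
        self.html = html

    def sanitize_html(self):
        """The old in-place behaviour, expressed through the filter."""
        self.html = filter_html(self.html)


raw = '<p>hi</p><script>alert(1)</script>'
clean = filter_html(raw)
```

The filter form means the cleaned HTML can go straight to an upload without ever being assigned to the model.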

We probably won't need to batch process HTML across notes, and if we do, the current function will need to be rewritten anyway. Should remove this:

beautifulsoup is part of the requirements. lxml does one thing, which is in sanitize_html. I have to rewrite sanitize_html to be a filter anyway, so if I replace lxml, the world will be a better, brighter place.

No need to store a URL for the HTML snippet. Note.slug is supposed to be unique. I'm adding unique, not-null to Document.slug which will inherit to Note.slug. The static S3 filename will be based on the Note slug.
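A sketch of deriving the S3 file name from the slug, assuming the files live under the html/ directory mentioned at the top. The helper name relative_s3_path and the HTML_PREFIX constant are illustrative, not existing code.

```python
HTML_PREFIX = 'html/'  # assumption: matches the 'html' directory in the bucket


def relative_s3_path(slug):
    """Build the bucket-relative path for a note's HTML from its slug."""
    if not slug:
        # slug is meant to be unique and not null, so refuse empty values
        raise ValueError('a non-empty slug is required')
    return '{0}{1}.html'.format(HTML_PREFIX, slug)


path = relative_s3_path('mit6_007s11_lec07pdf')
```

Because the slug is unique, two notes cannot collide on the same S3 path.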

was trying from import default_storage to write files.


>>> default_storage.bucket_acl

Our configs are for a read-only API interface, which means there won't be uploading? How the heck does collectstatic work if it can't actually write to the S3 bucket using the settings?

I am missing something key.

I was trying to create a file by simply opening it and writing to it, as per

That gives me an IOError even though open is set to write/create mode:

>>> somefile ='bryantestfile.html', 'w')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/files/", line 33, in open
    return self._open(name, mode)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/storages/backends/", line 177, in _open
    raise IOError('File does not exist: %s' % name)
IOError: File does not exist: bryantestfile.html

Maybe Django sets the default_storage to read-only mode for static hosting reasons, but switches it for collectstatic. Clearly the bucket has everything it needs:

>>> default_storage.bucket.get_acl()
<Policy: Andrew (owner) = FULL_CONTROL>

Nothing in docs about default_storage.acl. Nothing about ACL in django-storages. Only thing about ACL is in s3boto, but we can see that bucket ACL from s3boto is just fine.

guess who has two thumbs and has to read source code. this guy. nn/ \nn

>>> import storages.backends.s3boto
>>> protected_storage = storages.backends.s3boto.S3BotoStorage(acl='private')
>>> with'html/bryantest.html', 'w') as s3file:
...     s3file.write(html)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/files/", line 33, in open
    return self._open(name, mode)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/storages/backends/", line 177, in _open
    raise IOError('File does not exist: %s' % name)
IOError: File does not exist: html/bryantest.html
>>> protected_storage.acl
>>> protected_storage.bucket_acl
>>> protected_storage.bucket.get_acl()
<Policy: Andrew (owner) = FULL_CONTROL>


According to the above link, the acl comes from here:

'public-read' should still give the owner full control, but the allusers group gets read.

It would seem like a bad idea to change the S3 ACL from 'public-read'. Not sure how to access this S3boto stuff as the owner.

Files are called Keys in the raw s3boto bucket. e.g. default_storage.bucket.get_key('img/asc.gif'). new_key() creates a theoretical file on the S3 bucket.*() commands don't work, which would be nice for writing directly to the S3 file. Key.send_file() does work. Wrap up the HTML in a little StringIO file-like object and BAM, I just uploaded to S3.

Tested and confirmed. Ugly as junk.

>>> flo = StringIO(html)
>>> nk = default_storage.bucket.new_key('html/bryantest.html')
>>> nk.exists()
>>> nk.send_file(flo)
>>> nk.exists()
>>> with'html/bryantest.html', 'r') as s3file:
...     print

<a href="whaaaat">the</a>
<a href="test" target="_blank">
<a href="nope" target="werrird">wa</a>

Most of the code is written now. I tried to kick off a process to convert HTML in the database to files on S3, but failed:

(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python populate_s3
Traceback (most recent call last):
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/", line 42, in handle
    htmlflo = StringIO(note.html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue001' in position 10111407: ordinal not in range(128)

"The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called."

might as well pass the HTML into BeautifulSoup to see if it can read in the data and output it in consistent UTF-8.

liar liar pants on fire. It turns out BeautifulSoup does not output UTF-8 by default even though all the docs say it does. Gotta run soup.prettify("utf-8") and suddenly StringIO is pleased.
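Another way to sidestep the Unicode vs 8-bit mixing is to encode explicitly before wrapping the HTML in a file-like object. A minimal sketch, assuming UTF-8 is the encoding we want on S3. html_to_bytes_flo is a hypothetical helper name, and io.BytesIO is swapped in for StringIO so the code behaves the same on Python 2.7 and 3.

```python
from io import BytesIO


def html_to_bytes_flo(html):
    """Encode text as UTF-8 and wrap it in a seekable file-like object.

    Returns (file-like object, size in bytes). Byte strings are passed
    through unchanged on the assumption that they are already encoded.
    """
    if isinstance(html, bytes):
        data = html
    else:
        data = html.encode('utf-8')
    return BytesIO(data), len(data)


sample = u'<p>private use char: \ue001</p>'
flo, size = html_to_bytes_flo(sample)
```

The size comes back in bytes rather than characters, which is the number an upload would need.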

oh good. random disconnection errors or something. More or less exactly what I want to deal with right now.

(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python populate_s3
Processing html/mit6_007s11_lec07pdf.html
Traceback (most recent call last):
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/", line 48, in handle
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/", line 910, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/", line 872, in _mexe
    raise e
socket.error: [Errno 32] Broken pipe

well I guess I won't be running this overnight to process.

Can't test if anything worked until I get one Note onto S3 to see if my VM hosts it properly. Can't get one Note onto S3 because broken pipe.

Pushing WIP to origin as feature_html_on_s3 with commit HEAD 87bf8e2441fe35cbdac8cae713f8361557ab8275

rebased master into branch and ran tests.

... still running.

still running?

top says the CPU is mostly running SSHD and top. tests deadlocked?

Looks like the tests are stuck running Xvfb, which is in turn not running anything (although it should run firefox). Time to double check master still works.

vagrant@vagrant-ubuntu-precise-32:~$ ps ax | grep python
 3219 pts/1    S+     0:02 python test
 3286 pts/0    S+     0:00 grep --color=auto python
vagrant@vagrant-ubuntu-precise-32:~$ pstree -p | grep -C 3 3219
        |               |-{rsyslogd}(838)
        |               `-{rsyslogd}(839)
        |           `-sshd(2078)---sshd(2164)---bash(2165)-+-grep(3289)
        |                                                  `-pstree(3288)

Tests completed on master branch in ~4 minutes.

Something tripped up feature_html_on_s3 branch so that tests deadlock :( No backtraces to help.

python test -v 2 seems to be giving better output. Looks to be hung up on Evernote.

Test searching for a school by partial name ... ok
Test upload of an Evernote note ...

Same pstree as before with the dangling Xvfb. Definitely stuck here.

Code: calls

Only place I can imagine it hanging is on convert_raw_document?

The feature_html_on_s3 branch has no changes in the raw_document app.

Double ctrl-c got a super long backtrace!

Test upload of an Evernote note ... ^C^CTraceback (most recent call last):
  File "", line 14, in <module>
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/", line 255, in execute
    output = self.handle(*args, **options)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/south/management/commands/", line 8, in handle
    super(Command, self).handle(*args, **kwargs)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/commands/", line 89, in handle
    failures = test_runner.run_tests(test_labels)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django_nose/", line 155, in run_tests
    result = self.run_suite(nose_argv)
  File "/usr/lib/python2.7/unittest/", line 327, in run
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/", line 53, in testEvernoteConversion
    'mimetype': 'text/enml'})
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/", line 36, in doConversionForPost
    convert_raw_document(raw_document, user=user, session_key=session_key)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/", line 244, in convert_raw_document
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/s3/", line 727, in send_file

Ahh that'd certainly be unique to this branch. Hanging on direct upload to S3. The html folder on the appropriate S3 is empty. Guess I'll play with this feature a little more, it's still leaving cake on the toothpick.

note for later: It seems worth moving this one function for uploading to S3 from into Note.

Testing a PDF that renders to 2.87 MiB of HTML using (mostly) what would be performed right now. Upload seems to do zilch.

In [7]: rds = RawDocument.objects.all()
In [14]: fp_file = rds[1].get_file()
In [19]: html = pdf2html(
Preprocessing: 88/88
Working: 88/88
In [20]: len(html)
Out[20]: 3012503
In [21]: fhtml = notes[0].filter_html(html)
In [22]: len(fhtml)
Out[22]: 3365756
In [23]: filepath = notes[0].get_relative_s3_path()
In [24]: filepath
Out[24]: 'html/certificate-path-validation-testingpdf.html'
In [28]: fhtmlflo = StringIO(fhtml)
In [29]: newkey = default_storage.bucket.new_key(filepath)
In [30]: newkey.exists()
Out[30]: False
In [33]:
In [35]: def status_update(transmit, maximum): print "transferred {0} / {1}".format(transmit, maximum)
In [36]: newkey.send_file(fhtmlflo, cb=status_update)
transferred 0 / 0
transferred 0 / 0
transferred 0 / 0

Trying something a bit smaller actually uploads a bit, then fails.

In [37]: smallhtml = """
   ....: <html>
   ....: <body>
   ....: HI FRIENDS!
   ....: </body>
   ....: </html>
   ....: """

In [38]: smallhtmlflo = StringIO(smallhtml)

In [39]: len(smallhtml)
Out[39]: 43
In [40]: newkey.send_file(smallhtmlflo, cb=status_update)
transferred 0 / 0
transferred 43 / 0
S3ResponseError: S3ResponseError: 400 Bad Request

Sooo there's this "size" parameter. Maybe that'll make the denominator stop being 0?

In [42]:
In [43]: newkey.send_file(smallhtmlflo, cb=status_update, size=43)
transferred 0 / 43
transferred 43 / 43
In [44]: newkey.exists()
Out[44]: True
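Rather than hard-coding 43, the size could be measured from the file-like object itself. A sketch under that assumption, using only seek/tell. remaining_size is a made-up helper name, and BytesIO stands in for the StringIO used in the session so it runs on Python 2.7 and 3 alike.

```python
import os
from io import BytesIO


def remaining_size(flo):
    """Count the bytes between the current position and the end of flo."""
    start = flo.tell()
    flo.seek(0, os.SEEK_END)
    end = flo.tell()
    flo.seek(start)  # put the position back where we found it
    return end - start


small = BytesIO(b'\n<html>\n<body>\nHI FRIENDS!\n</body>\n</html>\n')
size = remaining_size(small)
```

That value would then be passed along as in the session above, e.g. newkey.send_file(small, cb=status_update, size=size). Note that a file-like object that has already been read to the end reports 0 remaining bytes until it is rewound with seek(0).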


Not sure why my first attempt worked without size:

Deleted file on S3. Trying again with big file, specifying size. Prints two updates and then hangs.

In [45]: newkey = default_storage.bucket.new_key(filepath)
In [46]: newkey.exists()
Out[46]: False
In [47]:
In [51]: newkey.send_file(fhtmlflo, cb=status_update, size=3365756)
transferred 0 / 3365756
transferred 0 / 3365756

Seems to be a problem with s3boto's send_file. Time to ask the interwebs.

btw this is where it hangs, writing to SSL:

/usr/lib/python2.7/ssl.pyc in send(self, data, flags)
    196             while True:
    197                 try:
--> 198                     v = self._sslobj.write(data)
btbonval commented 10 years ago

I can totally avoid File Like Objects!

In [56]: newkey.set_contents_from_string(fhtml, cb=status_update)
transferred 0 / 3365756
transferred 376832 / 3365756
transferred 753664 / 3365756
transferred 1130496 / 3365756
transferred 1507328 / 3365756
transferred 1884160 / 3365756
transferred 2260992 / 3365756
transferred 2637824 / 3365756
transferred 3014656 / 3365756
transferred 3365756 / 3365756
Out[56]: 3365756
In [57]: newkey.exists()
Out[57]: True

confirmed on s3! woop. that was pretty quick to upload.

Rewrote upload code to use set_contents_from_string. Moved upload code into Note. Replaced copy pasta in and to make use of the upload code in Note. commit 7b61d0712b486ec27c770c84b7e4ae016b6e7591

Running tests again.

a number of tests errored. It looks like the tests hung, but firefox is actively running at the moment. It's been 5 minutes. :/

karmaworld.apps.notes.models: ERROR: Error with IndexDen:
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/", line 131, in create_index
    raise TooManyIndexes(e.msg)
TooManyIndexes: "Too many indexes for this account"

also made a copy/paste mistake.

A few errors showing up, hanging on the firefox test as before.

This time, however, there are three HTML files on the S3!

The hanging thing bothers me. I'll have to use verbose output to see where that is happening.

Test upload of an Evernote note ... ok
Test upload of a file with a bogus mimetype ... ok

No files in S3 after these.

The later upload tests have files in S3 after they run.

Tests didn't hang using verbose output. How bizarre.

Test that doesn't make a slug ... ERROR
Search for a note within IndexDen ... ERROR
Test that the slug field is slugifying unicode Note.names ... ok
testCreateCourse (test_selenium.AddCourseTest) ... ok

This test appears moot now that slug is unique and not nullable.

ERROR: Test that doesn't make a slug
Traceback (most recent call last):
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/", line 85, in test_save_no_slug # re-save the note
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/", line 54, in execute
    return self.cursor.execute(query, args)
IntegrityError: null value in column "slug" violates not-null constraint

I'm guessing this is due to IndexDen not adding any more indices right now.

ERROR: test suite for <class 'karmaworld.apps.notes.tests.TestNoes'>
Traceback (most recent call last):
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/nose/", line 227, in run
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/nose/", line 350, in tearDown
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/", line 58, in tearDownClass
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/", line 38, in delete_index
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/", line 152, in delete_index
    _request('DELETE', self.__index_url)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/", line 457, in _request
    raise HttpException(response.status, response.body)
HttpException: HTTP 404: ["No index existed for the given name"]

Three failures from error, no true failures.

Time to check it by hand!

Removed obsolete null Note.slug test, down to 2 errors caused by IndexDen. Can't get much further than this for now.

uploaded objects to S3 do not give permission to open/download them.

Need to do what is in this comment:

These docs are about as helpful as a bag of wet socks. I guess there are uses for a bag of wet socks, but not many.

Here's what an Everyone Open/Download policy looks like in s3boto:

In [35]: policy.acl.grants[4].permission
Out[35]: u'READ'
In [36]: policy.acl.grants[4].display_name
In [37]: policy.acl.grants[4].type
Out[37]: u'Group'
In [38]: policy.acl.grants[4].uri
Out[38]: u''
In [39]: policy.acl.grants[4].id
In [42]: policy.acl.grants[4].__class__
Out[42]: boto.s3.acl.Grant

So to make that, it'd be something like

from boto.s3.acl import Grant
# once key exists
policy = newkey.get_acl()
policy.acl.add_grant(Grant(permission=u'READ', type=u'GROUP', uri=u''))

Permission attempt failed. No errors, but the permissions according to S3 do not include Everyone.

Time for guess and check.

I think the first problem is that changing the policy as noted above does not save that policy remotely. Probably need to call one of the newkey.set_*acl() commands.

In [12]: newkey.set_acl(policy)
S3ResponseError: S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>MalformedACLError</Code><Message>The XML you provided was not well-formed or did not validate against our published schema</Message><RequestId>3E57DBBC88D03C8E</RequestId><HostId>W1O4/vy8nDyXEhcgawGHyJrCFmGsaYpqwPcE5CwaLVWVXhuSfB/Suhq/6w0YFMSu</HostId></Error>

Here's a problem. Converting the permission into XML ignores the AllUsers URI.

In [23]: all_read.uri
Out[23]: u''
In [24]: all_read.to_xml()
Out[24]: u'<Grant><Grantee xmlns:xsi="" xsi:type="GROUP"><EmailAddress>None</EmailAddress></Grantee><Permission>READ</Permission></Grant>'

The type is "GROUP". Looking at the boto source code, the value is case sensitive and should be 'Group'.

I'm tempted to write a ticket over there, but it's probably one of those things where the standard for the XML or whatever is case sensitive, therefore the Python must be as well.

Here's what the grant XML should look like when it's correct vs what is being generated (identical):

In [48]: oldkey.get_xml_acl()
Out[48]: '...<Grant><Grantee xmlns:xsi="" xsi:type="Group"><URI></URI></Grantee><Permission>READ</Permission></Grant>...'
In [50]: all_read.to_xml()
Out[50]: u'<Grant><Grantee xmlns:xsi="" xsi:type="Group"><URI></URI></Grantee><Permission>READ</Permission></Grant>'

So the problem appears to be with boto's ability to generate either the ACL XML or the Policy XML in a way that satisfies S3.

As an experiment, let's just take the preexisting acl text and write it to the new key.

In [51]: newkey.set_xml_acl(oldkey.get_xml_acl())
In [52]:

Looks good on the S3 management page. I guess I'll just grab that raw XML and put that into the source code. :(

Fugly fugly fugly but it worked. That XML ACL is huge to be dropping in as a string, but boto is too messed up to do anything else I guess. I see the file on S3 with proper ACLs.
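A possibly less fugly alternative, untested against this bucket: boto 2's Key.set_acl also accepts a canned ACL name as a plain string, and 'public-read' is the canned ACL that gives the owner full control and everyone else read. The sketch below assumes that is the ACL we want. grant_public_read is a hypothetical helper, and RecordingKey is only a stand-in for boto.s3.key.Key so the example runs offline.

```python
def grant_public_read(key):
    """Apply the canned 'public-read' ACL to an already-uploaded key."""
    # With a real boto Key this sends the canned ACL to S3, so no
    # hand-built Grant objects or raw XML strings are involved.
    key.set_acl('public-read')


class RecordingKey(object):
    """Offline stand-in for boto.s3.key.Key; it only records the ACL."""

    def __init__(self):
        self.acl = None

    def set_acl(self, acl_str, headers=None):
        self.acl = acl_str


key = RecordingKey()
grant_public_read(key)
```

With the real thing this would be called right after the upload succeeds, e.g. grant_public_read(newkey).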

When viewing on the site, the URL asks if I want to download it, rather than showing it in the IFRAME.

Changed over to static S3 properly, and it still pops up a download question. It's an HTML file! Maybe the meta data is wrong?

Yup. Metadata problem. content-type: application/octet-stream

Gotta make sure these things all get uploaded with content-type as text/html.
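One way to do that, sketched under the assumption that the upload goes through boto 2's Key.set_contents_from_string, which takes a headers dict: pass Content-Type explicitly on every upload. upload_html is a hypothetical helper name, and RecordingKey is a stand-in for the real boto Key so this runs without touching S3.

```python
HTML_HEADERS = {'Content-Type': 'text/html'}


def upload_html(key, html_bytes, cb=None):
    """Upload HTML through key, forcing the text/html content type."""
    return key.set_contents_from_string(
        html_bytes,
        headers=dict(HTML_HEADERS),  # copy so callers cannot mutate ours
        cb=cb,
    )


class RecordingKey(object):
    """Offline stand-in for boto.s3.key.Key; it only records the call."""

    def __init__(self):
        self.calls = []

    def set_contents_from_string(self, data, headers=None, cb=None):
        self.calls.append((data, headers))
        if cb is not None:
            cb(len(data), len(data))  # mimic a final progress callback
        return len(data)


key = RecordingKey()
sent = upload_html(key, b'<html></html>')
```

Keeping the header inside the one upload helper means no caller can forget it.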

That fixes the problem, but it takes forever to download from S3! Also the one I'm looking at looks terrible.

DIEEEEEE BOTOOOOO!!!! (read as: boto.s3 doesn't do nothin with metadata!?)

In [5]: oldkey = default_storage.bucket.new_key('html/14_motor1pdf.html')
In [6]: oldkey.exists()
Out[6]: True
In [7]: oldkey.metadata
Out[7]: {}
In [8]: oldkey.get_metadata()
TypeError: get_metadata() takes exactly 2 arguments (1 given)
In [9]: oldkey.get_metadata('content-type')
In [10]: oldkey.get_metadata('Content-Type')
In [11]: help(oldkey.get_metadata)
Help on method get_metadata in module boto.s3.key:

get_metadata(self, name) method of boto.s3.key.Key instance
In [15]: oldkey.get_metadata(
In [16]:

btw there is absolutely content-type on every single object, but especially this one when I explicitly set.

Also tried the above with lookup instead of new_key, but I suspect they are exactly the same thing.

get_metadata is just a wrapper around metadata attribute.

Here's where it gets metadata, during open_read() (not during, of course!). not even a memoized fetching dict, just a dict.

I don't have enough middle fingers for this.

In [25]: oldkey.open_read()
In [26]: oldkey.metadata
Out[26]: {}
In [27]: oldkey.metadata.__class__
Out[27]: dict

So even if I /read/ the metadata, it'd just be a local cached dict that gets updated.

It doesn't push that stuff anywhere. ever.
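A hedged note on reading the content type back, based on my understanding of boto 2 rather than anything verified in this session: new_key() only builds a local Key object, while bucket.get_key(name) actually asks S3 about the object and fills in attributes such as content_type. The metadata dict is for user-defined metadata, which would explain it staying empty. A sketch of the lookup, with stored_content_type as a hypothetical helper and the Stub classes standing in for boto objects so it runs offline:

```python
def stored_content_type(bucket, name):
    """Return the Content-Type reported for name, or None if it is missing."""
    key = bucket.get_key(name)  # with real boto this queries S3
    if key is None:
        return None
    return key.content_type


class StubKey(object):
    """Offline stand-in for boto.s3.key.Key."""

    def __init__(self, content_type):
        self.content_type = content_type
        self.metadata = {}  # no user-defined metadata on this object


class StubBucket(object):
    """Offline stand-in for a boto bucket backed by a dict."""

    def __init__(self, objects):
        self._objects = objects

    def get_key(self, name):
        return self._objects.get(name)


bucket = StubBucket({'html/14_motor1pdf.html': StubKey('text/html')})
found = stored_content_type(bucket, 'html/14_motor1pdf.html')
missing = stored_content_type(bucket, 'html/nope.html')
```

Against the real bucket the call would be stored_content_type(default_storage.bucket, 'html/14_motor1pdf.html').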