Closed sethwoodworth closed 10 years ago
Let's also make sure we add a link to the original page so we maintain CC-BY-NC compliance.
On Jan 31, 2013, at 6:41 PM, Seth Woodworth notifications@github.com wrote:
Consider adding ten of the top courses with video from MIT OCW as well. Do this by hand, and add no more than 15 courses per network.
— Reply to this email directly or view it on GitHub.
CC-BY-NC licensing is now issue #97
Here is a link to the scraper for the MIT-OCW site: https://github.com/AndrewMagliozzi/mit-ocw-scraper (make sure to check out the MIT-notes branch)
"courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
"courseStub": "22-105-electromagnetic-interactions-fall-2005",
"courseTitle": "Electromagnetic Interactions",
"professor": "Prof. Jeffrey Freidberg",
"noteLinks": [
    {
        "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
        "fileName": "lecture1.pdf"
    },
parse out course info: year from courseLink, title, professor.
parse out all links as notes. parse note info: title from fileName, no email address, tags of mit-ocw and karma.
Also modify the database for handling licenses.
pass remote link to FilePicker. (figure that bit out)
how to convert filepicker results and shove them into database: https://github.com/FinalsClub/karmaworld/blob/c5af62fe0c2d14f2420f1eef0ab577b95f2e68d9/karmaworld/apps/document_upload/tests.py
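The parsing steps above could be sketched roughly like this. It is only a sketch in Python 3, assuming the year is always the trailing four digits of courseLink and that the note title is just fileName without its extension. The function names are made up for illustration and are not taken from the importer.

```python
import posixpath
import re


def parse_course_year(course_link):
    """Return the trailing four-digit year of an OCW course link, or None."""
    match = re.search(r"-(\d{4})/?$", course_link)
    return int(match.group(1)) if match else None


def note_title_from_filename(file_name):
    """Assumption: the note title is the file name minus its extension."""
    return posixpath.splitext(file_name)[0]


def parse_course(course):
    """Flatten one scraped course dict into course info plus its notes."""
    notes = [
        {
            "title": note_title_from_filename(note["fileName"]),
            "link": note["link"],
            # Tags named in the plan above; no email address is recorded.
            "tags": ["mit-ocw", "karma"],
        }
        for note in course.get("noteLinks", [])
    ]
    return {
        "title": course["courseTitle"],
        "professor": course["professor"],
        "year": parse_course_year(course["courseLink"]),
        "notes": notes,
    }
```

Links that do not end in a year come back with year set to None, so the caller can decide whether to skip or flag them.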
license handling of #97 is done in commit 34ea96f09d5c5c80748526171ad5d3c44aef0679
looks like there is no pythonic interface to FilePicker. Best answer seems to always be curl. http://stackoverflow.com/questions/14115280/store-files-to-filepicker-io-from-the-command-line
Might as well implement something with urllib or whatevs, grab the API key out of secrets, whatnot.
hrm. curl -F blah=@file will use multipart/form-data to upload files as though submitted through a form. This is recommended by the above Stack Overflow answer and by Filepicker's RESTful API docs:
https://developers.inkfilepicker.com/docs/web/#inkblob-store
However, when I upload files using requests with multipart/form-data, the MIME type returned by Filepicker is "multipart/form-data" rather than the MIME type of the actual file.
http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
I give up for now. No matter what I do, Filepicker says the file type is "multipart/form-data", yet I see no reason for this. Check back with fresh eyes.
commit to feature_ocw_upload in 3eb6d5eba963c7f30011ec330e9465f1670c5e95
only other thing I can think of is to pass in the byte array using dlresp.content instead of the file-like object of dlresp.raw, but that shouldn't change how the files parameter works for the requests POST (and thus should not affect the mimetype interpretation). worth a try tho.
this is the bit that won't seem to upload properly: https://github.com/FinalsClub/karmaworld/blob/3eb6d5eba963c7f30011ec330e9465f1670c5e95/karmaworld/apps/notes/management/commands/import_ocw_json.py#L95-L102
Is there an option to do a buffered download?
I also don't think we need to download each note to memory. We can simply pass the MIT link directly to filepicker then use the FP response for GDrive processing. Give me a buzz when you're up, Bryan. I'd like to help with this.
Have to take cat to vet shortly, but I'll be ready to take a look when I get back.
Good thought on uploading via link, but I didn't see how to do that via FP RESTful API docs. Should be possible.
I think you can just pass the URL instead of the local file path. Let's try it when you get back.
curl -X POST -d "url=palmzlib.sourceforge.net/images/pengbrew.png" "filepicker.io/api/store/S3?key=MY_API_KEY&path=/images/…"
aha, it's in the API.
curl -X POST -d url="https://www.inkfilepicker.com/static/img/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY
This is how you specify the URL to FP and let them download it.
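For the importer, the same call could be made from Python without shelling out to curl. This is a minimal sketch using only the Python 3 standard library. The endpoint and the url and key parameters come from the curl command above; the function name is hypothetical, and the request is only built here, not sent, because the response format is not shown in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example above.
FP_STORE_URL = "https://www.filepicker.io/api/store/S3"


def build_store_request(api_key, remote_url):
    """Build a POST equivalent to the curl call: form field url=<remote_url>."""
    query = urlencode({"key": api_key})
    body = urlencode({"url": remote_url}).encode("ascii")
    return Request(FP_STORE_URL + "?" + query, data=body, method="POST")


request = build_store_request(
    "MY_API_KEY", "https://www.inkfilepicker.com/static/img/watermark.png"
)
# Sending it would be: urllib.request.urlopen(request).read()
```

The real API key would come out of secrets, as mentioned earlier, rather than being hard-coded.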
Getting non-unique error from same course over different academic years.
DETAIL: Key (school_id, name, instructor_name)=(10464, Designing Your Life, Gabriella Jordan, Lauren Zander) already exists.
There is a unique constraint which does not include Academic Year but should.
However, there is no way to add Academic Year in the form. #253
Also we need to toss department into the import following completion of #236
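To illustrate the constraint problem, here is a small sketch using sqlite3 rather than the project's Postgres/Django setup. The table layout and the academic_year column name are assumptions for illustration only. In the Django model this would presumably mean adding the year field to unique_together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE course (
        id INTEGER PRIMARY KEY,
        school_id INTEGER NOT NULL,
        name TEXT NOT NULL,
        instructor_name TEXT NOT NULL,
        academic_year INTEGER NOT NULL,  -- hypothetical column name
        UNIQUE (school_id, name, instructor_name, academic_year)
    )
    """
)


def add_course(school_id, name, instructor_name, academic_year):
    """Insert a course; return False if it violates the unique constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO course "
                "(school_id, name, instructor_name, academic_year) "
                "VALUES (?, ?, ?, ?)",
                (school_id, name, instructor_name, academic_year),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With the year in the constraint, the same course taught in two different years inserts cleanly, while a true duplicate within one year is still rejected.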
Notes are duplicating. It appears Django is deciding to insert instead of update. One note has license and upstream_link set, the other does not. There is a single call of gdrive's convert_raw_document over a single RawDocument object.
RawDocument is updated in convert_raw_document. Note only has save called once, excepting possibly the call to sanitize_html or some other Note method which might do its own save.
RawDocument.save calls celery to run convert_raw_document via process_raw_document.
So celery does it one time and the conversion code does it one time.
remove "year" from the create_or_get statement so that it grabs the correct course agnostic of year.
VM is sucking in courses.
Start new VM from scratch, suck in ALL notes.
If that works, move to beta.
Upload to VM one time. If everything works well, switch over to using dump_json and restore_json to bring the VM notes over to beta.
Before testing VM, complete Professor stuff in #235. Email addresses can be added later.
Script is updated with professor stuff in #235 and department stuff in #236.
Script is running through all JSON on VM as we speak.
If successful, the script should be all set and this ticket can be closed.
That's a lot of notes!
$ grep '"link":' *.json | wc -l
24415
$ grep '"link":' *.json | uniq | wc -l
24289
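One caveat on the second count: uniq only collapses adjacent identical lines, and grep prefixes each match with its file name when given several files, so 24289 is likely an upper bound on the number of distinct links. A rough Python 3 sketch for counting them directly is below. It assumes nothing about the top-level JSON layout and simply collects every value stored under a "link" key; the function names are made up.

```python
import glob
import json
import os


def iter_links(node):
    """Yield every string stored under a "link" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "link" and isinstance(value, str):
                yield value
            else:
                yield from iter_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_links(item)


def count_links(directory):
    """Return (total, unique) note link counts across *.json in directory."""
    links = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with open(path, encoding="utf-8") as handle:
            links.extend(iter_links(json.load(handle)))
    return len(links), len(set(links))
```

Calling count_links(".") in the scraper's output directory would give both numbers in one pass.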
There is concern about:
so yeah. let's not run all these.
Running these departments' notes: BCS, EECS, Math. We'll see where that gets us.
Got to here and errored:
Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-59j-psycholinguistics-spring-2005/lecture-notes/0407_speech_1.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 155, in handle
convert_raw_document(dbnote)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 202, in convert_raw_document
file_dict = upload_to_gdrive(service, media, filename, mimetype=mimetype)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 152, in upload_to_gdrive
convert=True, ocr=ocr).execute()
...
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/apiclient/http.py", line 816, in _process_response
raise HttpError(resp, content, uri=self.uri)
apiclient.errors.HttpError: <HttpError 500 when requesting https://www.googleapis.com/upload/drive/v2/files?uploadType=resumable&convert=true&ocr=true&alt=json returned "Internal Error">
Google Drive returned Error 500. That's not a good sign, and also nothing I can do about it.
Meh. Guess Google wanted a break. Kicked on the script and it picked up where it left off. No problems for now.
Looks like Google Drive is having a hard time chewing this doc:
Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_23_2004_fin.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 155, in handle
convert_raw_document(dbnote)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 217, in convert_raw_document
note.html = pdf2html(original_content)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 83, in pdf2html
raise ValueError("PDF file could not be processed")
ValueError: PDF file could not be processed
This problem repeats. Going to the site, the PDF appears to load just fine. There's some problem with pdf2html around line 83.
https://github.com/FinalsClub/karmaworld/blob/52982fda8ac88654ac75c5759a09c0f67a7aa9cd/karmaworld/apps/notes/gdrive.py#L83
Removed that note from the JSON file and kicked the process off. It'll keep working through BCS notes in the meantime.
Another failure on http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf.
I think I need to write some code in the import thing that allows it to log problems like this to file (for later review) but continues to run.
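Something along these lines might do it. This is a sketch, not the code that ended up in the importer: import_notes, process_note and the log format are all hypothetical, and process_note stands in for the upload and convert step.

```python
import traceback


def import_notes(note_links, process_note, log_path):
    """Process every link; log failures to log_path and keep going.

    Returns the list of links that failed, for later review.
    """
    failed = []
    with open(log_path, "a", encoding="utf-8") as log:
        for link in note_links:
            try:
                process_note(link)
            except Exception:
                # Record the link and full traceback, then move on.
                failed.append(link)
                log.write("FAILED %s\n%s\n" % (link, traceback.format_exc()))
                log.flush()
    return failed
```

Catching Exception this broadly is deliberate for a long batch run, since one bad PDF should not stop the other notes. It would not catch a crash of an external process unless the wrapper around that process raises.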
Just saw a pdf2html error: *** glibc detected *** pdf2htmlEX: double free or corruption (out): 0xb75a2008 ***
Error recovery seems to be working. If a note fails to convert, it is removed from the database instead of being left with empty html/text fields. commit in 732f89f98b3126f81a0d79b48d5009b8edd48d16
Since some notes failed to convert before deletion was added to the code, notes are tested for being partial as they are parsed in JSON. If a partial note is found, it is deleted and then reprocessed as though it hadn't been there at all. This means partial notes are removed and then convert is run on the rawdocument all over again.
I think gdrive_url is being cached. We should make use of this somewhere to prevent uploading the same file many times.
Finished BCS notes. That took forever. ~439 notes. PDF errors:
EECS is now parsing in my VM.
Strange those notes errored. There doesn't seem to be anything unusual about them. If pdf2html fails, do we roll back to PDF.js for display?
On Jan 8, 2014, at 4:11 AM, Bryan Bonvallet notifications@github.com wrote:
Finished BCS notes. That took forever. ~439 notes. PDF errors:
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec2_vvp_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_23_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_28_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec6_attn.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec9_pattern.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-98-neuropharmacology-january-iap-2009/lecture-notes/lecture_2.pdf
http://ocw.mit.edu/courses/health-sciences-and-technology/hst-722j-brain-mechanisms-for-hearing-and-speech-fall-2005/lecture-notes/7_melcher_handot.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_28_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-916-special-topics-social-animals-fall-2009/lecture-notes/MIT9_916F09_lec04.pdf
Not as far as I know. The process of converting a RawDocument to a Note includes the full conversion process. I don't know how to make PDF.js work with the current code. Not a bad idea if it is doable.
New error happened overnight while processing EECS, appears to be from IndexDen:
Uploading link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-001-structure-and-interpretation-of-computer-programs-spring-2005/lecture-notes/lecture9webhand.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 257, in note_save_receiver
index.update_note(note, note.old_instance)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/search.py", line 117, in update_note
self.index.add_document(new_note.id, SearchIndex._note_to_dict(new_note), variables={0: new_note.thanks})
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 179, in add_document
_request('PUT', self.__docs_url(), data=data)
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 457, in _request
raise HttpException(response.status, response.body)
indextank.client.HttpException: HTTP 500: Incorrect api call
Here's a new error due to some DB problem:
Uploading link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2008/lecture-notes/MIT6_005f08_lec03.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 170, in handle
convert_raw_document(dbnote)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 235, in convert_raw_document
note.save()
File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 218, in save
super(Note, self).save(*args, **kwargs)
File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 124, in save
super(Document, self).save(*args, **kwargs)
...
File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 52, in execute
return self.cursor.execute(query, args)
django.db.utils.DatabaseError: invalid byte sequence for encoding "UTF8": 0x93
Not sure if it's relevant, but these errors seem to be coming from PowerPoint documents that have been converted to PDF...
On Wed, Jan 8, 2014 at 2:56 PM, Bryan Bonvallet notifications@github.com wrote:
Another UTF8 error on http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2008/lecture-notes/MIT6_005f08_lec08.pdf
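One possible explanation, offered as a guess rather than a diagnosis: 0x93 is the left curly double quote in Windows-1252, which would fit text coming out of PowerPoint, and a lone 0x93 is not valid UTF-8. A minimal Python 3 sketch of a fallback decode is below. The function name is hypothetical, and whether the converted text really reaches the database as raw bytes at this point is an assumption.

```python
def to_text(raw):
    """Decode bytes as UTF-8, falling back to Windows-1252 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined, so replace those
        # instead of raising a second error.
        return raw.decode("cp1252", errors="replace")
```

The returned str can then be stored as UTF-8, which is what the "UTF8" database encoding in the error expects.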
Alright, got 556 RawDocuments and 556 Notes in the database. Also 46 distinct professor fields (not to say distinct professors) and 94 courses.
Time to turn over to #89 to move these objects between systems.
I think this ticket is probably worthy of being closed. There are some outstanding comments which should be addressed, but they might be better placed in their own tickets (with a lesser priority).
If we close this ticket, let's make another for "figure out how to upload the rest of the MIT OCW content"
Actually, while looking over some of this stuff as it's converted to JSON, I see some department foreign keys are missing. Investigating why they aren't there in some cases.
Turns out there are only 4 courses of 94 that came from EECS (as opposed to BCS). Maybe it would be worth running more EECS notes.
Yes. Let's run more notes.
I ran another 10 or so notes before another error occurred, but it was still the same course in EECS. Man that course must have some really spotty notes ;)
So now it's 560 notes across the same 94 courses. I'm going to refocus efforts on dump/restore. We can run more notes later.
This is now held up by #273 but once that ticket is done, this process of moving notes should be super slick.
This line will need to be changed: https://github.com/FinalsClub/karmaworld/blob/feature_html_on_s3/karmaworld/apps/notes/management/commands/import_ocw_json.py#L126
Unsure if I want to edit this stuff in the feature_html_on_s3 branch...
For now, let's try some workspace changes as a proof of concept:
KarmaNotes is using CC-by on all pages.
inherit OCW CC-by-nc onto OCW pages for both course and note.
possibly create a license table. There'd be two entries to start: index 0 = CC-by, 1 = CC-by-nc. Add license FK into course and note models to license.
Default = 0 for KarmaNotes.
Importing from OCW will explicitly set license to 1.
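The license table idea above could look roughly like this. The sketch uses sqlite3 instead of Django models so it runs on its own; table and column names are assumptions, and only the note table is shown since course would get the same license column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE license (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    INSERT INTO license (id, name) VALUES (0, 'CC-by');
    INSERT INTO license (id, name) VALUES (1, 'CC-by-nc');
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        -- Default 0 = CC-by for ordinary KarmaNotes uploads.
        license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license (id)
    );
    """
)


def add_note(title, license_id=None):
    """Insert a note; omit license_id to get the CC-by default."""
    if license_id is None:
        cur = conn.execute("INSERT INTO note (title) VALUES (?)", (title,))
    else:
        cur = conn.execute(
            "INSERT INTO note (title, license_id) VALUES (?, ?)",
            (title, license_id),
        )
    return cur.lastrowid


def license_name(note_id):
    """Look up the license name for a note through the foreign key."""
    row = conn.execute(
        "SELECT license.name FROM note "
        "JOIN license ON license.id = note.license_id WHERE note.id = ?",
        (note_id,),
    ).fetchone()
    return row[0]
```

An OCW import would call add_note(title, 1) explicitly, while everything else relies on the default.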