FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0

Import MIT Notes #68

Closed sethwoodworth closed 10 years ago

sethwoodworth commented 11 years ago

KarmaNotes is using CC-by on all pages.

inherit OCW CC-by-nc onto OCW pages for both course and note.

possibly create a license table. There'd be two entries to start: index 0 = CC-by, 1 = CC-by-nc. Add a license FK to the course and note models, pointing at that table.

Default = 0 for KarmaNotes.

Importing from OCW will explicitly set license to 1.
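
A minimal sketch of the license table idea, written against an in-memory SQLite database rather than the project's Django models. The table and column names are assumptions made for illustration; only the two license entries and the default of 0 come from the notes above.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not karmaworld's actual models.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE license (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    INSERT INTO license (id, name) VALUES (0, 'CC-by'), (1, 'CC-by-nc');
    CREATE TABLE course (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license(id)
    );
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        course_id INTEGER NOT NULL REFERENCES course(id),
        name TEXT NOT NULL,
        license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license(id)
    );
""")

# A native course keeps the default license; an OCW import sets license 1 explicitly.
conn.execute("INSERT INTO course (name) VALUES ('Native course')")
conn.execute("INSERT INTO course (name, license_id) VALUES ('OCW course', 1)")

rows = conn.execute(
    "SELECT course.name, license.name FROM course "
    "JOIN license ON license.id = course.license_id ORDER BY course.id"
).fetchall()
```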

AndrewMagliozzi commented 11 years ago

Let's also make sure we add a link to the original page so we maintain CC-BY-NC compliance.

On Jan 31, 2013, at 6:41 PM, Seth Woodworth notifications@github.com wrote:

Consider adding ten of the top courses with video from MIT OCW as well. Do this by hand and add no more than 15 courses per network.

sethwoodworth commented 11 years ago

CC-BY-NC licensing is now issue #97

AndrewMagliozzi commented 10 years ago

Here is a link to the scraper for the MIT-OCW site: https://github.com/AndrewMagliozzi/mit-ocw-scraper (make sure to check out the MIT-notes branch)

btbonval commented 10 years ago

      "courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
      "courseStub": "22-105-electromagnetic-interactions-fall-2005",
      "courseTitle": "Electromagnetic Interactions",
      "professor": "Prof. Jeffrey Freidberg",
      "noteLinks": [
        {
          "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
          "fileName": "lecture1.pdf"
        },

parse out course info: year from courseLink, title, professor.

parse out all links as notes. parse note info: title from fileName, no email address, tags of mit-ocw and karma.
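
A rough sketch of that parsing, assuming each scraped course entry looks like the excerpt above and that the course link ends in a four-digit year. The function name `parse_course` and the shape of the returned dict are made up for illustration.

```python
import re

def parse_course(entry):
    """Pull course fields and note records out of one scraped course entry."""
    # Assumption: the course link ends in "-<term>-<4-digit year>".
    match = re.search(r"(\d{4})/?$", entry["courseLink"])
    year = int(match.group(1)) if match else None
    notes = [
        {
            # Title is the file name without its extension.
            "title": link["fileName"].rsplit(".", 1)[0],
            "link": link["link"],
            "tags": ["mit-ocw", "karma"],
        }
        for link in entry.get("noteLinks", [])
    ]
    return {
        "title": entry["courseTitle"],
        "professor": entry["professor"],
        "year": year,
        "notes": notes,
    }

sample = {
    "courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
    "courseStub": "22-105-electromagnetic-interactions-fall-2005",
    "courseTitle": "Electromagnetic Interactions",
    "professor": "Prof. Jeffrey Freidberg",
    "noteLinks": [
        {
            "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
            "fileName": "lecture1.pdf",
        }
    ],
}
course = parse_course(sample)
```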

btbonval commented 10 years ago

Also modify the database for handling licenses.

btbonval commented 10 years ago

pass remote link to FilePicker. (figure that bit out)

how to convert filepicker results and shove them into database: https://github.com/FinalsClub/karmaworld/blob/c5af62fe0c2d14f2420f1eef0ab577b95f2e68d9/karmaworld/apps/document_upload/tests.py

btbonval commented 10 years ago

license handling of #97 is done in commit 34ea96f09d5c5c80748526171ad5d3c44aef0679

btbonval commented 10 years ago

looks like there is no pythonic interface to FilePicker. Best answer seems to always be curl. http://stackoverflow.com/questions/14115280/store-files-to-filepicker-io-from-the-command-line

Might as well implement something with urllib or whatevs, grab the API key out of secrets, whatnot.

btbonval commented 10 years ago

hrm. curl -F blah=@file will use multipart/form-data to upload files as though submitted through a form. This is recommended by the above stackoverflow and on Filepicker's RESTful API: https://developers.inkfilepicker.com/docs/web/#inkblob-store

However, when I upload files using requests multipart/form-data, the MIME type returned by Filepicker is "multipart/form-data" rather than the MIME type of the actual file. http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file

btbonval commented 10 years ago

I give up for now. No matter what I do, Filepicker says the file type is "multipart/form-data", yet I see no reason for this. Check back with fresh eyes.

commit to feature_ocw_upload in 3eb6d5eba963c7f30011ec330e9465f1670c5e95

only other thing I can think of is to pass in the byte array using dlresp.content instead of the file-like object of dlresp.raw, but that shouldn't change how the files parameter works for the requests POST (and thus should not affect the mimetype interpretation). worth a try tho.
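
One more thing that might be worth trying: requests accepts a `(filename, fileobj, content_type)` tuple in `files`, which sets the Content-Type of that individual part. The sketch below only builds that parameter. The form field name `fileUpload` is an assumption, and whether Filepicker actually honors the per-part type is untested here.

```python
import mimetypes

def build_files_param(filename, fileobj, field="fileUpload"):
    """Return a requests-style `files` dict with an explicit per-part content type."""
    # Guess the type from the file name; fall back to a generic binary type.
    mimetype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return {field: (filename, fileobj, mimetype)}

# Usage (not executed here; needs the requests package and a real API key):
#   requests.post(store_url, files=build_files_param("lecture1.pdf", dlresp.raw))
files = build_files_param("lecture1.pdf", None)
```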

btbonval commented 10 years ago

this is the bit that won't seem to upload properly: https://github.com/FinalsClub/karmaworld/blob/3eb6d5eba963c7f30011ec330e9465f1670c5e95/karmaworld/apps/notes/management/commands/import_ocw_json.py#L95-L102

AndrewMagliozzi commented 10 years ago

Is there an option to do a buffered download?

AndrewMagliozzi commented 10 years ago

I also don't think we need to download each note to memory. We can simply pass the MIT link directly to filepicker then use the FP response for GDrive processing. Give me a buzz when you're up, Bryan. I'd like to help with this.

btbonval commented 10 years ago

Have to take cat to vet shortly, but I'll be ready to take a look when I get back.

Good thought on uploading via link, but I didn't see how to do that via FP RESTful API docs. Should be possible.

AndrewMagliozzi commented 10 years ago

I think you can just pass the URL instead of the local file path. Let's try it when you get back.

AndrewMagliozzi commented 10 years ago

curl -X POST -d "url=palmzlib.sourceforge.net/images/pengbrew.png" "filepicker.io/api/store/S3?key=MY_API_KEY&path=/images/…"

btbonval commented 10 years ago

aha, it's in the API.

curl -X POST -d url="https://www.inkfilepicker.com/static/img/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY

This is how you specify the URL to FP and let them download it.
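
The same call can be sketched with the standard library instead of curl. This is written for Python 3 and only builds the request without sending it; the endpoint and the `url` and `key` parameters are taken from the curl line above, while the helper name `build_store_request` is made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

STORE_ENDPOINT = "https://www.filepicker.io/api/store/S3"

def build_store_request(remote_url, api_key):
    """Build (but do not send) a POST asking Filepicker to fetch remote_url itself."""
    body = urlencode({"url": remote_url}).encode("ascii")
    return Request(
        STORE_ENDPOINT + "?" + urlencode({"key": api_key}),
        data=body,
        method="POST",
    )

req = build_store_request(
    "https://www.inkfilepicker.com/static/img/watermark.png", "MY_API_KEY"
)
# urllib.request.urlopen(req) would perform the call; omitted since it needs a real key.
```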

btbonval commented 10 years ago

Getting non-unique error from same course over different academic years.

DETAIL:  Key (school_id, name, instructor_name)=(10464, Designing Your Life, Gabriella Jordan, Lauren Zander) already exists.

There is a unique constraint which does not include Academic Year but should.

However, there is no way to add Academic Year in the form. #253

Also we need to toss department into the import following completion of #236

btbonval commented 10 years ago

Notes are duplicating. It appears Django is deciding to insert instead of update. One note has license and upstream_link set, the other does not. There is a single call of gdrive's convert_raw_document over a single RawDocument object.

btbonval commented 10 years ago

RawDocument is updated in convert_raw_document. Note only has save called once, excepting possibly the call to sanitize_html or some other Note method which might do its own save.

btbonval commented 10 years ago

RawDocument.save calls celery to run convert_raw_document via process_raw_document.

So celery does it one time and the conversion code does it one time.

btbonval commented 10 years ago

#253 is no longer the fix for Academic Year unique problems.

remove "year" from the get_or_create statement so that it grabs the correct course agnostic of year.
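
To illustrate the idea without Django, here is a dict-backed stand-in: the lookup key matches the unique constraint (school, name, instructor) and the year is only used when a course is first created. In Django the rough equivalent would be passing the year through the `defaults` argument of `get_or_create`. The field names below are assumptions.

```python
courses = {}  # (school_id, name, instructor_name) -> course record

def get_or_create_course(school_id, name, instructor_name, year):
    """Look up by the unique key only; year is used solely when creating."""
    key = (school_id, name, instructor_name)
    created = key not in courses
    if created:
        courses[key] = {
            "school_id": school_id,
            "name": name,
            "instructor_name": instructor_name,
            "academic_year": year,
        }
    return courses[key], created

first, made_first = get_or_create_course(
    10464, "Designing Your Life", "Gabriella Jordan, Lauren Zander", 2007
)
second, made_second = get_or_create_course(
    10464, "Designing Your Life", "Gabriella Jordan, Lauren Zander", 2009
)
```

The second call returns the existing course instead of tripping the unique constraint; the years here are invented for the example.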

btbonval commented 10 years ago

VM is sucking in courses.

Start new VM from scratch, suck in ALL notes.

If that works, move to beta.

btbonval commented 10 years ago

Upload to VM one time. If everything works well, switch over to using dump_json and restore_json to bring the VM notes over to beta.

btbonval commented 10 years ago

Before testing VM, complete Professor stuff in #235. Email addresses can be added later.

btbonval commented 10 years ago

Script is updated with professor stuff in #235 and department stuff in #236.

Script is running through all JSON on VM as we speak.

If successful, the script should be all set and this ticket can be closed.

btbonval commented 10 years ago

That's a lot of notes!

$ grep '"link":' *.json | wc -l
24415
$ grep '"link":' *.json | uniq | wc -l
24289

There is concern about:

  1. transaction limits across uploads to Filepicker
  2. storage size on Google Drive
  3. storage size on EC2/Linode HDs for note html/text in the database

so yeah. let's not run all these.

Running these departments' notes: BCS, EECS, Math. We'll see where that gets us.
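
A caveat on the counts above: `uniq` only collapses adjacent duplicate lines, and grep over several files prefixes each line with its file name, so 24289 may overstate the number of distinct links. A small sketch for counting them directly follows. It makes no assumption about the top-level JSON layout beyond notes carrying a "link" key, and the sample data is invented.

```python
def collect_links(node, found=None):
    """Recursively gather every "link" string value from parsed JSON."""
    found = set() if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "link" and isinstance(value, str):
                found.add(value)
            else:
                collect_links(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_links(item, found)
    return found

# Invented sample; real use would json.load() each scraped file and share one set.
sample = [
    {"noteLinks": [{"link": "http://example.invalid/a.pdf", "fileName": "a.pdf"},
                   {"link": "http://example.invalid/b.pdf", "fileName": "b.pdf"}]},
    {"noteLinks": [{"link": "http://example.invalid/a.pdf", "fileName": "a.pdf"}]},
]
unique_links = collect_links(sample)
```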

btbonval commented 10 years ago

Got to here and errored:

Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-59j-psycholinguistics-spring-2005/lecture-notes/0407_speech_1.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 155, in handle
    convert_raw_document(dbnote)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 202, in convert_raw_document
    file_dict = upload_to_gdrive(service, media, filename, mimetype=mimetype)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 152, in upload_to_gdrive
    convert=True, ocr=ocr).execute()
...
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/apiclient/http.py", line 816, in _process_response
    raise HttpError(resp, content, uri=self.uri)
apiclient.errors.HttpError: <HttpError 500 when requesting https://www.googleapis.com/upload/drive/v2/files?uploadType=resumable&convert=true&ocr=true&alt=json returned "Internal Error">

Google Drive returned Error 500. That's not a good sign, and also nothing I can do about it.

btbonval commented 10 years ago

Meh. Guess Google wanted a break. Kicked on the script and it picked up where it left off. No problems for now.
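
Since a plain restart was enough, a retry with backoff around the upload might save the manual kick. This is a generic sketch, not tied to the Google API client: it retries on any exception, where real code would probably catch only the client's HTTP error for 5xx responses. The attempt count and delays are arbitrary.

```python
import time

def call_with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n and retry, re-raising at the end."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a fake upload that fails twice, recording delays instead of sleeping.
calls = {"n": 0}

def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP 500")
    return "ok"

delays = []
result = call_with_retries(flaky_upload, sleep=delays.append)
```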

btbonval commented 10 years ago

Looks like Google Drive is having a hard time chewing this doc:

Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_23_2004_fin.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 155, in handle
    convert_raw_document(dbnote)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 217, in convert_raw_document
    note.html = pdf2html(original_content)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 83, in pdf2html
    raise ValueError("PDF file could not be processed")
ValueError: PDF file could not be processed

This problem repeats. Going to the site, the PDF appears to load just fine. There's some problem with pdf2html around line 83. https://github.com/FinalsClub/karmaworld/blob/52982fda8ac88654ac75c5759a09c0f67a7aa9cd/karmaworld/apps/notes/gdrive.py#L83

btbonval commented 10 years ago

Removed that note from the JSON file and kicked the process off. It'll keep working through BCS notes in the meantime.

btbonval commented 10 years ago

Another pdf2html failure on http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec2_vvp_ip.pdf

btbonval commented 10 years ago

Another failure on http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf.

I think I need to write some code in the import thing that allows it to log problems like this to file (for later review) but continues to run.
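
Something along these lines, as a sketch: wrap each note in a try/except, write the traceback to a log file, and keep going. The function name, logger name, and log file name are placeholders, and `process` stands in for whatever does the upload and conversion.

```python
import logging
import os
import tempfile

def import_notes(links, process, log_path):
    """Run process(link) for each link, logging failures to log_path instead of stopping."""
    logger = logging.getLogger("ocw_import")
    handler = logging.FileHandler(log_path)
    logger.addHandler(handler)
    failed = []
    try:
        for link in links:
            try:
                process(link)
            except Exception:
                # logger.exception records the message plus the full traceback.
                logger.exception("failed to import %s", link)
                failed.append(link)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return failed

def fake_process(link):
    if link.endswith("bad.pdf"):
        raise ValueError("PDF file could not be processed")

log_path = os.path.join(tempfile.mkdtemp(), "import_errors.log")
failed = import_notes(["a/good.pdf", "a/bad.pdf", "a/also_good.pdf"], fake_process, log_path)
```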

btbonval commented 10 years ago

Just saw a pdf2html error: *** glibc detected *** pdf2htmlEX: double free or corruption (out): 0xb75a2008 ***

Error recovery seems to be working. If a note fails to convert, it is removed from the database instead of being left with empty html/text fields. commit in 732f89f98b3126f81a0d79b48d5009b8edd48d16

Since some notes failed to convert before deletion was added to the code, notes are tested for being partial as they are parsed in JSON. If a partial note is found, it is deleted and then reprocessed as though it hadn't been there at all. This means partial notes are removed and then convert is run on the rawdocument all over again.
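
The partial-note check can be sketched roughly as below, using plain dicts in place of the Note model. Treating a note as partial when either html or text is empty is an assumption; the function names are made up.

```python
def is_partial(note):
    """A note is partial if conversion never filled in its html or text."""
    return not note.get("html") or not note.get("text")

def links_to_process(links, existing):
    """Yield links that need conversion, dropping partial notes so they get redone."""
    for link in links:
        note = existing.get(link)
        if note is not None and is_partial(note):
            del existing[link]  # remove the half-converted note
            note = None
        if note is None:
            yield link

existing = {
    "done.pdf": {"html": "<p>ok</p>", "text": "ok"},
    "partial.pdf": {"html": "", "text": ""},
}
todo = list(links_to_process(["done.pdf", "partial.pdf", "new.pdf"], existing))
```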

btbonval commented 10 years ago

I think gdrive_url is being cached. We should make use of this somewhere to prevent uploading the same file many times.

btbonval commented 10 years ago

Finished BCS notes. That took forever. ~439 notes. PDF errors:

http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec2_vvp_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_23_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_28_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec5b_faces_ip.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec6_attn.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-71-functional-mri-of-high-level-vision-fall-2007/lecture-notes/lec9_pattern.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-98-neuropharmacology-january-iap-2009/lecture-notes/lecture_2.pdf
http://ocw.mit.edu/courses/health-sciences-and-technology/hst-722j-brain-mechanisms-for-hearing-and-speech-fall-2005/lecture-notes/7_melcher_handot.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-66j-computational-cognitive-science-fall-2004/lecture-notes/sept_28_2004_fin.pdf
http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-916-special-topics-social-animals-fall-2009/lecture-notes/MIT9_916F09_lec04.pdf

btbonval commented 10 years ago

EECS is now parsing in my VM.

AndrewMagliozzi commented 10 years ago

Strange those notes errored. There doesn't seem to be anything unusual about them. If pdf2html fails, do we roll back to PDF.js for display?

btbonval commented 10 years ago

Not as far as I know. The process of converting a RawDocument to a Note includes the full conversion process. I don't know how to make PDF.js work with the current code. Not a bad idea if it is doable.

New error happened overnight while processing EECS, appears to be from IndexDen:

Uploading link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-001-structure-and-interpretation-of-computer-programs-spring-2005/lecture-notes/lecture9webhand.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 257, in note_save_receiver
    index.update_note(note, note.old_instance)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/search.py", line 117, in update_note
    self.index.add_document(new_note.id, SearchIndex._note_to_dict(new_note), variables={0: new_note.thanks})
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 179, in add_document
    _request('PUT', self.__docs_url(), data=data)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 457, in _request
    raise HttpException(response.status, response.body) 
indextank.client.HttpException: HTTP 500: Incorrect api call

btbonval commented 10 years ago

Here's a new error due to some DB problem:

Uploading link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2008/lecture-notes/MIT6_005f08_lec03.pdf to FP.
Saving raw document to database.
Sending to GDrive and saving note to database.
this is the mimetype of the document to check:
application/pdf
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py", line 170, in handle
    convert_raw_document(dbnote)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 235, in convert_raw_document
    note.save()
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 218, in save
    super(Note, self).save(*args, **kwargs)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/models.py", line 124, in save
    super(Document, self).save(*args, **kwargs)
...
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 52, in execute
    return self.cursor.execute(query, args)
django.db.utils.DatabaseError: invalid byte sequence for encoding "UTF8": 0x93

btbonval commented 10 years ago

Another UTF8 error on http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2008/lecture-notes/MIT6_005f08_lec08.pdf
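
For what it's worth, 0x93 is the left double quotation mark in Windows-1252, so the converted text may be carrying legacy-encoded "smart quotes". A possible workaround, sketched below, is to decode as UTF-8 and fall back to Windows-1252 before saving. That the bytes really are Windows-1252 is a guess, and `to_text` is a made-up helper name.

```python
def to_text(raw):
    """Decode bytes as UTF-8, falling back to Windows-1252 for legacy bytes like 0x93."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" guards against the few byte values cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")

quoted = to_text(b"\x93quoted\x94")
accented = to_text("caf\u00e9".encode("utf-8"))
```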

AndrewMagliozzi commented 10 years ago

Not sure if it's relevant, but these errors seem to be coming from PowerPoint documents that have been converted to PDF...

btbonval commented 10 years ago

Alright, got 556 RawDocuments and 556 Notes in the database. Also 46 distinct professor fields (not to say distinct professors) and 94 courses.

Time to turn over to #89 to move these objects between systems.

I think this ticket is probably worthy of being closed. There are some outstanding comments which should be addressed, but they might be better placed in their own tickets (with a lesser priority).

AndrewMagliozzi commented 10 years ago

If we close this ticket, let's make another for "figure out how to upload the rest of the MIT OCW content"

btbonval commented 10 years ago

Actually, while looking over some of this stuff as it's converted to JSON, I see some department foreign keys are missing. Investigating why they aren't there in some cases.

btbonval commented 10 years ago

Turns out there are only 4 courses of 94 that came from EECS (as opposed to BCS). Maybe it would be worth running more EECS notes.

AndrewMagliozzi commented 10 years ago

Yes. Let's run more notes.

btbonval commented 10 years ago

I ran another 10 or so notes before another error occurred, but it was still the same course in EECS. Man that course must have some really spotty notes ;)

So now it's 560 notes across the same 94 courses. I'm going to refocus efforts on dump/restore. We can run more notes later.

btbonval commented 10 years ago

This is now held up by #273 but once that ticket is done, this process of moving notes should be super slick.

btbonval commented 10 years ago

#273 is far enough along to fix up the OCW import code. Actually, it looks like the changes made should already be incorporated:

https://github.com/FinalsClub/karmaworld/blob/feature_html_on_s3/karmaworld/apps/notes/management/commands/import_ocw_json.py#L171

This line will need to be changed: https://github.com/FinalsClub/karmaworld/blob/feature_html_on_s3/karmaworld/apps/notes/management/commands/import_ocw_json.py#L126

Unsure if I want to edit this stuff in the feature_html_on_s3 branch...

For now, let's try some workspace changes as a proof of concept:

  1. remove IndexDen so that won't error
  2. add a counter so no more than 3 notes are uploaded
  3. check that the notes are on S3 and display on the local website
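
For the counter in step 2, a minimal sketch using `itertools.islice`, which stops after a fixed number of items no matter how many links the JSON holds. The helper name `limited` is hypothetical; the cap of 3 is the one from the list above.

```python
from itertools import islice

MAX_NOTES = 3  # cap from the proof-of-concept list above

def limited(links, limit=MAX_NOTES):
    """Yield at most `limit` links from any iterable."""
    return islice(links, limit)

picked = list(limited(["n1.pdf", "n2.pdf", "n3.pdf", "n4.pdf", "n5.pdf"]))
```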