FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0

import_ocw_json needs to be updated #359

Closed by btbonval 9 years ago

btbonval commented 10 years ago

ProfessorTaught has been removed in favor of the same relationship, now defined implicitly through a Django ManyToManyField.

import_ocw_json still references the old model and needs to be updated.

It will continue to work until #358 is pulled in, but then it will break. (Update: #358 was pulled in a long time ago.)

EDIT:

Updated task list.

btbonval commented 10 years ago

This is still a problem. Going to tag the ticket as a bug because import_ocw_json is presumably broken. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L106-L108

ProfessorTaught was replaced with the many-to-many professor attribute in the course. https://github.com/FinalsClub/karmaworld/blob/57c0252ee05a489f6218652efd8d85df830003bf/karmaworld/apps/courses/models.py#L262

btbonval commented 9 years ago

There is a curious discussion about removing OCW courses without notes. https://github.com/FinalsClub/karmaworld/issues/68#issuecomment-32800490

I say curious because the code is meant to skip courses with no notes, and that code was in place more than two weeks prior to the above discussion. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L110-L112

btbonval commented 9 years ago

Ah, the course is saved prior to continuing onto the next course. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L103

Updating the main ticket body to skip courses with no notes.

Looks like checking a course's note count can happen right at the start of the course's loop iteration. If no notes, then don't bother doing any more work.

In order to prevent department from getting created, set dbdept to None initially. If notes are found for a course and dbdept is None, then create the department at that time.
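A minimal sketch of that reordering, using plain dictionaries in place of the Django models. The `get_or_create_department` helper and the `department` and `courses` keys are hypothetical stand-ins; `noteLinks` and `courseTitle` are field names from the scraper JSON. The real importer may be structured differently.

```python
def get_or_create_department(name, created):
    """Hypothetical stand-in for the real department lookup/creation."""
    created.append(name)
    return {"name": name}


def import_department(dept_json, created):
    """Import one department, creating it only once a course has notes."""
    dbdept = None  # deferred: nothing is created until a note-bearing course appears
    imported = []
    for course in dept_json.get("courses", []):
        # Check the note count first and skip empty courses before any other work.
        if not course.get("noteLinks"):
            continue
        if dbdept is None:
            dbdept = get_or_create_department(dept_json["department"], created)
        imported.append((dbdept["name"], course["courseTitle"]))
    return imported
```

With this ordering, a department whose courses all lack notes never triggers a department creation at all.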

btbonval commented 9 years ago

The good news is that the changes for keeping data local shouldn't need much work. convert_raw_document() was already in use, and all newer changes have been added to that function.

btbonval commented 9 years ago

Uploading link http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-747-classical-rhetoric-and-modern-political-discourse-fall-2009/lecture-notes/MIT21W_747_01F09_lec03.pdf to FP.
Failed to upload note: 400 Client Error: BAD REQUEST

btbonval commented 9 years ago

We probably need to include the security parameters that we use elsewhere, though BAD REQUEST is not the same as permission denied. The FP documentation has moved and no longer documents the url parameter, which points Filepicker at a file on the web to download. https://developers.filepicker.io/docs/web/rest/#blob-store

btbonval commented 9 years ago

We can't use the existing method which users use to upload, because it relies on the JS UI to deal with Filepicker and then hand back Filepicker's URL to our form.

It looks like we might have some issues with security and uploading files by URL that need to be dealt with in order to get this import up and running again.

btbonval commented 9 years ago

url is not documented explicitly as such, but it is shown in a bunch of the examples. e.g. curl -X POST -d url="https://d3urzlae3olibs.cloudfront.net/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY

Perhaps BAD REQUEST is their way of saying bad security. We lack the signature and policy required for the security to work.

btbonval commented 9 years ago

Security for FP is used in a number of places.

btbonval commented 9 years ago

Yup, signature and policy fixed it.
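For reference, a rough sketch of how a policy and signature pair could be attached to a store-by-URL request. The policy layout (base64url-encoded JSON with `expiry` and `call`) and the HMAC-SHA256 hex signature are my understanding of Filepicker's security scheme, not something stated in this thread, so check them against the current documentation. The endpoint is the one from the curl example above; the function names and the `"store"` call name are assumptions. Nothing is sent over the network here.

```python
import base64
import hashlib
import hmac
import json
import time
from urllib.parse import urlencode


def make_policy(expiry_epoch, calls):
    """Base64url-encode a JSON policy naming the allowed calls and an expiry."""
    policy = {"expiry": int(expiry_epoch), "call": list(calls)}
    raw = json.dumps(policy, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def sign_policy(policy_b64, app_secret):
    """HMAC-SHA256 of the encoded policy, keyed with the app secret, as hex."""
    return hmac.new(
        app_secret.encode("utf-8"), policy_b64.encode("ascii"), hashlib.sha256
    ).hexdigest()


def store_request(source_url, api_key, app_secret, ttl=300):
    """Return (endpoint, form data) for a store-by-URL POST; sends nothing."""
    policy = make_policy(time.time() + ttl, ["store"])  # assumed call name
    query = urlencode(
        {
            "key": api_key,
            "policy": policy,
            "signature": sign_policy(policy, app_secret),
        }
    )
    return "https://www.filepicker.io/api/store/S3?" + query, {"url": source_url}
```

The returned pair can be handed to whatever HTTP client the importer already uses, with the form data carrying the `url` parameter as in the curl example.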

Now I'm getting an unadorned file system access error.

Course is in the database: Classical Rhetoric and Modern Political Discourse
Uploading link http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-747-classical-rhetoric-and-modern-political-discourse-fall-2009/lecture-notes/MIT21W_747_01F09_lec01.pdf to FP.
Saving raw document to database.
Converting document and saving note text.
this is the mimetype of the document to check:
application/pdf

WARNING:oauth2client.util:new_request() takes at most 1 positional argument (2 given)
WARNING:oauth2client.util:new_request() takes at most 1 positional argument (2 given)

text -- https://docs.google.com/feeds/download/documents/export/Export?id=1lbpzIMr1Tn03amoPh9IOovzVGq1n4x5Mu6ib0cm28_o&exportFormat=txt
         downloaded!
Failed: [Errno 2] No such file or directory
Aborting.

btbonval commented 9 years ago

It looks like the last known text is from this block: https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L119-L129

It's likely that execution returned from download_from_gdrive here: https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L203

Time to drop some pdb in place and find out!

btbonval commented 9 years ago

New error, bypassed my pdb breakpoint:

Course is in the database: Classical Rhetoric and Modern Political Discourse
AttributeError: 'Note' object has no attribute 'html'

It looks like there are three notes in notes_note for this course, although nothing in notes_notemarkdown. I'm curious where Note.html is being called from, because that hasn't been a thing for some time.

Ah looks like some outdated code remains in the import file. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L125

Replaced that with NoteMarkdown stuff.

btbonval commented 9 years ago

FS error comes from pdf2html(). https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L218

Specifically, it is from running pdf2htmlEX. https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L77

It looks like there is no pdf2htmlEX command on my VM. This must be one of those fixes put in place on the Heroku systems using the build pack. Yeah: https://github.com/FinalsClub/heroku-buildpack-karmanotes/blob/master/bin/steps/pdf2htmlex

pdf2htmlEX is not listed in the README as a dependency, nor does it fit into requirements.txt. Spawned #422.
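Until the dependency is documented, a preflight check could turn the bare "No such file or directory" into a readable message. This is only a sketch: `require_binary` is a hypothetical helper, and it assumes Python 3.3+ for `shutil.which`.

```python
import shutil


def require_binary(name, which=shutil.which):
    """Return the full path of an external command or raise a clear error."""
    path = which(name)
    if path is None:
        raise RuntimeError(
            "%s was not found on PATH; install it before converting PDFs" % name
        )
    return path
```

Calling `require_binary("pdf2htmlEX")` at the start of the import would fail fast, before any upload or conversion work is done. The `which` parameter exists only so the lookup can be swapped out in tests.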

btbonval commented 9 years ago

It seems like the processor just hung for some time. Of 63 notes in the department I chose, only 16 made it onto my VM database.

The code is written so that it should skip the successful notes and start up again where it left off. If the hanging problem is reproducible, it might be something I can troubleshoot.
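The skip-and-resume behavior can be sketched roughly as follows. This is an illustration, not the importer's actual code: `import_notes`, `already_uploaded`, and `upload` are hypothetical names, and the real importer checks the database rather than an in-memory set.

```python
def import_notes(note_links, already_uploaded, upload):
    """Upload only links not seen before; return (uploaded, skipped) lists."""
    uploaded, skipped = [], []
    for entry in note_links:
        link = entry["link"]
        if link in already_uploaded:
            # Report the URL so skipped notes can be reviewed afterwards.
            print("Already uploaded: %s" % link)
            skipped.append(link)
            continue
        upload(entry)
        already_uploaded.add(link)  # record success so a rerun resumes from here
        uploaded.append(link)
    return uploaded, skipped
```

Because a link is recorded only after `upload` returns, a run that hangs or aborts partway leaves the unfinished notes eligible for the next run.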

The notes that are present look good!

AndrewMagliozzi commented 9 years ago

sweet!

On Tue, Mar 10, 2015 at 1:44 AM, Bryan Bonvallet notifications@github.com wrote:

btbonval commented 9 years ago

Alright, the script ran against a department with 63 notes and then exited of its own volition. Only 37 notes are accessible, unless there's some reason I can't find them?

When I ran the script again, all 63 notes were "already uploaded."

I should print the URL whenever a note is already uploaded so that I can review them. Annoying but maybe helpful.

btbonval commented 9 years ago

I found three notes with the same final URL: /note/massachusetts-institute-of-technology/writing-and-reading-short-stories-152/mit21w_755s12_workshopspdf. That accounts for 2 missing notes.

And the entire run of course/writing-and-reading-short-stories-152/ is performed twice, including the three notes with the same URL. That accounts for 24 missing notes.

2 missing notes + 24 missing notes + 37 accounted notes = 63 notes.

Let's see what's going on with those collisions.
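A quick way to check the scraper output for such collisions is to count how often each note link appears. The sketch below assumes the JSON has been loaded into a list of course dictionaries; `noteLinks` and `link` are the scraper's own field names, while the function names are made up for illustration.

```python
from collections import Counter


def duplicate_links(courses):
    """Map each note link that occurs more than once to its occurrence count."""
    counts = Counter(
        note["link"]
        for course in courses
        for note in course.get("noteLinks", [])
    )
    return {link: n for link, n in counts.items() if n > 1}


def missing_count(courses):
    """Entries that cannot become distinct notes: total links minus unique links."""
    links = [n["link"] for c in courses for n in c.get("noteLinks", [])]
    return len(links) - len(set(links))
```

For the run above, total minus unique is 63 - 37 = 26, which matches the 2 + 24 missing notes.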

btbonval commented 9 years ago

As far as I can tell, the one course is duplicated in full. Not sure if that is a bug with the MIT scraper or what. It seems like the importer handled it like a champ.

...
    {
      "courseLink": "http://ocw.mit.edu//courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012",
      "courseStub": "21w-755-writing-and-reading-short-stories-spring-2012",
      "courseTitle": "Writing and Reading Short Stories",
      "professor": " Shariann Lewitt",
      "noteLinks": [
        {
...
    {
      "courseLink": "http://ocw.mit.edu//courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012",
      "courseStub": "21w-755-writing-and-reading-short-stories-spring-2012",
      "courseTitle": "Writing and Reading Short Stories",
      "professor": " Shariann Lewitt",
      "noteLinks": [
        {
...

btbonval commented 9 years ago

The workshop notes appear to be complete and perfect duplicates as well. Again, importer is performing just fine, it seems to be something with the scraper?

...
        {
          "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
          "fileName": "MIT21W_755S12_workshops.pdf"
        },
...
        {
          "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
          "fileName": "MIT21W_755S12_workshops.pdf"
        },
...
        {
          "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
          "fileName": "MIT21W_755S12_workshops.pdf"
        },

btbonval commented 9 years ago

Alright, well I see no problems with the importer here. There might be some issues with the scraper, but that is to be tracked elsewhere. In fact, the importer is clever enough to skip duplicates and seems to be populating everything just fine.

Closing this ticket.

AndrewMagliozzi commented 9 years ago

huzzah!

On Tue, Mar 10, 2015 at 1:42 PM, Bryan Bonvallet notifications@github.com wrote:
