Closed btbonval closed 9 years ago
This is still a problem. Going to tag the ticket as a bug because import_ocw_json is presumably broken.
https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L106-L108
ProfessorTaught was replaced with the many-to-many professor attribute in the course.
https://github.com/FinalsClub/karmaworld/blob/57c0252ee05a489f6218652efd8d85df830003bf/karmaworld/apps/courses/models.py#L262
There is a curious discussion about removing OCW courses without notes. https://github.com/FinalsClub/karmaworld/issues/68#issuecomment-32800490
I say curious because the code is meant to skip courses with no notes, and that code was in place more than two weeks prior to the above discussion. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L110-L112
Ah, the course is saved prior to continuing onto the next course. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L103
Updating the main ticket body to skip courses with no notes.
Looks like checking a course's note count can happen right at the start of the course's loop iteration. If no notes, then don't bother doing any more work.
In order to prevent the department from getting created, set dbdept to None initially. If notes are found for a course and dbdept is None, then create the department at that time.
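A rough sketch of that control flow, not the actual import_ocw_json code. It assumes each course entry carries a noteLinks list and a courseTitle (as in the scraper output), and it uses a hypothetical get_or_create_department callable in place of the real Django lookup:

```python
def import_courses(dept_name, courses, get_or_create_department):
    """Import courses, creating the department only once a course has notes.

    `get_or_create_department` is a hypothetical stand-in for the real
    Django get_or_create call; it is not part of import_ocw_json.
    """
    dbdept = None  # deferred: no department row until one is needed
    imported = []
    for course in courses:
        notes = course.get("noteLinks") or []
        if not notes:
            # Nothing to import, so skip before doing any database work.
            continue
        if dbdept is None:
            dbdept = get_or_create_department(dept_name)
        imported.append((dbdept, course["courseTitle"], len(notes)))
    return dbdept, imported
```

With this shape, a department whose courses all lack notes never touches the database at all.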
The good news is that the changes for keeping data local shouldn't need much work. convert_raw_document() was already in use and all newer changes have been added in that function.
Uploading link http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-747-classical-rhetoric-and-modern-political-discourse-fall-2009/lecture-notes/MIT21W_747_01F09_lec03.pdf to FP.
Failed to upload note: 400 Client Error: BAD REQUEST
Probably need to include the security stuff that we use? Though bad request is not permission denied. The FP documentation moved and no longer documents the url parameter, which references another file to download into FilePicker from the web.
https://developers.filepicker.io/docs/web/rest/#blob-store
We can't use the existing method which users use to upload, because it relies on the JS UI to deal with Filepicker and then hand back Filepicker's URL to our form.
It looks like we might have some issues with security and uploading files by URL that need to be dealt with in order to get this import up and running again.
url is not documented explicitly as such, but it is shown in a bunch of the examples. e.g.
curl -X POST -d url="https://d3urzlae3olibs.cloudfront.net/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY
Perhaps BAD REQUEST is their way of saying bad security. We lack the signature and stuff required for the security to work.
Security for FP is used in a number of places.
Yup, signature and policy fixed it.
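For reference, a minimal sketch of building the policy and signature for a store-by-URL request. It reflects my understanding of the Filepicker security scheme (a base64url-encoded JSON policy with expiry and call fields, signed with HMAC-SHA256 using the app secret). Passing policy and signature as query parameters next to key is an assumption, and the function names are made up for illustration. Nothing is sent over the network here.

```python
import base64
import hashlib
import hmac
import json
import time
from urllib.parse import urlencode


def make_policy(expiry, calls=("store",)):
    """Return a URL-safe base64 encoding of the JSON policy document."""
    policy = {"expiry": int(expiry), "call": list(calls)}
    raw = json.dumps(policy, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def sign_policy(policy_b64, app_secret):
    """Return the hex HMAC-SHA256 of the encoded policy, keyed by the secret."""
    return hmac.new(app_secret.encode("utf-8"),
                    policy_b64.encode("ascii"),
                    hashlib.sha256).hexdigest()


def build_store_request(api_key, app_secret, source_url, ttl=300):
    """Build (endpoint, form_data) for a store-by-URL POST; sends nothing."""
    policy = make_policy(time.time() + ttl)
    query = urlencode({
        "key": api_key,
        "policy": policy,
        "signature": sign_policy(policy, app_secret),
    })
    endpoint = "https://www.filepicker.io/api/store/S3?" + query
    return endpoint, {"url": source_url}
```

The returned endpoint and form data could then be handed to whatever HTTP client the importer already uses, mirroring the curl example above.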
Now I'm getting an unadorned file system access error.
Course is in the database: Classical Rhetoric and Modern Political Discourse
Uploading link http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-747-classical-rhetoric-and-modern-political-discourse-fall-2009/lecture-notes/MIT21W_747_01F09_lec01.pdf to FP.
Saving raw document to database.
Converting document and saving note text.
this is the mimetype of the document to check:
application/pdf
WARNING:oauth2client.util:new_request() takes at most 1 positional argument (2 given)
WARNING:oauth2client.util:new_request() takes at most 1 positional argument (2 given)
text -- https://docs.google.com/feeds/download/documents/export/Export?id=1lbpzIMr1Tn03amoPh9IOovzVGq1n4x5Mu6ib0cm28_o&exportFormat=txt
downloaded!
Failed: [Errno 2] No such file or directory
Aborting.
It looks like the last known text is from this block: https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L119-L129
It's likely that the code returned from download_from_gdrive here:
https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L203
Time to drop some pdb in place and find out!
New error, bypassed my pdb breakpoint:
Course is in the database: Classical Rhetoric and Modern Political Discourse
AttributeError: 'Note' object has no attribute 'html'
It looks like there are three notes in notes_note for this course, although nothing in notes_notemarkdown. I'm curious where Note.html is being called from, because that hasn't been a thing for some time.
Ah looks like some outdated code remains in the import file. https://github.com/FinalsClub/karmaworld/blob/master/karmaworld/apps/notes/management/commands/import_ocw_json.py#L125
Replaced that with NoteMarkdown stuff.
FS error comes from pdf2html().
https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L218
Specifically, it is from running pdf2htmlEX. https://github.com/FinalsClub/karmaworld/blob/note-editing-merge-more/karmaworld/apps/notes/gdrive.py#L77
It looks like there is no pdf2htmlEX command on my VM. This must be one of those fixes put in place on the Heroku systems using the build pack. Yeah:
https://github.com/FinalsClub/heroku-buildpack-karmanotes/blob/master/bin/steps/pdf2htmlex
This is not part of the README as a dependency, nor does it mesh into requirements.txt. Spawns #422
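Until the dependency is documented, a guard like the following would at least turn the bare "No such file or directory" into something readable. This is only a sketch: require_command and MissingDependencyError are hypothetical names, not anything in gdrive.py, and shutil.which needs Python 3.3 or later.

```python
import shutil


class MissingDependencyError(RuntimeError):
    """Raised when a required external command is not on PATH."""


def require_command(name):
    """Return the full path to `name`, or raise a descriptive error."""
    path = shutil.which(name)
    if path is None:
        raise MissingDependencyError(
            "Required command %r was not found on PATH" % name)
    return path
```

Calling require_command("pdf2htmlEX") before shelling out would fail early with the name of the missing tool.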
It seems like the processor just hung for some time. Of 63 notes in the department I chose, only 16 made it onto my VM database.
The code is written so that it should skip the successful notes and start up again where it left off. If the hanging problem is reproducible, it might be something I can troubleshoot.
The notes that are present look good!
sweet!
Alright, the script ran against a department with 63 notes and then exited of its own volition. Only 37 notes are accessible, unless there's some reason I can't find them?
When I ran the script again, all 63 notes were reported as "already uploaded."
I should print out the URLs if a note is already uploaded and I can review the URLs. Annoying but maybe helpful.
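Something along these lines, as a sketch only. It assumes a note is identified by its link, which may not be how the importer actually decides a note is already uploaded, and upload is a hypothetical callable standing in for the real upload step:

```python
def upload_new_notes(note_links, already_uploaded, upload):
    """Upload unseen notes; print and return the links that were skipped.

    `already_uploaded` is a set of links that is updated in place, so a
    second run over the same input skips everything.
    """
    skipped = []
    for note in note_links:
        link = note["link"]
        if link in already_uploaded:
            print("Already uploaded: %s" % link)
            skipped.append(link)
            continue
        upload(note)
        already_uploaded.add(link)
    return skipped
```

Note that a link repeated within a single run is also reported as already uploaded, so the printed URLs would expose duplicates in the input as well.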
I found three notes with the same final URL: /note/massachusetts-institute-of-technology/writing-and-reading-short-stories-152/mit21w_755s12_workshopspdf. That accounts for 2 missing notes.
And the entire run of course/writing-and-reading-short-stories-152/ is performed twice, including the three notes with the same URL. That accounts for 24 missing notes.
2 missing notes + 24 missing notes + 37 accounted notes = 63 notes.
Let's see what's going on with those collisions.
As far as I can tell, the one course is duplicated in full. Not sure if that is a bug with the MIT scraper or what. It seems like the importer handled it like a champ.
...
{
    "courseLink": "http://ocw.mit.edu//courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012",
    "courseStub": "21w-755-writing-and-reading-short-stories-spring-2012",
    "courseTitle": "Writing and Reading Short Stories",
    "professor": " Shariann Lewitt",
    "noteLinks": [
        {
...
{
    "courseLink": "http://ocw.mit.edu//courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012",
    "courseStub": "21w-755-writing-and-reading-short-stories-spring-2012",
    "courseTitle": "Writing and Reading Short Stories",
    "professor": " Shariann Lewitt",
    "noteLinks": [
        {
...
The workshop notes appear to be complete and perfect duplicates as well. Again, importer is performing just fine, it seems to be something with the scraper?
...
{
    "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
    "fileName": "MIT21W_755S12_workshops.pdf"
},
...
{
    "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
    "fileName": "MIT21W_755S12_workshops.pdf"
},
...
{
    "link": "http://ocw.mit.edu/courses/writing-and-humanistic-studies/21w-755-writing-and-reading-short-stories-spring-2012/lecture-notes/MIT21W_755S12_workshops.pdf",
    "fileName": "MIT21W_755S12_workshops.pdf"
},
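Duplicates like these could be spotted up front with a quick pass over the scraper output. A sketch, assuming the JSON loads to a flat list of course dicts with the courseStub and noteLinks keys shown above (the real file may nest them differently):

```python
from collections import Counter


def find_duplicates(courses):
    """Return (duplicate course stubs, duplicate note links) with counts."""
    stub_counts = Counter(c["courseStub"] for c in courses)
    link_counts = Counter(
        note["link"]
        for c in courses
        for note in c.get("noteLinks", [])
    )
    # Keep only the entries that occur more than once.
    dup_stubs = {s: n for s, n in stub_counts.items() if n > 1}
    dup_links = {l: n for l, n in link_counts.items() if n > 1}
    return dup_stubs, dup_links
```

Link counts are taken across the whole file, so a note repeated inside a duplicated course is counted once per appearance.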
Alright, well I see no problems with the importer here. There might be some issues with the scraper, but that is to be tracked elsewhere. In fact, the importer is clever enough to skip duplicates and seems to be populating everything just fine.
Closing this ticket.
huzzah!
On Tue, Mar 10, 2015 at 1:42 PM, Bryan Bonvallet notifications@github.com wrote:
Closed #359 https://github.com/FinalsClub/karmaworld/issues/359.
ProfessorTaught has been removed in favor of the same relationship defined implicitly through a Django ManyToManyField.
import_ocw_json still has the old reference. Needs to be updated.
It will continue to work until #358 is pulled in, but then it will break. (was pulled in a long time ago)
EDIT:
Updated task list.