FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0

Import MIT Notes #68

Closed sethwoodworth closed 10 years ago

sethwoodworth commented 11 years ago

KarmaNotes is using CC-by on all pages.

inherit OCW CC-by-nc onto OCW pages for both course and note.

Possibly create a license table with two entries to start: index 0 = CC-by, 1 = CC-by-nc. Add a license FK to the course and note models, pointing at the license table.

Default = 0 for KarmaNotes.

Importing from OCW will explicitly set license to 1.
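
A minimal sketch of the license table idea, using Python's built-in sqlite3 instead of the project's Django models. The table and column names (`license`, `course`, `note`, `license_id`) are hypothetical; only the two entries and the default of 0 come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE license (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO license (id, name) VALUES (0, 'CC-by'), (1, 'CC-by-nc');
CREATE TABLE course (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license(id)
);
CREATE TABLE note (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    course_id INTEGER NOT NULL REFERENCES course(id),
    license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license(id)
);
""")

# A KarmaNotes course picks up the default license (0 = CC-by).
conn.execute("INSERT INTO course (id, name) VALUES (1, 'Economics 10')")
# An OCW import sets the license explicitly (1 = CC-by-nc).
conn.execute(
    "INSERT INTO course (id, name, license_id) "
    "VALUES (2, 'Introduction to Neuroscience', 1)")
conn.execute(
    "INSERT INTO note (name, course_id, license_id) "
    "VALUES ('09_vision1.pdf', 2, 1)")

rows = conn.execute(
    "SELECT c.name, l.name FROM course AS c "
    "JOIN license AS l ON l.id = c.license_id ORDER BY c.id").fetchall()
print(rows)
```

In Django terms this would presumably be a small License model plus a ForeignKey with a default on Course and Note, but that code is not shown in this thread.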

btbonval commented 10 years ago

Made all the SearchIndex methods return so they won't flip out. Added a counter for three courses. Tried to import.

This is weird. The script tries to retrieve a school, but never tries to save the school.

(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py import_ocw_json .
IntegrityError: null value in column "school_id" violates not-null constraint

School retrieval works just fine. School.id is not null, so it shouldn't be set to null anywhere it is saved to.

In [1]: from apps.courses.models import School
In [2]: dbschool = School.objects.filter(usde_id=121415)[0]
In [3]: dbschool
Out[3]: <School: School 121415: Massachusetts Institute of Technology>
In [4]: dbschool.id
Out[4]: 1839
btbonval commented 10 years ago

Only three models save school_id.

courses/models.py:    school      = models.ForeignKey(School) # Should this be optional ever?
courses/models.py:    school      = models.ForeignKey(School) 
users/models.py:    school    = models.ForeignKey(School, blank=True, null=True)

Users isn't touched in this file. Department and Course have FKs. Department and Course both save the retrieved school. Might need to pdb this one.

btbonval commented 10 years ago

Got it. Course uses get_or_create() with only name and department. If the Course doesn't exist, then it is created, but it isn't given a School at that time.

The NOT NULL constraint probably doesn't add much to Course, given that school will be removed in favor of department. For now, I'm just going to drop that constraint in the working database.
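
A rough sketch of the failure and one possible fix, written against sqlite3 so it runs on its own. `get_or_create_course` is a hypothetical helper that mimics the lookup-then-insert behaviour; the department id 7 is made up, and 1839 is the School id from the shell session above. Django's real `get_or_create()` accepts a `defaults` argument that plays the same role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE course (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER NOT NULL,
    school_id INTEGER NOT NULL
)""")


def get_or_create_course(conn, name, department_id, defaults=None):
    """Look up by name and department only; `defaults` is used
    solely when a new row has to be inserted."""
    row = conn.execute(
        "SELECT id FROM course WHERE name = ? AND department_id = ?",
        (name, department_id)).fetchone()
    if row is not None:
        return row[0], False
    school_id = (defaults or {}).get("school_id")
    cur = conn.execute(
        "INSERT INTO course (name, department_id, school_id) VALUES (?, ?, ?)",
        (name, department_id, school_id))
    return cur.lastrowid, True


# Without a school, creating the row violates the NOT NULL constraint.
try:
    get_or_create_course(conn, "Introduction to Neuroscience", 7)
    failed = False
except sqlite3.IntegrityError as exc:
    failed = True
    print("IntegrityError:", exc)

# Supplying the school as a creation-time default succeeds.
course_id, created = get_or_create_course(
    conn, "Introduction to Neuroscience", 7, defaults={"school_id": 1839})
# A second call finds the existing row instead of inserting again.
same_id, created_again = get_or_create_course(
    conn, "Introduction to Neuroscience", 7, defaults={"school_id": 1839})
```

Passing the school through `defaults` keeps the lookup on name and department only, so the constraint could stay in place until school is actually removed from Course.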

btbonval commented 10 years ago

Man, these errors are terrible without tracebacks. Something about this debug stuff actually prevents tracebacks. They used to show up in prod mode on the VM, but this prod/dev chimera doesn't show them. Either that or a change was made to the Django env.

Course is in the database: Introduction to Neuroscience
Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-01-introduction-to-neuroscience-fall-2007/lecture-notes/09_vision1.pdf to FP.
HTTPError: 403 Client Error: FORBIDDEN
btbonval commented 10 years ago

403 from here: https://github.com/FinalsClub/karmaworld/blob/200ca5cdb1e2faa36f10906e1bc2da2215aafe80/karmaworld/apps/notes/management/commands/import_ocw_json.py#L142-L145

Filepicker didn't use to return forbidden.

btbonval commented 10 years ago

We did just change the filepicker API. I suppose it won't hurt to use the old one as a test.

yup. original filepicker API works fine, newer one fails. Does that mean beta Filepicker will fail?

btbonval commented 10 years ago

Things look mostly good. The static URL is returning error 403. https://s3.amazonaws.com/karma-beta/html/09_vision1pdf.html

While this link works just fine both in a new tab and imported onto the page: https://s3.amazonaws.com/karma-beta/css/global.css

btbonval commented 10 years ago

Folders within buckets do not have special permissions. Buckets have permissions as a whole.

farg.

btbonval commented 10 years ago

Interestingly, "Static web hosting" is not enabled for the bucket at all. So whatever we're doing, we're not checking those tick marks.

Man, I remember this from before. There's some evil voodoo crap going on: some things work and some do not. Last time I had to nuke the VM and start over, and suddenly CSS and so forth started working from S3, with no changes made to the S3 server.

btbonval commented 10 years ago

The original S3 static hosting instructions used certainly did not mention anything at all about changing S3 settings themselves, just how to make Django push static files up to S3. https://github.com/FinalsClub/karmaworld/issues/65#issuecomment-12974597

Previous dark time with no real resolution: https://github.com/FinalsClub/karmaworld/issues/192

I think we're not doing it right, but somehow we're getting lucky.

btbonval commented 10 years ago

Each S3 object has its own permissions. There is no way to inherit permissions from the bucket. There is no way to batch apply permissions across all objects in a bucket through the S3 interface.

The only answer here is to change permissions on the Key at upload time in the Note.send_to_s3() code.
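
A hedged sketch of setting the permission at upload time. The real Note.send_to_s3() code is not shown in this thread, so `send_html_to_s3` is a hypothetical stand-in, and it assumes a client exposing boto3's S3 `put_object` call (the project may well use a different library). A recording stub replaces the real client so the example runs without credentials.

```python
def send_html_to_s3(client, bucket, key, html):
    """Upload HTML and mark that one object publicly readable.

    `client` is assumed to behave like boto3.client("s3").
    """
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html",
        ACL="public-read",  # per-object permission, set at upload time
    )


class RecordingClient(object):
    """Stand-in client for a dry run; it only records the arguments."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)


fake = RecordingClient()
send_html_to_s3(fake, "karma-beta", "html/09_vision1pdf.html", "<p>notes</p>")
print(fake.calls[0]["ACL"])
```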

btbonval commented 10 years ago

Migrating Filepicker URLs won't work directly. Files uploaded to beta are on the Beta filepicker account, so the links are under that account's management.

  1. Can we check if the fp_file URL is owned by the current Filepicker API?
  2. Can we add a hook to Python's load_data which checks that, and if not owned, migrates it over?
AndrewMagliozzi commented 10 years ago

I don't believe there is a way to check which account a link belongs to.

AndrewMagliozzi commented 10 years ago

We'll have to use the prod Filepicker account creds on your VM for the MIT data.

btbonval commented 10 years ago

I'm hoping the Filepicker API will have something clever I can use. Even if it has to do it by checksum (which could be slow, but worthwhile).

Actually I have been using the prod Filepicker account on my VM, which has a secondary benefit: I can see the HTML uploaded on S3.

Beta's Filepicker uploads to Filepicker S3 or whatever that we don't have access to.

btbonval commented 10 years ago

You're right. Nothing helpful with the Filepicker API. You can CRUD each file given its filepicker URL, but there isn't even a way to list files. https://developers.inkfilepicker.com/docs/web/#rest

That puts a very minor wrench in the cogs. It means we won't be able to test this import stuff on beta without pointing at prod's static S3 URL. Easy thing to do for a quick read test, and then change it back. -Bryan

btbonval commented 10 years ago

Since the first 15 or so notes were converted to HTML poorly, I deleted them on S3. (I also deleted the other things converted poorly with HTML in the database.) https://github.com/FinalsClub/karmaworld/issues/273#issuecomment-32675516

Reran populate_s3 to fix the stuff with HTML in the database.

That left 15 notes that aren't statically hosted on S3, of which a few are from the previous import OCW tests:

karmanotes=# SELECT cc.id, cc.slug FROM courses_course AS cc, notes_note AS nn WHERE nn.static_html = FALSE AND cc.id = nn.course_id;
 id  |                         slug                          
-----+-------------------------------------------------------
  52 | economics-10
  65 | culture-and-belief-17-the-roman-games
  76 | societies-of-the-world-39-slavery-and-slave-trade
  45 | metaphysical-poetry
  55 | history-1330-social-thought-in-modern-america
  39 | government-1295-comparative-politics-in-latin-america
  46 | psychology-13-cognitive-psychology
  48 | us-and-the-world-13-medicine-and-society-in-america
  54 | government-1540-the-american-presidency
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
(15 rows)

Intro to Neuroscience is from the previous OCW attempt. That has been cascade deleted and will be repopulated with the import OCW script.

What are these other things? They are all notes which have null html and null text.

karmanotes=# SELECT course_id, slug, length(html) AS html_len, length(text) AS text_len FROM notes_note WHERE static_html = FALSE;
 course_id |                       slug                       | html_len | text_len 
-----------+--------------------------------------------------+----------+----------
        52 | aggregate-demand-componentspdf                   |          |         
        65 | the-roman-games-study-guide                      |          |         
        76 | slavery-and-slave-trade-study-guide-11-9-378297  |          |         
        45 | classnotes-from-22305                            |          |         
        55 | guide-to-jello                                   |          |         
        39 | comparative-politics-of-latin-americ-class-notes |          |         
        46 | cognitive-psychology-notes                       |          |         
        48 | medicine-and-society-midterm-2-guide-11-9-60087  |          |         
        54 | the-american-presidency-study-guide              |          |         
(9 rows)
karmanotes=# SELECT length(NULL);
 length 
--------

(1 row)
karmanotes=# SELECT cc.name,nn.name,nn.uploaded_at,fp_file,mimetype,file_type,pdf_file,gdrive_url FROM notes_note AS nn, courses_course AS cc WHERE static_html = FALSE AND cc.id = nn.course_id;
                          name                           |                        name                        |          uploaded_at          | fp_file | mimetype | file_type | pdf_file | gdrive_url 
---------------------------------------------------------+----------------------------------------------------+-------------------------------+---------+----------+-----------+----------+------------
 Economics 10                                            | Aggregate Demand Components.pdf                    | 2013-11-09 18:11:36.495527+00 |         |          | ???       |          | 
 Culture and Belief 17 - The Roman Games                 | The Roman Games - Study Guide                      | 2013-11-09 18:11:50.345225+00 |         |          | ???       |          | 
 Societies of the World 39 - Slavery and Slave Trade     | Slavery and Slave Trade - Study Guide              | 2013-11-09 18:11:47.378297+00 |         |          | ???       |          | 
 Metaphysical Poetry                                     | Classnotes from 2/23/05                            | 2013-11-09 18:11:43.725581+00 |         |          | ???       |          | 
 History 1330 - Social Thought in Modern America         | Guide to Jello                                     | 2013-11-09 18:11:43.736942+00 |         |          | ???       |          | 
 Government 1295 - Comparative Politics in Latin America | Comparative Politics of Latin Americ - Class Notes | 2013-11-09 18:11:49.267428+00 |         |          | ???       |          | 
 Psychology 13 - Cognitive Psychology                    | Cognitive Psychology - Notes                       | 2013-11-09 18:11:46.11973+00  |         |          | ???       |          | 
 US and the World 13 - Medicine and Society in America   | Medicine and Society - Midterm 2 Guide             | 2013-11-09 18:11:47.060087+00 |         |          | ???       |          | 
 Government 1540 - The American Presidency               | The American Presidency - Study Guide              | 2013-11-09 18:11:49.523136+00 |         |          | ???       |          | 
(9 rows)

Interestingly, all the blank notes were uploaded on 9 November 2013. Probably not a coincidence. There is no information which might help recover these files besides the note name and course name. Even then, only the originator would know what file that name refers to. Deleting those notes from the database.
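
The blank-note check can also be written with IS NULL directly. This is a small sketch against an in-memory SQLite table rather than the production Postgres database: the table is cut down to a few columns, the rows are illustrative, and `static_html` is stored as 0/1 because SQLite has no boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes_note (id INTEGER PRIMARY KEY, slug TEXT, "
    "html TEXT, text TEXT, static_html INTEGER NOT NULL DEFAULT 0)")
conn.executemany(
    "INSERT INTO notes_note (slug, html, text, static_html) "
    "VALUES (?, ?, ?, ?)",
    [
        ("guide-to-jello", None, None, 0),
        ("classnotes-from-22305", None, None, 0),
        ("09_vision1pdf", "<p>vision</p>", "vision", 0),
    ])

# length(NULL) is NULL, which is why the length columns print as blanks.
null_length = conn.execute("SELECT length(NULL)").fetchone()[0]

# IS NULL picks out the notes that have neither html nor text.
blank = [row[0] for row in conn.execute(
    "SELECT slug FROM notes_note "
    "WHERE static_html = 0 AND html IS NULL AND text IS NULL "
    "ORDER BY slug")]
print(null_length, blank)
```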

btbonval commented 10 years ago

Running MIT OCW BCS dept notes on production in tmux window.

First note finished and shows up in the right course. http://www.karmanotes.org/massachusetts-institute-of-technology/introduction-to-neuroscience-121/09_vision1pdf

Looks good. Links open in a new window. Will leave the script running and check on it later.

AndrewMagliozzi commented 10 years ago

I think I can find those blank files again. Stay tuned.

Andrew

btbonval commented 10 years ago

BCS and Chemistry department notes uploaded for MIT OCW.

Beginning Anthropology and Economics.

All the notes in Intro to Anthro are missing, but the script now skips missing upstream links:

Course is in the database: Introduction to Anthropology
Uploading link http://ocw.mit.edu/courses/anthropology/21a-100-introduction-to-anthropology-fall-2004/lecture-notes/Ses1_OPENER.pdf to FP.
Failed to upload note: 404 Client Error: NOT FOUND
btbonval commented 10 years ago

Wrote a quick little ditty. Notes by department (I'm shooting for departments in the middle as we prioritize):

28 , ./Athletics, Physical Education, and Recreation.json 
36 , ./Literature.json 
63 , ./Writing and Humanistic Studies.json 
82 , ./History.json 
112 , ./Women's and Gender Studies.json 
140 , ./Media Arts and Sciences.json 
151 , ./Experimental Study Group.json 
174 , ./Music and Theater Arts.json 
177 , ./Science, Technology, and Society.json 
213 , ./Comparative Media Studies.json 
226 , ./Foreign Languages and Literatures.json 
312 , ./Special Programs.json 
320 , ./Architecture.json 
330 , ./Biology.json 
347 , ./Anthropology.json 
463 , ./Political Science.json 
475 , ./Nuclear Science and Engineering.json 
478 , ./Brain and Cognitive Sciences.json 
501 , ./Biological Engineering.json 
536 , ./Chemistry.json 
553 , ./Chemical Engineering.json 
602 , ./Health Sciences and Technology.json 
704 , ./Economics.json 
727 , ./Linguistics and Philosophy.json 
857 , ./Physics.json 
883 , ./Materials Science and Engineering.json 
943 , ./Urban Studies and Planning.json 
1088 , ./Earth, Atmospheric, and Planetary Sciences.json 
1166 , ./Engineering Systems Division.json 
1361 , ./Aeronautics and Astronautics.json 
1450 , ./Mechanical Engineering.json 
1484 , ./Civil and Environmental Engineering.json 
1926 , ./Management.json 
2186 , ./Mathematics.json 
3324 , ./Electrical Engineering and Computer Science.json
btbonval commented 10 years ago

Anthropology and Economics uploaded.

Physics and PolySci, why not? Launched for import.

AndrewMagliozzi commented 10 years ago

Awesome! PS - I found two more spam courses

AndrewMagliozzi commented 10 years ago

PPS - Can we remove all courses where the professor is null?

btbonval commented 10 years ago

There are 316 notes for courses taught by null professors.

karmanotes=# SELECT count(nn.id) FROM notes_note AS nn, courses_course AS cc, courses_professortaught AS cpt WHERE nn.course_id = cc.id AND cc.id = cpt.course_id AND cpt.professor_id = 1;
 count 
-------
   316
(1 row)

Are you asking me to clear out the MIT OCW courses which have no notes? The MIT OCW script has specific tags we can search against to find all courses which were uploaded by the script and have no notes.

The subquery would look something like this (minus the ORDER BY clause when it is used as a subquery):

SELECT cc.id, COUNT(nn.id) AS notes FROM courses_course AS cc INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id) LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id) WHERE tt.tag_id IN (108,109) GROUP BY cc.id ORDER BY notes ASC, cc.id ASC;
btbonval commented 10 years ago

231 MIT scraped courses have no notes. 200 MIT scraped courses have notes.

AndrewMagliozzi commented 10 years ago

That is exactly what I was thinking.

btbonval commented 10 years ago

Done. According to the front page, only one course has no notes now. I see you also deleted another spam course that popped up.

btbonval commented 10 years ago

Andrew and I agree this ticket is done, but we might continue the discussion about MIT notes on it.

btbonval commented 10 years ago

I did this to clean out MIT OCW courses with no notes. Ugly nested subqueries, but it is fast enough and gets the job done. Might be worth an additional join so the tag IDs are not hard coded.

DELETE FROM courses_course
WHERE id IN
    (SELECT id FROM
        (SELECT cc.id, COUNT(nn.id) AS notes
         FROM courses_course AS cc
             INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
             LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
         WHERE tt.tag_id IN (108,109) GROUP BY cc.id) AS subquery
     WHERE notes = 0);
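
A sketch of the "additional join" idea, so tags are matched by name instead of hard-coded IDs. It runs against toy tables in SQLite; the tag names `mit-ocw` and `ocw-import` are hypothetical placeholders for whatever tags 108 and 109 actually are. A HAVING clause stands in for the nested subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses_course (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE notes_note (id INTEGER PRIMARY KEY, course_id INTEGER);
CREATE TABLE taggit_tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE taggit_taggeditem (
    id INTEGER PRIMARY KEY, tag_id INTEGER, object_id INTEGER);
INSERT INTO courses_course VALUES
    (1, 'Tagged, has notes'), (2, 'Tagged, no notes'),
    (3, 'Untagged, no notes');
INSERT INTO notes_note VALUES (1, 1);
-- hypothetical tag names
INSERT INTO taggit_tag VALUES (108, 'mit-ocw'), (109, 'ocw-import');
INSERT INTO taggit_taggeditem (tag_id, object_id) VALUES (108, 1), (109, 2);
""")

TAG_NAMES = ("mit-ocw", "ocw-import")
placeholders = ", ".join("?" for _ in TAG_NAMES)
conn.execute("""
DELETE FROM courses_course
WHERE id IN (
    SELECT cc.id
    FROM courses_course AS cc
        INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
        INNER JOIN taggit_tag AS t ON (t.id = tt.tag_id)
        LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
    WHERE t.name IN ({0})
    GROUP BY cc.id
    HAVING COUNT(nn.id) = 0
)""".format(placeholders), TAG_NAMES)

remaining = [row[0] for row in conn.execute(
    "SELECT id FROM courses_course ORDER BY id")]
print(remaining)
```

Note that django-taggit's taggit_taggeditem table also carries a content_type_id column, so a production query would probably want to filter on it to avoid matching tagged objects of other models that happen to share an id.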
btbonval commented 9 years ago

python script for counting notes per course in the OCW json file.

import json
import sys

# filename supplied as the first argument
filename = sys.argv[1]

# load the json structure from the supplied filename
with open(filename, 'r') as fd:
    fc = json.load(fd)

# prepare some structures
courses = fc['courses']


def num_links(obj):
    # return number of links, or 0 if the key is missing
    return len(obj.get('noteLinks', []))


# sum the notes for all courses
nnotes = sum(num_links(course) for course in courses)

# parenthesized print works on both Python 2 and Python 3
print("{0},{1}".format(nnotes, filename))

Run it something like so:

find ./ -name "*.json" -print0 | xargs -0 -i% python ../count.py % | sort -n