Closed sethwoodworth closed 10 years ago
Made all the SearchIndex methods return so they won't flip out. Added a counter for three courses. Tried to import.
This is weird. The script tries to retrieve a school, but never tries to save the school.
(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py import_ocw_json .
IntegrityError: null value in column "school_id" violates not-null constraint
School retrieval works just fine. School.id is not null, so it shouldn't be set to null anywhere it is saved to.
In [1]: from apps.courses.models import School
In [2]: dbschool = School.objects.filter(usde_id=121415)[0]
In [3]: dbschool
Out[3]: <School: School 121415: Massachusetts Institute of Technology>
In [4]: dbschool.id
Out[4]: 1839
Only three models save school_id.
courses/models.py: school = models.ForeignKey(School) # Should this be optional ever?
courses/models.py: school = models.ForeignKey(School)
users/models.py: school = models.ForeignKey(School, blank=True, null=True)
Users isn't touched in this file. Department and Course have FKs. Department and Course both save the retrieved school. Might need to pdb this one.
Got it. Course uses get_or_create() with only name and department. If the Course doesn't exist, then it is created, but it isn't given a School at that time.
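A minimal sketch of that failure mode, using Python's stdlib sqlite3 instead of Django so it runs standalone. The table layout and the get_or_create helper here are hypothetical stand-ins, not the KarmaWorld models and not Django's implementation. The point is only that a create built from the lookup fields alone leaves the NOT NULL column empty; Django's get_or_create accepts the extra fields through its defaults argument.

```python
import sqlite3

# Hypothetical stand-in for the Course table: school_id is NOT NULL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department_id INTEGER NOT NULL,"
    " school_id INTEGER NOT NULL)"
)


def get_or_create(conn, lookup, defaults=None):
    """Return (row_id, created). Look up by `lookup`; on a miss, insert
    a row built from `lookup` plus `defaults`."""
    where = " AND ".join("{0} = ?".format(col) for col in lookup)
    row = conn.execute(
        "SELECT id FROM course WHERE " + where, list(lookup.values())
    ).fetchone()
    if row is not None:
        return row[0], False
    fields = dict(lookup)
    fields.update(defaults or {})
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    cur = conn.execute(
        "INSERT INTO course ({0}) VALUES ({1})".format(cols, marks),
        list(fields.values()),
    )
    return cur.lastrowid, True


lookup = {"name": "Introduction to Neuroscience", "department_id": 7}

# Lookup fields only: the insert has no school_id, so the constraint fires.
try:
    get_or_create(conn, lookup)
    failed = False
except sqlite3.IntegrityError:
    failed = True

# Supplying the school through defaults lets the create succeed.
course_id, created = get_or_create(conn, lookup, defaults={"school_id": 1839})
# A second call finds the existing row instead of creating another.
same_id, created_again = get_or_create(conn, lookup, defaults={"school_id": 1839})
```

In the import script the equivalent change would presumably be passing the retrieved school in defaults when calling get_or_create on Course, though the script itself isn't shown in this thread.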
The NOT NULL constraint probably doesn't add much to Course, given that school will be removed in favor of department. For now, I'm just going to drop the constraint in the working database.
Man, these errors are terrible without tracebacks. Something about this debug stuff actually prevents tracebacks. They used to show up in prod mode on the VM, but this prod/dev chimera doesn't show them. Either that or a change was made to the Django env.
Course is in the database: Introduction to Neuroscience
Uploading link http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-01-introduction-to-neuroscience-fall-2007/lecture-notes/09_vision1.pdf to FP.
HTTPError: 403 Client Error: FORBIDDEN
Filepicker didn't use to return forbidden.
We did just change the filepicker API. I suppose it won't hurt to use the old one as a test.
yup. original filepicker API works fine, newer one fails. Does that mean beta Filepicker will fail?
Things look mostly good. The static URL is returning error 403. https://s3.amazonaws.com/karma-beta/html/09_vision1pdf.html
While this link works just fine both in a new tab and imported onto the page: https://s3.amazonaws.com/karma-beta/css/global.css
folders within buckets do not have special permissions. buckets have permissions as a whole.
farg.
Interestingly, "Static web hosting" is not enabled for the bucket at all. So whatever we're doing, we're not checking those tick marks.
Man, I remember this from before. there's some evil voodoo crap going on. Some things work and some things do not work. Last time I had to nuke the VM and start over and suddenly CSS and so forth started working from S3. No changes to the S3 server made.
The original S3 static hosting instructions used certainly did not mention anything at all about changing S3 settings themselves, just how to make Django push static files up to S3. https://github.com/FinalsClub/karmaworld/issues/65#issuecomment-12974597
Previous dark time with no real resolution: https://github.com/FinalsClub/karmaworld/issues/192
I think we're not doing it right, but somehow we're getting lucky.
Each S3 object has its permissions. There is no way to inherit permissions from the bucket. There is no way to batch apply permissions across all objects in a bucket through the S3 interface.
The only answer here is to change permissions on the Key at upload time in the Note.send_to_s3() code.
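A rough sketch of what setting the permission at upload time could look like. Note.send_to_s3() isn't shown in this thread, so this is not that code: upload_public_html and RecordingClient are hypothetical names, and the call shape assumes a client that behaves like boto3's S3 client (whose put_object takes an ACL argument), which is probably not the library the project uses. The client is passed in so the sketch runs without AWS credentials.

```python
def upload_public_html(client, bucket, key, html):
    """Upload one HTML document and mark that object world-readable.

    `client` is expected to behave like a boto3 S3 client (assumption).
    """
    return client.put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html",
        ACL="public-read",  # per-object permission, set at upload time
    )


class RecordingClient(object):
    """Hypothetical stand-in that records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)
        return {}


fake = RecordingClient()
upload_public_html(fake, "karma-beta", "html/09_vision1pdf.html", "<p>notes</p>")
```

Objects that are already in the bucket would still need their ACL changed one at a time, since this only covers new uploads.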
Migrating Filepicker URLs won't work directly. Files uploaded to beta are on the Beta filepicker account, so the links are under that account's management.
Can we check if the fp_file URL is owned by the current Filepicker API? Can we add a hook to Python's load_data which checks that, and if not owned, migrates it over?
I don't believe there is a way to check which account a link belongs to.
We'll have to use the prod Filepicker account creds on your VM for the MIT data.
I'm hoping the Filepicker API will have something clever I can use. Even if it has to do it by checksum (which could be slow, but worthwhile).
Actually I have been using the prod Filepicker account on my VM, which has a secondary benefit: I can see the HTML uploaded on S3.
Beta's Filepicker uploads to Filepicker S3 or whatever that we don't have access to.
You're right. Nothing helpful with the Filepicker API. You can CRUD each file given its filepicker URL, but there isn't even a way to list files. https://developers.inkfilepicker.com/docs/web/#rest
That puts a very minor wrench in the cogs. It means we won't be able to test this import stuff on beta without pointing at prod's static S3 URL. Easy thing to do for a quick read test, and then change it back. -Bryan
Since the first 15 or so notes were converted to HTML poorly, I deleted them on S3. (I also deleted the other things that were converted poorly and had HTML in the database.) https://github.com/FinalsClub/karmaworld/issues/273#issuecomment-32675516
Reran populate_s3 to fix the stuff with HTML in the database.
That left 15 notes that aren't statically hosted on S3, of which a few are from the previous import OCW tests:
karmanotes=# SELECT cc.id, cc.slug FROM courses_course AS cc, notes_note AS nn WHERE nn.static_html = FALSE AND cc.id = nn.course_id;
id | slug
-----+-------------------------------------------------------
52 | economics-10
65 | culture-and-belief-17-the-roman-games
76 | societies-of-the-world-39-slavery-and-slave-trade
45 | metaphysical-poetry
55 | history-1330-social-thought-in-modern-america
39 | government-1295-comparative-politics-in-latin-america
46 | psychology-13-cognitive-psychology
48 | us-and-the-world-13-medicine-and-society-in-america
54 | government-1540-the-american-presidency
120 | introduction-to-neuroscience-120
120 | introduction-to-neuroscience-120
120 | introduction-to-neuroscience-120
120 | introduction-to-neuroscience-120
120 | introduction-to-neuroscience-120
120 | introduction-to-neuroscience-120
(15 rows)
Intro to Neuroscience is from the previous OCW attempt. That has been cascade deleted and will be repopulated with the import OCW script.
What are these other things? They are all notes which have null html and null text.
karmanotes=# SELECT course_id, slug, length(html) AS html_len, length(text) AS text_len FROM notes_note WHERE static_html = FALSE;
course_id | slug | html_len | text_len
-----------+--------------------------------------------------+----------+----------
52 | aggregate-demand-componentspdf | |
65 | the-roman-games-study-guide | |
76 | slavery-and-slave-trade-study-guide-11-9-378297 | |
45 | classnotes-from-22305 | |
55 | guide-to-jello | |
39 | comparative-politics-of-latin-americ-class-notes | |
46 | cognitive-psychology-notes | |
48 | medicine-and-society-midterm-2-guide-11-9-60087 | |
54 | the-american-presidency-study-guide | |
(9 rows)
karmanotes=# SELECT length(NULL);
length
--------
(1 row)
karmanotes=# SELECT cc.name,nn.name,nn.uploaded_at,fp_file,mimetype,file_type,pdf_file,gdrive_url FROM notes_note AS nn, courses_course AS cc WHERE static_html = FALSE AND cc.id = nn.course_id;
name | name | uploaded_at | fp_file | mimetype | file_type | pdf_file | gdrive_url
---------------------------------------------------------+----------------------------------------------------+-------------------------------+---------+----------+-----------+----------+------------
Economics 10 | Aggregate Demand Components.pdf | 2013-11-09 18:11:36.495527+00 | | | ??? | |
Culture and Belief 17 - The Roman Games | The Roman Games - Study Guide | 2013-11-09 18:11:50.345225+00 | | | ??? | |
Societies of the World 39 - Slavery and Slave Trade | Slavery and Slave Trade - Study Guide | 2013-11-09 18:11:47.378297+00 | | | ??? | |
Metaphysical Poetry | Classnotes from 2/23/05 | 2013-11-09 18:11:43.725581+00 | | | ??? | |
History 1330 - Social Thought in Modern America | Guide to Jello | 2013-11-09 18:11:43.736942+00 | | | ??? | |
Government 1295 - Comparative Politics in Latin America | Comparative Politics of Latin Americ - Class Notes | 2013-11-09 18:11:49.267428+00 | | | ??? | |
Psychology 13 - Cognitive Psychology | Cognitive Psychology - Notes | 2013-11-09 18:11:46.11973+00 | | | ??? | |
US and the World 13 - Medicine and Society in America | Medicine and Society - Midterm 2 Guide | 2013-11-09 18:11:47.060087+00 | | | ??? | |
Government 1540 - The American Presidency | The American Presidency - Study Guide | 2013-11-09 18:11:49.523136+00 | | | ??? | |
(9 rows)
Interestingly, all the blank notes were uploaded on 9 November 2013. Probably not a coincidence. There is no information that might help recover these files besides the note name and course name. Even then, only the originator would know what file that name refers to. Deleting those notes from the database.
Running MIT OCW BCS dept notes on production in tmux window.
First note finished and shows up in the right course. http://www.karmanotes.org/massachusetts-institute-of-technology/introduction-to-neuroscience-121/09_vision1pdf
Looks good. links open in a new window. Will leave the script running and check on it later.
I think I can find those blank files again. Stay tuned.
Andrew
BCS and Chemistry department notes uploaded for MIT OCW.
Beginning Anthropology and Economics.
All the notes in Intro to Anthro are missing, but the script now skips missing upstream links:
Course is in the database: Introduction to Anthropology
Uploading link http://ocw.mit.edu/courses/anthropology/21a-100-introduction-to-anthropology-fall-2004/lecture-notes/Ses1_OPENER.pdf to FP.
Failed to upload note: 404 Client Error: NOT FOUND
Wrote a quick little ditty. Notes by department (I'm shooting for departments in the middle as we prioritize):
28 , ./Athletics, Physical Education, and Recreation.json
36 , ./Literature.json
63 , ./Writing and Humanistic Studies.json
82 , ./History.json
112 , ./Women's and Gender Studies.json
140 , ./Media Arts and Sciences.json
151 , ./Experimental Study Group.json
174 , ./Music and Theater Arts.json
177 , ./Science, Technology, and Society.json
213 , ./Comparative Media Studies.json
226 , ./Foreign Languages and Literatures.json
312 , ./Special Programs.json
320 , ./Architecture.json
330 , ./Biology.json
347 , ./Anthropology.json
463 , ./Political Science.json
475 , ./Nuclear Science and Engineering.json
478 , ./Brain and Cognitive Sciences.json
501 , ./Biological Engineering.json
536 , ./Chemistry.json
553 , ./Chemical Engineering.json
602 , ./Health Sciences and Technology.json
704 , ./Economics.json
727 , ./Linguistics and Philosophy.json
857 , ./Physics.json
883 , ./Materials Science and Engineering.json
943 , ./Urban Studies and Planning.json
1088 , ./Earth, Atmospheric, and Planetary Sciences.json
1166 , ./Engineering Systems Division.json
1361 , ./Aeronautics and Astronautics.json
1450 , ./Mechanical Engineering.json
1484 , ./Civil and Environmental Engineering.json
1926 , ./Management.json
2186 , ./Mathematics.json
3324 , ./Electrical Engineering and Computer Science.json
Anthropology and Economics uploaded.
Physics and PolySci, why not? Launched for import.
Awesome! PS - I found two more spam courses
PPS - Can we remove all courses where the professor is null?
There are 316 notes for courses taught by null professors.
karmanotes=# SELECT count(nn.id) FROM notes_note AS nn, courses_course AS cc, courses_professortaught AS cpt WHERE nn.course_id = cc.id AND cc.id = cpt.course_id AND cpt.professor_id = 1;
count
-------
316
(1 row)
Are you asking me to clear out the MIT OCW courses which have no notes? The MIT OCW script has specific tags we can search against to find all courses which were uploaded by the script and have no notes.
The subquery would look something like this (no ORDER BY clause):
SELECT cc.id, COUNT(nn.id) AS notes FROM courses_course AS cc INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id) LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id) WHERE tt.tag_id IN (108,109) GROUP BY cc.id ORDER BY notes ASC, cc.id ASC;
231 MIT scraped courses have no notes. 200 MIT scraped courses have notes.
That is exactly what I was thinking.
Done. According to the front page, only one course has no notes now. I see you also deleted another spam course that popped up.
Andrew and I agree this ticket is done, but we might continue the discussion about MIT notes on it.
I did this to clean out MIT OCW courses with no notes. Ugly nested subqueries, but it is fast enough and gets the job done. Might be worth an additional join so the tag IDs are not hard coded.
DELETE FROM courses_course
WHERE id IN
(SELECT id FROM
(SELECT cc.id, COUNT(nn.id) AS notes
FROM courses_course AS cc
INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
WHERE tt.tag_id IN (108,109) GROUP BY cc.id) AS subquery
WHERE notes = 0);
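Here is a sketch of the "additional join" idea, run against an in-memory SQLite database so it can be tried without touching the real one. The table and column names follow the queries above plus django-taggit's taggit_tag table. The tag name 'example-ocw-tag' is a made-up placeholder, since the real names behind tag IDs 108 and 109 aren't given in this thread. It also swaps the nested subquery for a HAVING clause, which is a different formulation from the query above but selects the same courses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courses_course (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE notes_note (id INTEGER PRIMARY KEY, course_id INTEGER);
    CREATE TABLE taggit_tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE taggit_taggeditem (tag_id INTEGER, object_id INTEGER);

    -- Placeholder data: one tag, three courses, one note.
    INSERT INTO taggit_tag VALUES (108, 'example-ocw-tag');
    INSERT INTO courses_course VALUES
        (1, 'scraped, no notes'),
        (2, 'scraped, has notes'),
        (3, 'not scraped');
    INSERT INTO taggit_taggeditem VALUES (108, 1), (108, 2);
    INSERT INTO notes_note VALUES (1, 2);
""")

# Tag names are looked up through taggit_tag instead of hard-coded IDs.
tag_names = ["example-ocw-tag"]
marks = ", ".join("?" for _ in tag_names)
conn.execute(
    "DELETE FROM courses_course WHERE id IN ("
    " SELECT cc.id FROM courses_course AS cc"
    " INNER JOIN taggit_taggeditem AS tti ON tti.object_id = cc.id"
    " INNER JOIN taggit_tag AS tg ON tg.id = tti.tag_id"
    " LEFT OUTER JOIN notes_note AS nn ON nn.course_id = cc.id"
    " WHERE tg.name IN ({0})"
    " GROUP BY cc.id"
    " HAVING COUNT(nn.id) = 0)".format(marks),
    tag_names,
)

# Only the tagged course without notes should be gone.
remaining = [row[0] for row in conn.execute(
    "SELECT id FROM courses_course ORDER BY id")]
```

The SQL inside the string should carry over to Postgres with the placeholders changed to match the driver.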
python script for counting notes per course in the OCW json file.
import sys
import json
# filename supplied as the first argument
filename = sys.argv[1]
# load the json structure from the supplied filename
with open(filename, 'r') as fd:
    fc = json.load(fd)
# prepare some structures
courses = fc['courses']
def num_links(obj):
    # return number of links, or 0 if the key is missing
    return len(obj.get('noteLinks', []))
# sum the notes for all courses
nnotes = sum(num_links(course) for course in courses)
# parenthesized print works on both Python 2 and Python 3
print("{0},{1}".format(nnotes, filename))
Run it something like so:
find ./ -name "*.json" -print0 | xargs -0 -i% python ../count.py % | sort -n
KarmaNotes is using CC-by on all pages.
Inherit OCW's CC-by-nc onto OCW pages for both course and note.
Possibly create a license table. There'd be two entries to start: index 0 = CC-by, 1 = CC-by-nc. Add a license FK to the course and note models.
Default = 0 for KarmaNotes.
Importing from OCW will explicitly set license to 1.
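A small sketch of that license plan, again using stdlib sqlite3 so it runs standalone. The schema is a hypothetical reduction (the real course and note models have many more fields and would be defined as Django models), and OCW_LICENSE_ID is a name invented here. It only shows the default of 0 applying to ordinary rows and the importer setting 1 explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE license (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO license VALUES (0, 'CC-by'), (1, 'CC-by-nc');

    CREATE TABLE course (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license (id)
    );
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        course_id INTEGER NOT NULL REFERENCES course (id),
        license_id INTEGER NOT NULL DEFAULT 0 REFERENCES license (id)
    );
""")

# Hypothetical constant the OCW importer would use.
OCW_LICENSE_ID = 1

# A regular KarmaNotes course picks up the default license (0).
conn.execute("INSERT INTO course (id, name) VALUES (1, 'Economics 10')")
# The OCW import sets the license explicitly on both course and note.
conn.execute(
    "INSERT INTO course (id, name, license_id) VALUES (2, ?, ?)",
    ("Introduction to Neuroscience", OCW_LICENSE_ID),
)
conn.execute(
    "INSERT INTO note (name, course_id, license_id) VALUES (?, 2, ?)",
    ("09_vision1.pdf", OCW_LICENSE_ID),
)

# Map each course name to its license name.
licenses = dict(conn.execute(
    "SELECT c.name, l.name FROM course AS c"
    " JOIN license AS l ON l.id = c.license_id"
))
```

One thing to watch if this becomes a Django model: auto-generated primary keys normally start at 1, so indexes 0 and 1 would have to be set explicitly or shifted.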