FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0
7 stars 6 forks source link

VM is getting duplicate schools #252

Closed btbonval closed 10 years ago

btbonval commented 10 years ago

In order to input MIT notes from #68 , we need to find MIT.

On my VM, I found two instances of MIT which differed only by their internal id.

  id   |                 name                  |                 slug                  |   location    |     url      | facebook_id | usde_id | file_count | priority | alias 
-------+---------------------------------------+---------------------------------------+---------------+--------------+-------------+---------+------------+----------+-------
 10464 | Massachusetts Institute of Technology | massachusetts-institute-of-technology | Cambridge, MA | web.mit.edu/ |             |  121415 |          2 | f        | 
  1839 | Massachusetts Institute of Technology | massachusetts-institute-of-technology | Cambridge, MA | web.mit.edu/ |             |  121415 |          0 | f        | 
(2 rows)

I did not find a repeat of MIT on Beta, nor any repeated school:

karmaworld=# SELECT count(*) FROM courses_school;
 count 
-------
  7001
(1 row)

karmaworld=# SELECT count(distinct(usde_id)) FROM courses_school;
 count 
-------
  6999
(1 row)

Turns out the difference was due to No School and Harvard Business School both having NULL usde_id, where NULL entries are left out of distinct entirely.

I'm not sure how this happened. We want to ensure that the school import code will not import schools that already exist, and possibly we should have the sanitize search for dupes. The problem with dupes is if both are being used as a FK. If none are being used, it is arbitrary which is removed and if one is being used, that is the one to keep.

btbonval commented 10 years ago

Actually, it might be way easier to just toss on a unique constraint on slug and usde_id. Whatever crashes from the unique constraint clearly causes the problem, no need for guesswork.

btbonval commented 10 years ago

For now, going to delete the MIT dupe by hand to proceed with #68. There are plenty of other dupes on the VM to toy with if desired:

karmaworld=> SELECT count(*) FROM courses_school;
 count 
-------
 14427
(1 row)

karmaworld=> SELECT count(distinct(usde_id)) FROM courses_school;
 count 
-------
  7484
(1 row)

I'm guessing the import schools thing got run twice or something and doesn't check for dupes. The VM is working with a newer set of schools from US DE, which is probably why it has 7484 unique schools while Beta has only 7001.

charlesconnell commented 10 years ago

No more VM now, so can we close this?

btbonval commented 10 years ago

There will be a new VM, but this ticket might not suffer the same problem. I'm concerned that the problem existed at all and what its implications have for even Heroku, but let's open a ticket when the problem expresses itself in a novel environment.