NESCent / FossilCalibrations

Fossil calibrations database
http://fossilcalibrations.org
BSD 2-Clause "Simplified" License
14 stars 4 forks source link

Database character sets are inconsistent #52

Closed dleehr closed 9 years ago

dleehr commented 9 years ago

While doing some API work I noticed that some tables use latin1, but others are utf8. Views that join the two result in mangled character encodings (at least in my queries)

$ grep -i "charset=latin" FossilCalibration_skel.sql 
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=48 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=18 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=396 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=53 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=176 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=104 DEFAULT CHARSET=latin1;
) ENGINE=MyISAM AUTO_INCREMENT=89 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=112 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=363 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB AUTO_INCREMENT=164 DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=MyISAM AUTO_INCREMENT=30000 DEFAULT CHARSET=latin1;
$ grep -i "charset=utf" FossilCalibration_skel.sql 
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB AUTO_INCREMENT=1521 DEFAULT CHARSET=utf8;
) ENGINE=InnoDB AUTO_INCREMENT=476 DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
hlapp commented 9 years ago

They should all use UTF8. Otherwise it's just too easy to run into all kinds of character encoding conversion issues.

jimallman commented 9 years ago

Agreed. The latin1 encodings (and matching collation settings) are legacy stuff from @pdpolly 's early work on the database. I'll see what I can do about moving everything over to utf8 without mangling the existing text.

jimallman commented 9 years ago

As it turns out, managing character sets and collation in MySQL is a nightmare. But now I think I have the database, servers, and HTML metatags all in agreement about using UTF-8. Your grep test above comes up empty in the latest data dump (note the new filename):

# grep -i "charset=latin" FossilCalibration_full_UTF8.sql

If you instead grep for latin, you'll see lots of chatter, but I believe these are no-ops that will not affect the data on import.

If the new database works well for you, let me know and I'll generate the skel and other versions in the same way.

jimallman commented 9 years ago

If the new database works well for you, let me know and I'll generate the skel and other versions in the same way.

Update: I've successfully restored the previous dump on our new dev server. I did need to reset the database's default collation to keep everything consistent:

ALTER DATABASE FossilCalibration CHARACTER SET utf8 COLLATE utf8_general_ci ;

The next dump will have this preset. We're still making minor changes to the data model, so I'll wait on the next skeleton database until things have settled down.

kcranston commented 9 years ago

Is this completed, then?

jimallman commented 9 years ago

I believe so, I just try not to close issues that I didn't create. @dleehr, are you still seeing any mangled characters?

dleehr commented 9 years ago

Yeah this is good now