Update script to compute precision and recall

With the latest corrected test labels, most scores go down slightly:

           Precision    Recall        F1
bender      0.841668  0.827917  0.834736
carrerasa   0.832494  0.851714  0.841994
carrerasb   0.846577  0.819215  0.832671
chieu       0.873324  0.879062  0.876184
curran      0.837620  0.850115  0.843822
demeulder   0.746768  0.769313  0.757873
florian     0.883717  0.879950  0.881830
hammerton   0.674948  0.521044  0.588094
hendrickx   0.754771  0.793642  0.773719
klein       0.859314  0.863435  0.861369
mayfield    0.835421  0.841058  0.838230
mccallum    0.839821  0.831469  0.835624
munro       0.800307  0.832712  0.816188
whitelaw    0.804489  0.770201  0.786972
wu          0.811240  0.807494  0.809363
zhang       0.857426  0.845853  0.851600
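For anyone double-checking the numbers, here is a minimal sketch of how the three columns relate. It assumes entity-level exact-match scoring (a predicted span counts only if its position and type both match a gold span), which is the usual CoNLL-2003 convention but is not confirmed here as what the script does. The function names and the `(sentence, start, end, type)` tuple layout are hypothetical, chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match.

    Both arguments are iterables of hashable span descriptions, e.g.
    (sentence_index, start_token, end_token, entity_type). This layout
    is an assumption for illustration, not the script's actual format.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy data: 4 gold entities, 3 predictions, 2 of them exactly right.
gold = {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 1, 3, "ORG"), (2, 0, 1, "MISC")}
predicted = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 1, 3, "ORG")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.6f} recall={r:.6f} f1={f:.6f}")
```

As a consistency check, `f1_score(0.841668, 0.827917)` reproduces the F1 reported for bender to within rounding, so the F1 column is the harmonic mean of the other two.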

@kmh4321, if you are able to, could you double-check this? Then I'll update stats.tex for the paper.

CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003, pull request #10