Issues reproducing Precision/Recall/F1/F2 on the i2b2 dataset

Hi,

Thank you for developing and releasing this package. I followed steps 0, 2a, 1b, and 1c using the PHI config file, and then step 2d with prod=True. To calculate the scores, and following my understanding of the paper, I split all PHI text at the word level and sanitized edge cases such as a trailing "," or "." at the end of a word (otherwise the scores are much lower). However, I was only able to achieve Precision 0.696, Recall 0.915, F1 0.791, and F2 0.861 on the test set, which is some way off the statistics reported for the i2b2 test set in the paper. I am most likely missing something, but I am unsure what it is.
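For reference, here is a minimal sketch of the word-level scoring I described. It is my own approximation, not Philter's evaluation code: the function names (`phi_words`, `f_beta`, `word_level_scores`) are hypothetical, and it matches words as a multiset without regard to their position in the note, which is a simplification.

```python
import string
from collections import Counter


def phi_words(phi_spans):
    """Split PHI spans into words and strip leading/trailing punctuation.

    Returns a Counter (multiset) of cleaned words. Token positions are
    ignored, which is a simplifying assumption of this sketch.
    """
    words = Counter()
    for span in phi_spans:
        for token in span.split():
            cleaned = token.strip(string.punctuation)
            if cleaned:  # drop tokens that were punctuation only
                words[cleaned] += 1
    return words


def f_beta(precision, recall, beta):
    """Standard F-beta score; beta=1 gives F1, beta=2 gives F2."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def word_level_scores(gold_spans, predicted_spans):
    """Compute word-level precision, recall, F1, and F2."""
    gold = phi_words(gold_spans)
    pred = phi_words(predicted_spans)
    tp = sum((gold & pred).values())  # words present in both
    fp = sum((pred - gold).values())  # predicted but not in gold
    fn = sum((gold - pred).values())  # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f_beta(precision, recall, 1),
        "f2": f_beta(precision, recall, 2),
    }
```

Plugging my precision and recall into `f_beta` gives 0.791 for beta=1 and 0.861 for beta=2, so the F-scores above are consistent with the standard formulas. The gap is more likely to come from how words are split and matched, or from the pipeline steps themselves.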

BCHSI / philter-ucsf

Issues reproducing Precision/Recall/F1/F2 on the i2b2 dataset #11