clulab / reach

Reach Biomedical Information Extraction

Fix some problems with assembly output and UAZ IDs #733

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago

These changes make the output of ReachCLI more consistent, in part so that I can tell more easily with a regression test whether anything broke, and so that people can reproduce our results.

The one questionable change is the reformatting of the IDs. Previously we had IDs like UAZ12345. The number depended on the order in which things were found and therefore on the run (the order in which files are retrieved from the OS, differences in thread counts, timing, etc.). This has been changed to UAZ(XX)*, for example UAZ34650C. The digits are just the hex values of the characters in the text that needs an ID, so the IDs will differ in length, which is perhaps a disadvantage. However, a given text maps to the same ID no matter when the files are run, which brings consistency and makes it possible to merge the output of different runs. One can tell which encoding scheme is in use because the old one has 5 digits and the new one always has an even number of digits, never 5. There are lots of other encoding schemes, some more compact, but this one largely preserves the idea of a number at the end, so that old and new IDs look similar.
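As a rough illustration of the scheme described above (not the actual Reach code), the following Python sketch builds an ID from the hex values of the text and tells the two formats apart by their digits. The names `uaz_id` and `id_scheme` are hypothetical, and encoding the text as UTF-8 bytes is an assumption made here so that every unit yields exactly two hex digits.

```python
import re

# Old scheme: "UAZ" followed by exactly five decimal digits.
LEGACY_ID = re.compile(r"UAZ\d{5}")
# New scheme: "UAZ" followed by pairs of hex digits, i.e. UAZ(XX)*.
HEX_ID = re.compile(r"UAZ(?:[0-9A-F]{2})*")


def uaz_id(text: str) -> str:
    """Build a run-independent ID from the text itself.

    Assumption: the text is encoded as UTF-8 bytes, so each byte
    contributes exactly two uppercase hex digits.
    """
    return "UAZ" + text.encode("utf-8").hex().upper()


def id_scheme(uaz: str) -> str:
    """Classify an ID as 'legacy', 'hex', or 'unknown'."""
    if LEGACY_ID.fullmatch(uaz):
        return "legacy"
    if HEX_ID.fullmatch(uaz):
        return "hex"
    return "unknown"
```

For example, `uaz_id("p53")` gives `"UAZ703533"` regardless of run order, while `id_scheme("UAZ12345")` is `"legacy"` and `id_scheme("UAZ34650C")` is `"hex"`. Because a new-style ID always has an even number of digits, it can never be mistaken for a five-digit legacy ID.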

kwalcock commented 3 years ago

Others may need to weigh in on the change of IDs.

MihaiSurdeanu commented 3 years ago

Please merge.