clulab / reach

Reach Biomedical Information Extraction

UAZ IDs could be more consistent #725

Closed: kwalcock closed 3 years ago

kwalcock commented 3 years ago

These IDs are handed out on a first come, first served basis. That works well when only one file is processed, but when multiple files are processed in parallel, the order of assignment can change, so two separate runs can produce different output. If that matters, the 5-digit numbers (is that enough?) could instead be generated by taking the hash code of the string % 100,000 and then controlling for collisions. Controlling may mean keeping track of which string corresponds to each of the 100,000 slots (or, if that is done elsewhere, just that a slot is in use), but it may be worth it. The IDs would then also be more consistent across runs.
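
A minimal sketch of the hash-and-probe idea, written in Python for brevity even though Reach itself is Scala. Everything named here (`java_string_hash`, `SlotAllocator`, `format_id`, the `UAZ` plus five digits layout) is hypothetical and not taken from the Reach code base. Linear probing is only one way to control for collisions, and it is only partly deterministic: when two strings collide, whichever arrives first still gets the home slot.

```python
import threading


def java_string_hash(text):
    """Emulate Java's String.hashCode: a base-31 polynomial over UTF-16 code units."""
    data = text.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # wrap to 32 bits as Java does
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed


class SlotAllocator:
    """Hypothetical allocator: hash % slots, with linear probing on collision."""

    def __init__(self, slots=100_000):
        self.slots = slots
        self.slot_to_key = {}  # which string owns each used slot
        self.key_to_slot = {}  # reverse lookup so repeated keys get the same slot
        self.lock = threading.Lock()  # callers may run in parallel

    def slot_for(self, key):
        with self.lock:
            if key in self.key_to_slot:
                return self.key_to_slot[key]
            if len(self.slot_to_key) >= self.slots:
                raise OverflowError("all slots are in use")
            # Python's % is non-negative for a positive modulus;
            # Java or Scala would need Math.floorMod for negative hash codes.
            slot = java_string_hash(key) % self.slots
            while slot in self.slot_to_key:
                slot = (slot + 1) % self.slots  # probe the next slot
            self.slot_to_key[slot] = key
            self.key_to_slot[key] = slot
            return slot


def format_id(slot):
    """Assumed ID layout: the prefix UAZ followed by five zero-padded digits."""
    return f"UAZ{slot:05d}"
```

With this sketch, `format_id(SlotAllocator().slot_for("hello"))` gives `UAZ62322` regardless of which thread or file asks first, as long as no other string has already claimed that slot.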

In some parts of the code, it seems that some IDs are based on text. One could encode that text as a byte stream, convert it to hex, and put the result into the identifier. The IDs might be of different lengths, but they would be unique and consistent and would never overflow the 100,000 slots. That way each of my test runs would produce the same result, even with multiple files and multiple threads.
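
The hex variant could look roughly like the sketch below, again in Python. The function names, the `UAZ` prefix, and the choice of UTF-8 are assumptions made for illustration, not what Reach does. Because the encoding is reversible, two different strings can never share an ID, and no shared state or locking is needed.

```python
PREFIX = "UAZ"  # assumed prefix, for illustration only


def hex_id(text):
    """Build an ID from the UTF-8 bytes of the text, rendered as lowercase hex."""
    return PREFIX + text.encode("utf-8").hex()


def text_from_hex_id(identifier):
    """Invert hex_id, which shows that distinct texts always get distinct IDs."""
    if not identifier.startswith(PREFIX):
        raise ValueError("not a hex-style ID")
    return bytes.fromhex(identifier[len(PREFIX):]).decode("utf-8")
```

For example, `hex_id("p53")` yields `UAZ703533`, and `text_from_hex_id("UAZ703533")` recovers `p53`. The cost is that the ID length grows with the text: two hex digits per byte.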

MihaiSurdeanu commented 3 years ago

Nice. I don't think this is very important for downstream users. But I agree this would be nice to have for reproducibility.

kwalcock commented 3 years ago

Fixed with #733.