TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Reduce DrugChemical for load into SAPBERT #330

Closed by gaurav 2 weeks ago

gaurav commented 3 months ago

This PR produces a smaller DrugChemical file (DrugChemicalConflatedSmaller.txt.gz) so that we can load it into SAPBERT. It includes three changes:

  1. I've tweaked the order of the prefix boosts so that we pick slightly better names.
  2. A new demote_labels_longer_than config setting (currently set to 15) filters out any label longer than that length, provided at least one label of that length or shorter is available.
  3. We now generate a SAPBERT training file called DrugChemicalConflatedSmaller.txt.gz, which only includes cliques from DrugChemicalConflated whose preferred label is shorter than DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH.
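
The label demotion in change 2 can be sketched roughly as follows. This is an illustration of the rule as described, not the code in this PR: the function name `demote_long_labels` and its signature are hypothetical, and it assumes length is measured in characters.

```python
def demote_long_labels(labels, max_length=15):
    """Drop labels longer than max_length, but only if a shorter one exists.

    Hypothetical helper: mirrors the described behaviour of the
    demote_labels_longer_than setting, not Babel's actual implementation.
    """
    # Labels at or under the limit are always kept.
    short_enough = [label for label in labels if len(label) <= max_length]
    # If nothing fits, fall back to the full list so a name is still available.
    return short_enough if short_enough else list(labels)


candidates = ["acetylsalicylic acid", "aspirin"]
print(demote_long_labels(candidates))
```

Falling back to the full list when no label fits means a clique is never left without a name.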

DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH can be used to control how big DrugChemicalConflatedSmaller.txt is:

| DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH | Training rows | Unique CURIEs |
| --- | --- | --- |
| 50 | 25,187,771 | 19,835,134 |
| 40 | 15,450,212 | 10,571,449 |
| 30 | 9,620,711 | 6,056,510 |
| 15 | 4,803,665 | 3,808,855 |

DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH=30 seems like a reasonable setting right now.
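
For illustration, the clique filter behind DrugChemicalConflatedSmaller.txt.gz might look something like the sketch below. The clique representation (a dict with `curie` and `preferred_label` keys) and the function name `keep_smaller_cliques` are assumptions made for this example, not Babel's actual data model. The strict `<` follows the "shorter than" wording above.

```python
# Assumed value, matching the setting suggested above.
DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH = 30


def keep_smaller_cliques(cliques, max_label_length=DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH):
    """Yield only cliques whose preferred label is shorter than the limit.

    Hypothetical sketch: each clique is assumed to be a dict with a
    'preferred_label' string; the real file format may differ.
    """
    for clique in cliques:
        if len(clique["preferred_label"]) < max_label_length:
            yield clique


example_cliques = [
    {"curie": "EXAMPLE:1", "preferred_label": "aspirin"},
    {"curie": "EXAMPLE:2", "preferred_label": "x" * 45},
]
kept = list(keep_smaller_cliques(example_cliques))
print([clique["curie"] for clique in kept])
```

Lowering the limit shrinks the output, which matches the trend in the row counts reported above.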

Closes #313.