This PR produces a DrugChemicalSmaller file so that we can load it into SAPBERT. It includes three changes:
1. I've tweaked the order of the prefix boosts so that we pick slightly better names.
2. A new demote_labels_longer_than config setting (currently set to 15) filters out any name longer than that length, as long as at least one label of that length or shorter is available.
3. We now generate a SAPBERT training file called DrugChemicalConflatedSmaller.txt.gz, which only includes cliques from DrugChemicalConflated whose preferred label is shorter than DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH.
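As a rough illustration of the demote_labels_longer_than rule described above, here is a minimal sketch. It is not the code in this PR: the function name `pick_preferred_label` is hypothetical, and it assumes the labels arrive already ordered by prefix boost (best first) and that length is measured in characters.

```python
def pick_preferred_label(labels, demote_labels_longer_than=15):
    """Pick a preferred label, skipping over-long labels when possible.

    Hypothetical helper: `labels` is assumed to be ordered best-first
    (e.g. by prefix boost). Length is assumed to mean character count.
    """
    # Keep only labels at or under the limit.
    short_enough = [
        label for label in labels if len(label) <= demote_labels_longer_than
    ]
    # If every label is too long, fall back to the full list so the
    # clique still gets a preferred label.
    candidates = short_enough or list(labels)
    return candidates[0] if candidates else None
```

With this sketch, a long systematic name listed first would lose to a shorter synonym, but would still be chosen if no label fits under the limit.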
DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH can be used to control how big DrugChemicalConflatedSmaller.txt is:
| DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH | Training rows | Unique CURIEs |
|---|---|---|
| 50 | 25,187,771 | 19,835,134 |
| 40 | 15,450,212 | 10,571,449 |
| 30 | 9,620,711 | 6,056,510 |
| 15 | 4,803,665 | 3,808,855 |
DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH=30 seems like a reasonable setting right now.
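For reviewers who want a picture of the clique filter, a minimal sketch follows. It is an assumption-laden illustration rather than the actual implementation: the function name `filter_cliques`, the dict-shaped clique, and the `preferred_label` key are all hypothetical, and the strict `<` comparison follows the "shorter than" wording above.

```python
# Value suggested in this PR description.
DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH = 30


def filter_cliques(cliques, max_label_length=DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH):
    """Yield only cliques whose preferred label is shorter than the limit.

    Hypothetical sketch: each clique is assumed to be a dict with a
    "preferred_label" key; the real clique structure may differ.
    """
    for clique in cliques:
        label = clique.get("preferred_label")
        # Cliques without a preferred label are skipped in this sketch.
        if label is not None and len(label) < max_label_length:
            yield clique
```

Lowering the limit shrinks the output, which matches the trend in the table: fewer training rows and fewer unique CURIEs as the maximum label length drops.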
Closes #313.