Open saanikat opened 2 months ago
Added TrainStandardizer for training custom models by the user.
The present README.md doesn't have the details; documentation will be added to the bedbase docs.
Available schemas for standardization have been moved to the HuggingFace repository: https://huggingface.co/databio/attribute-standardizer-model6
This means BEDMS no longer has to be updated each time we add a new schema.
README.md has been updated with the new function calls.
Earlier, we would instantiate it like this:
from bedms import AttrStandardizer
model = AttrStandardizer("ENCODE")
BEDMS had mapped the schema name ENCODE to the model and its associated configuration and files. Similarly, BEDBASE and FAIRTRACKS were associated with their respective files. However, this would have required us to update the package each time we added a new schema.
Now, we need to provide the schema model and its associated configuration to BEDMS via HuggingFace. In the HuggingFace repository, each schema has its own directory, and each time a schema is added, a new schema directory is added to HuggingFace (details on adding a new schema are provided there). The instantiation now looks like this:
from bedms import AttrStandardizer
model = AttrStandardizer(
    repo_id="databio/attribute-standardizer-model6", model_name="encode"
)
This also makes it easier for users to provide their own chosen schemas (as long as they have models in their HuggingFace repository).
Is it worth updating AVAILABLE_SCHEMAS (from bedms.const import AVAILABLE_SCHEMAS) to return a dictionary that includes the repo_id value for the 3 schemas we provide? Or should we just hardcode the 3 repo IDs from PEPhub?
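To make the dictionary option concrete, here is a minimal sketch of what such a mapping could look like. It is not the actual contents of bedms.const: the helper name standardizer_kwargs is hypothetical, and it assumes that all three provided schemas live in the same HuggingFace repository and that the "fairtracks" and "bedbase" directory names follow the lowercase "encode" pattern shown in the description.

```python
# Hypothetical sketch; not the actual contents of bedms.const.
DEFAULT_REPO_ID = "databio/attribute-standardizer-model6"

# Assumption: all three provided schemas share one repository, and the
# "fairtracks" and "bedbase" directory names mirror the lowercase "encode".
AVAILABLE_SCHEMAS = {
    name: {"repo_id": DEFAULT_REPO_ID, "model_name": name}
    for name in ("encode", "fairtracks", "bedbase")
}


def standardizer_kwargs(schema: str) -> dict:
    """Return AttrStandardizer keyword arguments for a provided schema name."""
    key = schema.lower()  # accept "ENCODE" as well as "encode"
    if key not in AVAILABLE_SCHEMAS:
        known = ", ".join(sorted(AVAILABLE_SCHEMAS))
        raise ValueError(f"Unknown schema {schema!r}; provided schemas: {known}")
    # Return a copy so callers cannot mutate the module-level constant.
    return dict(AVAILABLE_SCHEMAS[key])
```

With something like this, a caller such as PEPhub could write AttrStandardizer(**standardizer_kwargs("ENCODE")) instead of hardcoding the repo IDs itself, and users with their own repositories would still pass repo_id and model_name directly.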
Issue #23 solved.