databio / bedms

tool for standardization of genomics/epigenomics metadata
BSD 2-Clause "Simplified" License

training module #25

Open saanikat opened 2 months ago

saanikat commented 2 months ago

Issue #23 solved.

saanikat commented 1 month ago

Added TrainStandardizer so that users can train custom models. The current README.md does not cover it yet; documentation will be added to the bedbase docs.

saanikat commented 1 month ago

Separation of schemas from BEDMS

Available schemas for standardization have been moved to a HuggingFace repository (https://huggingface.co/databio/attribute-standardizer-model6). This means BEDMS no longer has to be updated each time we add a new schema. README.md has been updated with the new function calls. Earlier we would instantiate it like this:

from bedms import AttrStandardizer

model = AttrStandardizer("ENCODE")

BEDMS mapped the schema name ENCODE to the model and its associated configuration and files. Similarly, BEDBASE and FAIRTRACKS were mapped to their respective files. However, this meant updating the package each time we added a new schema. Now the schema model and its associated configuration are provided to BEDMS via HuggingFace. In the HuggingFace repository, each schema has its own directory, and adding a schema means adding a new directory there (details on adding a new schema are provided in the repository). The instantiation now looks like this:

from bedms import AttrStandardizer

model = AttrStandardizer(
    repo_id="databio/attribute-standardizer-model6", model_name="encode"
)

This also makes it easier for users to provide their chosen schemas (as long as they have models in their own HuggingFace repository).

sanghoonio commented 1 month ago

Is it worth updating

from bedms.const import AVAILABLE_SCHEMAS

to return a dictionary that includes the repo_id value for the 3 schemas we provide?

Or should we just hardcode the 3 repo IDs from PEPhub?
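
To make the dictionary option concrete, here is a minimal sketch of what it could look like. It is not the actual contents of bedms.const. The helper schema_kwargs is hypothetical, and the "fairtracks" and "bedbase" model names, as well as the idea that all three schemas live in the same repository as "encode", are assumptions based on the description above.

```python
# Hypothetical sketch; not the real bedms.const module.

# Repository named earlier in this thread.
DEFAULT_REPO_ID = "databio/attribute-standardizer-model6"

# Assumption: all three provided schemas sit in the same repository, one
# directory each, and their model names are the lowercase schema names.
# Only "encode" is confirmed by the instantiation example above.
AVAILABLE_SCHEMAS = {
    "encode": {"repo_id": DEFAULT_REPO_ID, "model_name": "encode"},
    "fairtracks": {"repo_id": DEFAULT_REPO_ID, "model_name": "fairtracks"},
    "bedbase": {"repo_id": DEFAULT_REPO_ID, "model_name": "bedbase"},
}


def schema_kwargs(schema_name: str) -> dict:
    """Return keyword arguments for a provided schema (hypothetical helper).

    The lookup is case-insensitive, so the old-style name "ENCODE" resolves too.
    """
    key = schema_name.lower()
    if key not in AVAILABLE_SCHEMAS:
        known = ", ".join(sorted(AVAILABLE_SCHEMAS))
        raise ValueError(f"Unknown schema {schema_name!r}; expected one of: {known}")
    # Return a copy so callers cannot mutate the shared constant.
    return dict(AVAILABLE_SCHEMAS[key])
```

With something like this in place, a caller could write AttrStandardizer(**schema_kwargs("ENCODE")) and keep the old schema names working, while users with their own repository would still pass repo_id and model_name directly.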