databio / bedms

tool for standardization of genomics/epigenomics metadata
BSD 2-Clause "Simplified" License

`attr_standardizer` shouldn't load from Hugging Face for each call #8

Closed nleroy917 closed 3 months ago

nleroy917 commented 4 months ago

I was following the README's instructions:

from attribute_standardizer.attribute_standardizer import attr_standardizer

attr_standardizer(pep="/path/to/pep", schema="ENCODE")

It looks like this function calls load_from_huggingface for each PEP you want to standardize. Hugging Face does cache the model files on disk, but reloading the model from disk for every PEP is still unnecessary.

I would really recommend creating some class that holds the model in memory and can be called many times over for many PEPs:

from bedmess import AttrStandardizer

model = AttrStandardizer("databio/encode-bm")

model.standardize(path_to_pep)
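
To make the proposal concrete, here is a minimal sketch of such a class. Everything in it is an assumption for illustration: `fake_load_model` is a hypothetical stand-in for the Hugging Face loading step, and the `standardize` body is a placeholder, not the actual bedms prediction logic.

```python
from typing import Callable, Dict


def fake_load_model(repo_id: str) -> Dict[str, str]:
    """Hypothetical stand-in for the expensive Hugging Face load step."""
    fake_load_model.calls += 1  # count loads so the behaviour is observable
    return {"repo_id": repo_id}


fake_load_model.calls = 0


class AttrStandardizer:
    """Holds the model in memory so it is loaded once, not once per PEP."""

    def __init__(
        self,
        repo_id: str,
        loader: Callable[[str], Dict[str, str]] = fake_load_model,
    ) -> None:
        # The expensive load happens exactly once, at construction time.
        self.model = loader(repo_id)

    def standardize(self, pep_path: str) -> Dict[str, str]:
        # Placeholder: real prediction logic would use self.model here.
        return {"pep": pep_path, "model": self.model["repo_id"]}


model = AttrStandardizer("databio/encode-bm")
results = [model.standardize(p) for p in ["pep_a", "pep_b", "pep_c"]]
```

Three PEPs are standardized here, but the loader runs only once because the model lives on the instance.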

nleroy917 commented 4 months ago

This will rear its head for the first user of the standardizer on a public server, since the model is downloaded not when the server starts up but when the user requests the standardization pipeline.
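
One way to avoid that, sketched under assumptions: load the model in a startup hook and have request handlers reuse the cached object. `get_model`, `on_startup`, and `handle_request` are hypothetical names, and the loader body only simulates the download. No particular web framework is implied.

```python
from functools import lru_cache
from typing import Dict, List

load_log: List[str] = []  # records each (simulated) model load


@lru_cache(maxsize=None)
def get_model(repo_id: str) -> Dict[str, str]:
    """Hypothetical stand-in for the download/load step; cached per repo_id."""
    load_log.append(repo_id)
    return {"repo_id": repo_id}


def on_startup() -> None:
    # Warm the cache when the server process starts, before any request.
    get_model("databio/encode-bm")


def handle_request(pep_path: str) -> Dict[str, str]:
    # Requests reuse the already-loaded model; no load happens here.
    model = get_model("databio/encode-bm")
    return {"pep": pep_path, "model": model["repo_id"]}


on_startup()
responses = [handle_request("pep_a"), handle_request("pep_b")]
```

With this arrangement the cost is paid once at startup, so the first user does not wait on the download.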

saanikat commented 3 months ago

Solved with new PR #15