Slim installation option or model inference export option

EducationalTestingService / rsmtool

A Python package to facilitate research on building and evaluating automated scoring models.

https://rsmtool.readthedocs.io

Apache License 2.0


Slim installation option or model inference export option #689

Open twoertwein opened 5 days ago

twoertwein commented 5 days ago

rsmtool has many large dependencies, and many of them are not needed at inference time.

It would be nice to either:

1. Use pip-compatible optional extras, for example pip install "rsmtool[full]", so that the full set of dependencies is installed only when needed.
2. Add a method to export an already trained model so that it can be run without requiring rsmtool (for example, as a scikit-learn pipeline). Ideally, the export would allow specifying all the options that fast_predict supports.
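
To make the first option concrete, here is a minimal sketch of how optional dependencies are commonly guarded in code once they move into an extra. The helper name require_optional and the extra name "full" are hypothetical (the latter is taken from the pip example above); nothing here is existing rsmtool code.

```python
import importlib

# Hypothetical extra name, borrowed from the 'rsmtool[full]' example above.
EXTRA_NAME = "full"


def require_optional(module_name, extra=EXTRA_NAME):
    """Import an optional dependency, or raise an error explaining how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # Point the user at the extra instead of failing with a bare ImportError.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "rsmtool[{extra}]"'
        ) from error


# A module that is always available imports normally.
json_module = require_optional("json")
```

With a guard like this, code paths that need a heavy dependency would call require_optional at the point of use, so a slim install only fails when such a feature is actually requested.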

desilinguist commented 5 days ago

I think (2) is a good idea. To start with, since RSMTool models are SKLL models, we should be able to easily set the pipeline attribute, and then you would only need SKLL on the inference side. You'd still have more dependencies than with plain scikit-learn, but SKLL adds far fewer on top of scikit-learn than RSMTool does. Of course, this would require extra disk space as well.
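
As a rough illustration of what the inference side could look like if a trained model were available as a scikit-learn pipeline, here is a sketch that uses only scikit-learn and joblib. It is a stand-in, not RSMTool or SKLL code: the pipeline steps, feature names, training data, and the choice of joblib for persistence are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: one feature dictionary per response, plus a human score.
train_features = [
    {"length": 120.0, "grammar": 0.8},
    {"length": 250.0, "grammar": 0.6},
    {"length": 310.0, "grammar": 0.9},
    {"length": 90.0, "grammar": 0.4},
]
train_scores = [2.0, 3.0, 4.0, 1.0]

# Assumed pipeline layout: vectorize feature dicts, scale, then predict.
pipeline = Pipeline(
    [
        ("vectorizer", DictVectorizer(sparse=False)),
        ("scaler", StandardScaler()),
        ("estimator", LinearRegression()),
    ]
)
pipeline.fit(train_features, train_scores)

# "Export" the fitted pipeline and load it back, as an inference-only
# environment would do with just scikit-learn and joblib installed.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

new_response = [{"length": 200.0, "grammar": 0.7}]
original_prediction = pipeline.predict(new_response)
restored_prediction = restored.predict(new_response)
```

Any RSMTool-specific pre- and post-processing (such as the options that fast_predict accepts) is not covered by this sketch and would still have to be exported or reimplemented separately.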

twoertwein commented 5 days ago

It might also be nice to remove pyarrow, openpyxl, xlrd, and xlwt from requirements.txt. These libraries are never called directly in rsmtool. If a user wants pandas to read parquet files, they should install pandas's optional dependencies (fastparquet/pyarrow).

desilinguist commented 5 days ago

I believe openpyxl and xlrd are needed for Excel support in Pandas or at least used to be? pyarrow was added because pandas is going to make it required starting with 3.0.

twoertwein commented 5 days ago

> I believe openpyxl and xlrd are needed for Excel support in Pandas or at least used to be?

Yes, these are optional dependencies of pandas needed to read/write Excel files. Personally, I think users are responsible for installing them; not even pandas installs them by default.

> pyarrow was added because pandas is going to make it required starting with 3.0.

I believe that was reverted :) (and fastparquet is much smaller and available on more architectures)

desilinguist commented 5 days ago

I think the reason for pre-installing those libraries was that RSMTool was pitched as a fully self-contained solution that works out of the box, and that Excel spreadsheets were the main input files at ETS. But if that's no longer the case, then removing them is probably fine.

Good to know about pyarrow - we should definitely remove that then.