Open twoertwein opened 5 days ago
I think (2) is a good idea. To start with, since RSMTool models are SKLL models, we should be able to easily set the `pipeline` attribute and then you would just need SKLL for the inference side. You'd still have more dependencies, but SKLL has far fewer extra ones over scikit-learn compared to RSMTool. Of course, this would require extra disk space as well.
It might also be nice to remove `pyarrow`, `openpyxl`, `xlrd`, and `xlwt` from requirements.txt. These libraries are never directly called in rsmtools. If a user wants to let pandas read parquet files, they should install pandas's optional dependencies (fastparquet/pyarrow).
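To make the optional-dependency point concrete, here is a minimal sketch (not RSMTool code) of how one might check which of the packages mentioned above are importable in the current environment. The helper name `available_engines` is hypothetical; the only real API used is `importlib.util.find_spec` from the standard library.

```python
from importlib.util import find_spec

# Optional packages discussed in this thread; pandas uses them as I/O engines.
OPTIONAL_ENGINES = ("pyarrow", "fastparquet", "openpyxl", "xlrd", "xlwt")


def available_engines(names=OPTIONAL_ENGINES):
    """Map each top-level package name to True if it can be imported.

    `available_engines` is a hypothetical helper for illustration only.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_engines().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Running this before and after trimming requirements.txt would show which engines a given environment still provides, without importing any of them.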
I believe `openpyxl` and `xlrd` are needed for Excel support in Pandas or at least used to be? `pyarrow` was added because `pandas` is going to make it required starting with 3.0.
> I believe `openpyxl` and `xlrd` are needed for Excel support in Pandas or at least used to be?
Yes, these are optional dependencies of pandas needed to read/write Excel files. Personally, I think users are responsible for installing them - not even pandas installs them by default.
> `pyarrow` was added because `pandas` is going to make it required starting with 3.0.
I believe that was reverted :) (and fastparquet is much smaller and available on more architectures)
I think the reason for pre-installing those libraries was because RSMTool was pitched as a fully-self-contained solution that works out of the box and because Excel spreadsheets were the main input files at ETS. But if that's no longer the case, then that's probably fine.
Good to know about `pyarrow` - we should definitely remove that then.
rsmtools has many large dependencies. Many of them are not needed at inference time.
It would be nice to either:
`pip install "rsmtools[full]"` to install all dependencies only when needed
`fast_predict` supports.
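One common way to support an install where heavy dependencies are optional is to import them lazily and fail with a message that names the extra to install. The sketch below assumes the `rsmtools[full]` extra proposed in this issue (it is a proposal here, not an existing extra), and `require_optional` is a hypothetical helper name, not part of RSMTool.

```python
import importlib


def require_optional(module_name, extra="full"):
    """Import an optional module, or raise an error naming the extra to install.

    Hypothetical helper: the "full" extra is the one proposed in this issue.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is an optional dependency; "
            f'install it with: pip install "rsmtools[{extra}]"'
        ) from exc
```

Code paths that only run at training or reporting time could call `require_optional(...)` at the point of use, so an inference-only install never needs those packages.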