feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Feature Request: InfluxDB ReadOnly On/Offline, use Waitress instead of Gunicorn, and integrate Feast.client into repo #3973

Closed SeanWidesense closed 3 months ago

SeanWidesense commented 5 months ago

Hello!

Couple of thoughts (simple implementations) that would be fantastic improvements for Feast:

1) InfluxDB: You've integrated with TimescaleDB (a.k.a. Postgres) and Redis. Why not integrate with a simple read-only provider for InfluxDB? I believe the InfluxDB client is as simple as Redis's, although InfluxDB is less of a "table-like" database, so I don't believe deletes/updates would be necessary (it might be nice to ingest from Influx and store it in a DB like Redshift/Postgres).

2) Waitress option for WSGI as an alternative to Gunicorn: I installed Feast on my AWS EC2 instance, no problem! Then I went to use it on my Windows dev machine, and voila, my entire project came to a big halt: "fcntl not found." Gunicorn relies on the fcntl package for file locking, but fcntl is not available on Windows except possibly via WSL (Windows Subsystem for Linux). If you want wide-scale adoption, remember that developers are on PCs/Macs/Linux; choosing a Linux-specific server seems counterproductive to the adoption of Feast. I realize I may get some comments from Windows-haters, but the fact is, the more successful Feast is, the more adoption it will have, and it will require support on all platforms. An alternate solution would be a "feast[client]" extra that doesn't install the gunicorn package.

3) Feast client integration: At the time I'm writing this, it doesn't appear that Feast's client code is in the codebase (https://api.docs.feast.dev/python/_modules/feast/client). It would be great for developers to be able to access Feast as they do with MLflow:

    import mlflow

    remote_server_uri = "http://192.1.6.299:5000"
    mlflow.set_tracking_uri(remote_server_uri)

    if save_model:
        with mlflow.start_run(run_name="test_run"):
            mlflow.log_artifact("jupyter_notebook_2024_02_23_MLFLOW.py")
            mlflow.log_artifact("cleaned_parquet_file_166")

Look at how simple it is to integrate MLFlow into my workflow!!! I believe this is very attainable for Feast, and kind of necessary in a client/server world.

Finally, THANK YOU for the upgrade support for Pandas 2.2.x @sudohainguyen! That would have been an absolute show-stopper for us if it hadn't been resolved (currently I'm running with Pandas 2.2.1 while the PR gets approved and merged into the pip packages).

tokoko commented 5 months ago

Hi, I'm not familiar enough with InfluxDB to answer that, but let me address other points.

  1. I don't think replacing Gunicorn with Waitress outright is the way to go here. Realistically, most production Python servers run on Linux, and incurring an unnecessary performance penalty there doesn't sound ideal. But we can probably turn both of them into platform-specific dependencies and use Waitress on Windows and Gunicorn on Unix; I'm pretty sure MLflow does the same thing. I think the bigger problem here is that we don't run tests on Windows right now. It's probably a good idea to start doing that at some point, at least for the more "client-side" components. Correct me if I'm wrong, but I don't think a feast[client] extra is possible, simply because Python extras only allow you to add extra dependencies, never remove existing ones. But I guess we can go the other way and pull some of these dependencies into a feast[feature_server] extra or something like that.

  2. I'm not really sure what you mean by client code, tbh, and the link provided is outdated now. The main client class is now FeatureStore; see the latest quickstart for examples. You would do something like this:

    from feast import FeatureStore

    feature_store = FeatureStore('.')  # in case you already have feature_store.yaml in place

    feature_store.get_online_features(...)      # in an online application
    feature_store.get_historical_features(...)  # in an offline workflow

Thanks
SeanWidesense commented 5 months ago

Thanks @tokoko -

1) Ironically, InfluxDB is a time-series database, which according to the Feast documentation is the preferred data format! InfluxDB is similar to Redis, with the exception that you have SQL-like data-retrieval calls: "SELECT {field} FROM {measurement} WHERE time > {time} GROUP BY * ORDER BY ...", for instance: "SELECT first(temperature) FROM san_francisco WHERE time >= '1/1/2024' AND time < '2/1/2024' GROUP BY time(1m)".
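The query shape described above can be sketched as a small builder function. This is a hypothetical helper, not part of Feast or any InfluxDB client; the field, measurement, and time bounds are illustrative:

```python
# Hypothetical helper that builds an InfluxQL downsampling query of the
# shape shown above. Names and time bounds are illustrative only.
def influxql_downsample(field, measurement, start, stop, every="1m"):
    return (
        f"SELECT first({field}) FROM {measurement} "
        f"WHERE time >= '{start}' AND time < '{stop}' "
        f"GROUP BY time({every})"
    )

query = influxql_downsample("temperature", "san_francisco",
                            "2024-01-01", "2024-02-01")
print(query)
```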

2) I like your idea: perhaps feast[waitress] or feast[gunicorn] to solve the Windows incompatibility with the fcntl package. The performance difference between the two packages is negligible in a small-load data-science environment where perhaps you have a max of 100 users (we're not trying to scale up to support a Google-sized load :-) ). I realize that most data scientists are using Linux/WSL; however, excluding all Windows users is short-sighted from a Feast adoption perspective.

3) In regards to the "client": thank you for the clarification; as someone mentioned, the documentation can be rather sparse. My example with MLflow was to illustrate that data scientists can connect to a REMOTE MLflow server from within their Python environment and save and load models, data files, parameters, and log information.

From a Feast perspective, it would be great if there were something similar, like:

    from feast import FeatureStore

    # communicate with a remote Feast server (supplying IP address/port,
    # as one does with MLflow, Redis, InfluxDB, and most other servers):
    feature_store = FeatureStore(uri='http://192.1.16.20:5000')

    feature_store.get_online_features(...)      # in an online application
    feature_store.get_historical_features(...)  # in an offline workflow

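
For what it's worth, Feast does ship a standalone HTTP feature server (started with `feast serve`) that exposes a POST /get-online-features endpoint, which can be called remotely today. A minimal sketch of building such a request payload follows; the server URL, port, feature references, and entity values are all illustrative assumptions, not taken from Feast docs:

```python
import json

# Hedged sketch: build the JSON payload for Feast's feature server
# POST /get-online-features endpoint. Feature refs and entity values
# below are illustrative only.
def online_features_payload(features, entities):
    return json.dumps({"features": features, "entities": entities})

payload = online_features_payload(
    ["driver_hourly_stats:conv_rate"],
    {"driver_id": [1001, 1002]},
)
# e.g. requests.post("http://192.1.16.20:6566/get-online-features",
#                    data=payload,
#                    headers={"Content-Type": "application/json"})
```

This gives a workaround for remote access until a first-class `FeatureStore(uri=...)` client exists.
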
tokoko commented 5 months ago
  1. I think it's a bit more complicated than that. Feast keeps only the latest values of features per entity in the online store, so there's no longer any time dimension there, which means time-series databases aren't necessarily appropriate for that use case. As for offline stores, while the datasets there are timestamped, a typical offline-store query does a lot of heavy joins and window functions. I have only ever used Influx as a Grafana backend with joinless queries over time-series metrics; I'm not sure how it would withstand the point-in-time queries that Feast produces. It might be suitable, but the fact that InfluxDB positions itself as a metrics database makes me a bit skeptical.
  2. Different extras aren't even necessary; you can have a single server extra with requirements like this:
    gunicorn; platform_system != 'Windows'
    waitress; platform_system == 'Windows'
  3. I'm 100% with you on this one. A slight difference in the case of Feast will be that, instead of a single remote server, there will probably have to be three servers.
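
The environment markers from point 2 could be wired into the package metadata along these lines. This is only a sketch; the extra name "server" is illustrative, not Feast's actual extra:

```python
# Sketch: PEP 508 environment markers make a single extra resolve to
# different packages per platform. The extra name "server" is illustrative.
EXTRAS_REQUIRE = {
    "server": [
        "gunicorn; platform_system != 'Windows'",
        "waitress; platform_system == 'Windows'",
    ]
}
# In setup.py this would be passed as:
#   setup(..., extras_require=EXTRAS_REQUIRE)
print(EXTRAS_REQUIRE["server"])
```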
tokoko commented 4 months ago

@SeanWidesense fyi, gunicorn dependency was removed for windows in #4024

SeanWidesense commented 4 months ago

> @SeanWidesense fyi, gunicorn dependency was removed for windows in #4024

Thank you @tokoko! I'll try it as soon as possible.

shuchu commented 3 months ago

The problem is solved: we use uvicorn instead of gunicorn, so Feast can run on Windows and WSL.