feathr-ai / feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

[BUG] Feathr REST API NGINX needs to support larger client header/cookie #878

Open Ritaja opened 1 year ago

Ritaja commented 1 year ago

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the Feathr community.

Feathr version

0.9.0

System information

(The problem is not specific to the system information above; see the details below.)

Describe the problem

Problem 1: The Feathr client communicates directly with Purview (Atlas) when the environment variable/config sets only the Purview name and no other feature-registry details are provided:

os.environ['feature_registry__purview__purview_name'] = f'{purview_name}' vs. setting os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1'

Problem 2: When the client uses the Feathr REST API with Purview as the backend, registering features via feathr_client.register_features() fails with an NGINX error:

Request Header Or Cookie Too Large
nginx/1.18.0

Possible fix: adapt https://github.com/feathr-ai/feathr/blob/main/deploy/nginx.conf to allow larger request headers.
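A minimal sketch of that kind of change, assuming the request is exceeding nginx's default 8k large-header buffer (plausibly because of a large Azure AD bearer token or auth cookies); the sizes below are illustrative, not tested values:

# inside the http (or server) block of deploy/nginx.conf
client_header_buffer_size 16k;        # default is 1k
large_client_header_buffers 4 32k;    # default is 4 8k; exceeding one buffer
                                      # yields "Request Header Or Cookie Too Large"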

Tracking information

No response

Code to reproduce bug

Feathr config YAML (common to both problems):

api_version: 1
project_config:
  project_name: "nyctaxi"
  required_environment_variables:
    - "REDIS_PASSWORD"
offline_store:
  adls:
    adls_enabled: true
  wasb:
    wasb_enabled: true
  s3:
    s3_enabled: false
    s3_endpoint: ""
  jdbc:
    jdbc_enabled: false
    jdbc_database: ""
    jdbc_table: ""
  snowflake:
    snowflake_enabled: false
    url: ""
    user: ""
    role: ""
spark_config:
  spark_cluster: "azure_synapse"
  spark_result_output_parts: "1"
  azure_synapse:
    dev_url: ""
    pool_name: ""
    workspace_dir: "abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/nyc_taxi"
    executor_size: "Small"
    executor_num: 1
  databricks:
    workspace_instance_url: ""
    config_template:
      {
        "run_name": "",
        "new_cluster":
          {
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "Standard_D3_v2",
            "num_workers": 2,
            "spark_conf": {},
          },
        "libraries": [{ "jar": "" }],
        "spark_jar_task": { "main_class_name": "", "parameters": [""] },
      }
    work_dir: ""
online_store:
  redis:
    host: "<replace_with_your_redis>.redis.cache.windows.net"
    port: 6380
    ssl_enabled: True
feature_registry:
  api_endpoint: ""
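
(For context: any value in this YAML can be overridden by an environment variable named after its path with double-underscore separators, which is the mechanism both problems below rely on. A minimal sketch, using the endpoint URL that appears later in this issue; the issue uses lower- and upper-case variable names interchangeably, so case appears not to matter:)

import os

# Overrides feature_registry.api_endpoint from the YAML above
os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = 'https://<resource_prefix>webapp.azurewebsites.net/api/v1'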

Problem 1:

Environment variable settings for the Python client:


import os
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

resource_prefix = "<replace your prefix>"
storage_accountname = "<replace>"
storage_containername = "<replace>"

key_vault_name=resource_prefix+"kv"
synapse_workspace_url=resource_prefix+"syws"
adls_account=resource_prefix+"dls"
adls_fs_name=resource_prefix+"fs"
purview_name=resource_prefix+"purview"
key_vault_uri = f"https://{key_vault_name}.vault.azure.net"
credential = DefaultAzureCredential(exclude_interactive_browser_credential=False, additionally_allowed_tenants=['*'])
client = SecretClient(vault_url=key_vault_uri, credential=credential)
secretName = "FEATHR-ONLINE-STORE-CONN"
retrieved_secret = client.get_secret(secretName).value

# The secret is a Redis connection string of the form
# <host>:<port>,password=<password>,ssl=<True|False>
redis_port = retrieved_secret.split(',')[0].split(":")[1]
redis_host = retrieved_secret.split(',')[0].split(":")[0]
redis_password = retrieved_secret.split(',')[1].split("password=", 1)[1]
redis_ssl = retrieved_secret.split(',')[2].split("ssl=", 1)[1]

os.environ['spark_config__azure_synapse__dev_url'] = f'https://{synapse_workspace_url}.dev.azuresynapse.net'
os.environ['spark_config__azure_synapse__pool_name'] = 'spark31'
os.environ['spark_config__azure_synapse__workspace_dir'] = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_project'
os.environ['online_store__redis__host'] = redis_host
os.environ['online_store__redis__port'] = redis_port
os.environ['online_store__redis__ssl_enabled'] = redis_ssl
os.environ['REDIS_PASSWORD']=redis_password
feathr_output_path = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_output'

Problematic setting:

Add this to the environment above: os.environ['feature_registry__purview__purview_name'] = f'{purview_name}'. With this set, the client appears to use the Purview client and communicate with the registry directly, not via the REST API.


If we instead define the REST API endpoint, i.e. replace os.environ['feature_registry__purview__purview_name'] = f'{purview_name}' with os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1', then the client uses the REST API. The two modes are summarized below.
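In short (same variables as the snippet above):

# Option A: only the Purview name is set -> the client talks to Purview (Atlas)
# directly, bypassing the Feathr REST API
os.environ['feature_registry__purview__purview_name'] = f'{purview_name}'

# Option B: the API endpoint is set -> the client goes through the REST API
os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1'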


Problem 2:

Use the same Feathr config YAML and environment variables as above, but now force the client to use the REST API by setting os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1':

from pyspark.sql import SparkSession, DataFrame
from feathr import (FeathrClient, HdfsSource, Feature, FeatureAnchor,
                    DerivedFeature, TypedKey, ValueType,
                    WindowAggTransformation,
                    FLOAT, INT32, BOOLEAN, INPUT_CONTEXT)

def feathr_udf_day_calc(df: DataFrame) -> DataFrame:
    from pyspark.sql.functions import dayofweek, dayofyear, col
    df = df.withColumn("fare_amount_cents", col("fare_amount")*100)
    return df

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/green_tripdata_2020-04_with_index.csv",
                          event_timestamp_column="lpep_dropoff_datetime",
                          preprocessing=feathr_udf_day_calc,
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

f_trip_distance = Feature(name="f_trip_distance",
                          feature_type=FLOAT, transform="trip_distance")
f_trip_time_duration = Feature(name="f_trip_time_duration",
                               feature_type=INT32,
                               transform="(to_unix_timestamp(lpep_dropoff_datetime) - to_unix_timestamp(lpep_pickup_datetime))/60")

features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(name="f_is_long_trip_distance",
            feature_type=BOOLEAN,
            transform="cast_float(trip_distance)>30"),
    Feature(name="f_day_of_week",
            feature_type=INT32,
            transform="dayofweek(lpep_dropoff_datetime)"),
]

request_anchor = FeatureAnchor(name="request_features",
                               source=INPUT_CONTEXT,
                               features=features)

location_id = TypedKey(key_column="DOLocationID",
                       key_column_type=ValueType.INT32,
                       description="location id in NYC",
                       full_name="nyc_taxi.location_id")
agg_features = [Feature(name="f_location_avg_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="AVG",
                                                          window="90d")),
                Feature(name="f_location_max_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="MAX",
                                                          window="90d")),
                Feature(name="f_location_total_fare_cents",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="fare_amount_cents",
                                                          agg_func="SUM",
                                                          window="90d")),
                ]

agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)

f_trip_time_distance = DerivedFeature(name="f_trip_time_distance",
                                      feature_type=FLOAT,
                                      input_features=[
                                          f_trip_distance, f_trip_time_duration],
                                      transform="f_trip_distance * f_trip_time_duration")

f_trip_time_rounded = DerivedFeature(name="f_trip_time_rounded",
                                     feature_type=INT32,
                                     input_features=[f_trip_time_duration],
                                     transform="f_trip_time_duration % 10")

# Assumes the YAML above is saved as feathr_config.yaml in the working directory
feathr_client = FeathrClient(config_path="feathr_config.yaml")

feathr_client.build_features(anchor_list=[agg_anchor, request_anchor],
                             derived_feature_list=[f_trip_time_distance, f_trip_time_rounded])

feathr_client.register_features()
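One way to check whether the rejection comes from nginx's header limits rather than from Feathr itself is to replay a request with oversized headers against the registry endpoint. A debugging sketch; the /api/v1/projects path and the padding header are illustrative assumptions, not part of the repro:

import requests

endpoint = 'https://<resource_prefix>webapp.azurewebsites.net/api/v1/projects'  # hypothetical registry path
# A single header line over 8k should trip nginx's default large_client_header_buffers
resp = requests.get(endpoint, headers={'X-Padding': 'x' * 9000})
print(resp.status_code, resp.reason)  # expect 400 Request Header Or Cookie Too Large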

What component(s) does this bug affect?

mufajjul commented 1 year ago

@Yuqing-cat to support @Ritaja on this issue.

Yuqing-cat commented 1 year ago

For the first problem, I added a log to inform users to use the API route: https://github.com/feathr-ai/feathr/pull/892. For the second, NGINX problem, I cannot reproduce it in my environment. @Ritaja, could you help dump the headers so we can understand why nginx rejects the request from the Python client? The default nginx large-header buffer size is 8k, which means the client must be sending a request with headers over 8k; our sample should not reach that.
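For example, something like this prints every outgoing request's headers, assuming the client issues its registry calls through requests/urllib3 (which sit on top of http.client); a debugging sketch, not part of the Feathr API:

import http.client
import logging

# Echo raw HTTP traffic, including request headers, to stdout/logs
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

feathr_client.register_features()  # the headers of each request are now printed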

Ritaja commented 1 year ago

@Yuqing-cat I could reproduce it through an Azure deployment. Could you deploy the Feathr components on Azure using the guide here and test with the above code snippets?