feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Python SDK: start_offline_to_online_ingestion Fails with default ingestion jar configuration #1275

Closed. jpugliesi closed this issue 3 years ago

jpugliesi commented 3 years ago

Expected Behavior

In the minimal_ride_hailing.ipynb example notebook, I expect the following cell to run:

job = client.start_offline_to_online_ingestion(
    driver_statistics,
    datetime(2020, 10, 10),
    datetime(2020, 10, 20)
)
# expect offline to online ingestion job to run

Current Behavior

In the minimal_ride_hailing.ipynb example notebook, the following cell:

job = client.start_offline_to_online_ingestion(
    driver_statistics,
    datetime(2020, 10, 10),
    datetime(2020, 10, 20)
)

Produces the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-72-e6363419621a> in <module>
      2     driver_statistics,
      3     datetime(2020, 10, 10),
----> 4     datetime(2020, 10, 20)
      5 )

~/.local/lib/python3.7/site-packages/feast/client.py in start_offline_to_online_ingestion(self, feature_table, start, end)
   1065                 feature_table=feature_table,
   1066                 start=start,
-> 1067                 end=end,
   1068             )
   1069         else:

~/.local/lib/python3.7/site-packages/feast/pyspark/launcher.py in start_offline_to_online_ingestion(client, project, feature_table, start, end)
    250             ),
    251             deadletter_path=client._config.get(opt.DEADLETTER_PATH),
--> 252             stencil_url=client._config.get(opt.STENCIL_URL),
    253         )
    254     )

~/.local/lib/python3.7/site-packages/feast/pyspark/launchers/aws/emr.py in offline_to_online_ingestion(self, ingestion_job_params)
    256 
    257         jar_s3_path = _upload_jar(
--> 258             self._staging_location, ingestion_job_params.get_main_file_path()
    259         )
    260         step = _sync_offline_to_online_step(

~/.local/lib/python3.7/site-packages/feast/pyspark/launchers/aws/emr_utils.py in _upload_jar(jar_s3_prefix, local_path)
    127 
    128 def _upload_jar(jar_s3_prefix: str, local_path: str) -> str:
--> 129     with open(local_path, "rb") as f:
    130         return _s3_upload(
    131             f,

FileNotFoundError: [Errno 2] No such file or directory: 'https://storage.googleapis.com/feast-jobs/spark/ingestion/feast-ingestion-spark-develop.jar'
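
For context on the error above: the traceback shows `_upload_jar` handing the configured jar location straight to the built-in `open()`, which only understands local filesystem paths, so the `https://` URL is looked up as a file name and is not found. Below is a minimal sketch of that failure mode. `is_remote_uri` and `local_open_fails` are hypothetical helpers written for illustration, not Feast functions, and the set of schemes checked is an assumption.

```python
from urllib.parse import urlparse

# Assumed set of schemes that indicate "not a local file"; adjust as needed.
REMOTE_SCHEMES = {"http", "https", "s3", "gs"}


def is_remote_uri(path: str) -> bool:
    """Hypothetical helper: True when `path` carries a remote URL scheme."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def local_open_fails(path: str) -> bool:
    """Hypothetical helper: True when the built-in open() cannot read `path`."""
    try:
        with open(path, "rb"):
            return False
    except OSError:
        # FileNotFoundError is a subclass of OSError.
        return True


default_jar = (
    "https://storage.googleapis.com/feast-jobs/spark/ingestion/"
    "feast-ingestion-spark-develop.jar"
)

print(is_remote_uri(default_jar))      # the default jar is a URL, not a local path
print(local_open_fails(default_jar))   # open() treats the URL as a file name
```

A check along these lines before the `open()` call would let the launcher tell a remote jar apart from a local one instead of failing with a misleading "No such file or directory".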

Steps to reproduce

  1. Install feast:
    pip install feast==0.8.2
  2. Run the cells in the example notebook. Note that I have configured the following Client fields but not the spark_ingestion_jar config (this configuration works fine for defining features in feast):
    client = Client(
        core_url='feast-feast-core.feast-dev:6565',
        spark_launcher="emr",
        emr_cluster_id="<redacted>",
        emr_region="<redacted>",
        spark_staging_location="<redacted>",
        emr_log_location="<redacted>",
        historical_feature_output_location="<redacted>"
    )

Specifications

jpugliesi commented 3 years ago

I now see this appears related to #1266

jpugliesi commented 3 years ago

I get a similar error even when defining spark_ingestion_jar configuration on the client, i.e. spark_ingestion_jar=s3://my-bucket/feast-ingestion.jar (with a valid jar of course):

  1. The jar exists:

    $ aws s3 ls s3://my-bucket/feast-ingestion.jar
    2021-01-19 23:44:12   45031646 feast-ingestion.jar
  2. Failed attempt to kick off ingestion:

    
    job = client.start_offline_to_online_ingestion(
        driver_statistics,
        datetime(2020, 10, 10),
        datetime(2020, 10, 20)
    )

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>
      2     driver_statistics,
      3     datetime(2020, 10, 10),
----> 4     datetime(2020, 10, 20)
      5 )

~/.local/lib/python3.7/site-packages/feast/client.py in start_offline_to_online_ingestion(self, feature_table, start, end)
   1167                 feature_table=feature_table,
   1168                 start=start,
-> 1169                 end=end,
   1170             )
   1171         else:

~/.local/lib/python3.7/site-packages/feast/pyspark/launcher.py in start_offline_to_online_ingestion(client, project, feature_table, start, end)
    280             ),
    281             deadletter_path=client._config.get(opt.DEADLETTER_PATH),
--> 282             stencil_url=client._config.get(opt.STENCIL_URL),
    283         )
    284     )

~/.local/lib/python3.7/site-packages/feast/pyspark/launchers/aws/emr.py in offline_to_online_ingestion(self, ingestion_job_params)
    269 
    270         jar_s3_path = _upload_jar(
--> 271             self._staging_location, ingestion_job_params.get_main_file_path()
    272         )
    273         step = _sync_offline_to_online_step(

~/.local/lib/python3.7/site-packages/feast/pyspark/launchers/aws/emr_utils.py in _upload_jar(jar_s3_prefix, local_path)
     82 
     83 def _upload_jar(jar_s3_prefix: str, local_path: str) -> str:
---> 84     with open(local_path, "rb") as f:
     85         uri = urlparse(os.path.join(jar_s3_prefix, os.path.basename(local_path)))
     86         return urlunparse(

FileNotFoundError: [Errno 2] No such file or directory: 's3://my-bucket/feast-ingestion.jar'
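
Both failures come from the same step: the EMR launcher always tries to open the jar as a local file and upload it to the staging location, even when the jar already lives on S3. One possible guard is sketched below as an illustration only; it is not the fix that was merged in Feast. `resolve_jar_s3_path` and the `upload` callable are hypothetical names, and rejecting `http(s)` URLs rather than downloading them is an assumption made to keep the sketch short.

```python
from typing import Callable
from urllib.parse import urlparse


def resolve_jar_s3_path(
    staging_location: str,
    jar_path: str,
    upload: Callable[[str, str], str],
) -> str:
    """Hypothetical guard: return an S3 path for the jar, uploading only local files.

    `upload(staging_location, local_path)` stands in for whatever routine
    copies a local file to S3 and returns the resulting S3 path.
    """
    scheme = urlparse(jar_path).scheme
    if scheme == "s3":
        # The jar is already on S3, so there is nothing to open or upload.
        return jar_path
    if scheme in ("http", "https"):
        # Assumption: fail with a clear message instead of a FileNotFoundError.
        raise ValueError(
            f"Ingestion jar {jar_path!r} is a remote URL; "
            "download it first or point the config at an s3:// or local path."
        )
    # Anything else is treated as a local filesystem path.
    return upload(staging_location, jar_path)


def fake_upload(prefix: str, local_path: str) -> str:
    """Stub uploader for demonstration; performs no network calls."""
    return prefix.rstrip("/") + "/" + local_path.rsplit("/", 1)[-1]


print(resolve_jar_s3_path("s3://staging/jars", "s3://my-bucket/feast-ingestion.jar", fake_upload))
print(resolve_jar_s3_path("s3://staging/jars", "/tmp/feast-ingestion.jar", fake_upload))
```

With a guard like this, a `spark_ingestion_jar` that already points at S3 would be passed to the EMR step unchanged, and only local jars would go through the upload path.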