"0 instance ids provided" errors while running feature generation on Spark
Bug/Feature Request Description
I'm trying to generate a set of features using parallel execution on a Spark cluster (Google Dataproc).
Basically, my workflow looks like this:
import featuretools as ft
from pyspark.sql import SparkSession
from pyspark.sql.types import *

df = spark.read.format("jdbc") ...  # read raw data

def build_entity_set(df):
    # Build the two-entity set ('foo' normalized into 'bar') from a pandas frame.
    return ft.EntitySet('foo').entity_from_dataframe(
        entity_id='foo',
        index='foo_id',
        time_index='foo_created',
        secondary_time_index={'foo_created': ['is_foo']},
        dataframe=df,
        variable_types={...}
    ).normalize_entity(
        'foo',
        'bar',
        'bar_id',
        additional_variables=[...],
        make_time_index='bar_created'
    )

# Run DFS once on a small sample to get the feature definitions.
mock_entity_set = build_entity_set(df.limit(10).toPandas())
mock_feature_matrix, features = ft.dfs(
    entityset=mock_entity_set,
    agg_primitives=[...],  # set of library and custom primitives
    trans_primitives=['time_since', 'is_weekend', 'weekday'],
    drop_exact=[
        'bar_id', ...  # and some other id columns
    ],
    cutoff_time=mock_entity_set['foo'].df[['foo_id', 'foo_created', 'is_foo']],
    target_entity='foo',
    max_depth=10,
)

# Derive the output schema from the sample feature matrix,
# mapping DecimalType columns to LongType.
schema = StructType([
    StructField(field.name, LongType(), True)
    if type(field.dataType) is DecimalType
    else field
    for field in spark.createDataFrame(mock_feature_matrix).schema.fields
])

def generate_features(pdf):
    # Applied per 'bar_id' group: rebuilds the entity set from the group
    # and calculates the precomputed feature definitions.
    return ft.calculate_feature_matrix(
        features=features,
        entityset=build_entity_set(pdf)
    ).reset_index(drop=True)

dataset = (
    df
    .groupby('bar_id')
    .applyInPandas(
        generate_features,
        schema
    )
)

dataset \
    .write \
    .mode("overwrite") \
    .format("parquet") \
    .option("path", OUTPUT_PATH) \
    .save()
Every now and then, when I launch feature generation for a large number of instances in the entity set (7M+), I see Spark tasks lost with the following error:
21/11/25 16:37:25 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 14.2 in stage 102.0 (TID 1161) (foo-pyspark-w-11.us-central1-a.c.my-foo-project.internal executor 65): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tmp/48bbb37c-f3bc-474b-8cdc-fc44d84c7d2d/create_foo_dataset.py", line 609, in generate_features
File "/opt/conda/miniconda3/envs/foo-env/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 288, in calculate_feature_matrix
feature_matrix = calculate_chunk(cutoff_time=cutoff_time_to_pass,
File "/opt/conda/miniconda3/envs/foo-env/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 363, in calculate_chunk
_feature_matrix = calculator.run(ids,
File "/opt/conda/miniconda3/envs/foo-env/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py", line 99, in run
assert len(instance_ids) > 0, "0 instance ids provided"
AssertionError: 0 instance ids provided
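One way to capture the inputs that trigger the assertion is to swap in a wrapper around generate_features that dumps the failing group before re-raising (a sketch; the dump path and file naming are placeholders, not part of the actual job):

import traceback

def generate_features_debug(pdf):
    # Same as generate_features, but persists the group that triggers
    # the assertion so it can be inspected offline.
    try:
        return generate_features(pdf)
    except AssertionError:
        # Placeholder path for illustration only.
        pdf.to_parquet("/tmp/failed_group_%s.parquet" % pdf["bar_id"].iloc[0])
        traceback.print_exc()
        raise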
The pandas DataFrame input to generate_features() that triggers this error always has length 1 and does not seem to be special in any way (it has all the necessary columns and contains no NaNs). After 4 consecutive task failures on the same input, the Spark job is killed.
I wasn't able to reproduce the error with the same input DataFrame (of size 1) in a local Spark runtime, so it doesn't seem to be a data issue.
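For reference, the local repro attempt looked roughly like this (a sketch; the parquet path is a placeholder for the group captured by the wrapper above):

import pandas as pd
from pyspark.sql import SparkSession

# Local Spark session for the repro attempt.
spark_local = SparkSession.builder.master("local[1]").getOrCreate()

failed_pdf = pd.read_parquet("/tmp/failed_group_123.parquet")  # placeholder path

# Same groupby + applyInPandas pipeline, applied to just the failing group.
repro = (
    spark_local.createDataFrame(failed_pdf)
    .groupby("bar_id")
    .applyInPandas(generate_features, schema)
)
repro.show()  # completes locally without "0 instance ids provided"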
Expected Output
I would expect the featuretools library to behave the same way locally and on a Spark cluster.
"0 instance ids provided" errors while running feature generation on Spark
Bug/Feature Request Description
I'm trying to generate set of features using parallel execution on Spark cluster (Google's Dataproc). Basically, my workflow looks like this:
So every now and then when I launch feature generation for large number of instances in entity set (7M+), I would see Spark tasks lost for the following reason:
Pandas DataFrame input to
generate_features()
that triggers this error always has length 1 and does not seem to be special in any way (has all necessary columns and does not containNaN
's).After 4 consecutive task losses with the same input Spark job is killed.
I wasn't able to reproduce this error using the same input DataFrame (of size 1) in local Spark runtime, so it doesn't seem to be a data issue.
Expected Output
I would expect
featuretools
library to have the same behavior locally and on spark cluster.Output of
featuretools.show_info()
(Dataproc)Output of
featuretools.show_info()
(local)