MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Numerical limit filter for Transform Jobs throws off fingerprinting JOIN #228

ghukill closed 6 years ago

ghukill commented 6 years ago

When a numerical limit, input_numerical_valve, is applied to a Transform Job, it throws off the fingerprinting JOIN:

    records_trans = (
        records_trans.alias("records_trans")
        .join(
            input_records.alias("input_records"),
            input_records.fingerprint == records_trans.fingerprint,
            'left',
        )
        .select(
            *['records_trans.%s' % c for c in records_trans.columns if c not in ['transformed']],
            pyspark_sql_functions.when(
                pyspark_sql_functions.isnull(pyspark_sql_functions.col('input_records.fingerprint')),
                pyspark_sql_functions.lit(True),
            ).otherwise(pyspark_sql_functions.lit(False)).alias('transformed'),
        )
    )

For example, a limit of 1,000 might bring back 1,006 records, and a limit of 2,000 brings back 16k+.
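
For anyone reading the JOIN above, here is a minimal pure-Python sketch of what it computes: each transformed record is kept, and `transformed` is set to True when its fingerprint has no match among the input records. The function name `left_join_flag` and the dict-based records are hypothetical stand-ins, not Combine code. The sketch also shows a general property of left joins: one output row is emitted per match, so duplicate fingerprints on the input side yield more output rows than there are transformed records. Whether that is related to the counts reported here is not established in this thread.

```python
def left_join_flag(records_trans, input_records):
    """Hypothetical pure-Python stand-in for the left join on fingerprint."""
    # Index input records by fingerprint, keeping every occurrence,
    # because a SQL-style join matches against all rows with that key.
    by_fingerprint = {}
    for rec in input_records:
        by_fingerprint.setdefault(rec["fingerprint"], []).append(rec)

    joined = []
    for rec in records_trans:
        matches = by_fingerprint.get(rec["fingerprint"], [])
        if not matches:
            # No input record shares this fingerprint: the record changed.
            joined.append({**rec, "transformed": True})
        else:
            # One output row per matching input row, as in a left join.
            for _ in matches:
                joined.append({**rec, "transformed": False})
    return joined


# Illustrative data only (not taken from a real Job).
unique_inputs = [{"fingerprint": "a"}, {"fingerprint": "b"}]
duplicate_inputs = [{"fingerprint": "a"}, {"fingerprint": "a"}]
trans = [{"id": 1, "fingerprint": "a"}, {"id": 2, "fingerprint": "x"}]

flags = left_join_flag(trans, unique_inputs)
inflated = left_join_flag(trans, duplicate_inputs)
```

With unique input fingerprints the output has exactly one row per transformed record; with duplicates, the row count grows.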

ghukill commented 6 years ago

False positive, creating new issue #229 with actual problem.