Closed ghukill closed 5 years ago
For DPLA Bulk Data Match: set `dbdm` to False for all records in a Job, then re-run the match.

Possible approach:
```python
from core.spark.console import *
from core.models import DPLABulkDataDownload
from pyspark.sql import functions as pyspark_sql_functions

# ids
job_id = ???
dbdd_id = ??

# get full DBDD ES index as a DataFrame
dbdd = DPLABulkDataDownload.objects.get(pk=dbdd_id)
dpla_df = get_job_es(spark, indices=[dbdd.es_index], doc_type='item')

# get Job mapped fields from ES
es_df = get_job_es(spark, job_id=job_id)

# semi-join on isShownAt: keep only Job records present in the bulk data
matches_df = es_df.join(dpla_df, es_df['dpla_isShownAt'] == dpla_df['isShownAt'], 'leftsemi')

# select rows from records_df for writing
# (records_df assumed here to be the Job's records DataFrame, loaded elsewhere)
update_dbdm_df = records_df.join(matches_df, records_df['_id']['oid'] == matches_df['db_id'], 'leftsemi')

# set dbdm column to True for matched records
update_dbdm_df = update_dbdm_df.withColumn('dbdm', pyspark_sql_functions.lit(True))
```
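The two `leftsemi` joins above can be illustrated with a plain-Python sketch (no Spark required); the row shapes and field values here are illustrative only, not Combine's actual data model:

```python
# Plain-Python sketch of the leftsemi-join matching logic (illustrative only).

def left_semi_join(left_rows, right_rows, left_key, right_key):
    """Keep rows from left whose key appears in right (Spark's 'leftsemi')."""
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] in right_keys]

# Mapped-field documents for the Job (ES) and the DPLA bulk data download
es_rows = [
    {'db_id': 'r1', 'dpla_isShownAt': 'http://example.org/item/1'},
    {'db_id': 'r2', 'dpla_isShownAt': 'http://example.org/item/2'},
]
dpla_rows = [{'isShownAt': 'http://example.org/item/1'}]

# Step 1: Job records whose isShownAt matched the bulk data
matches = left_semi_join(es_rows, dpla_rows, 'dpla_isShownAt', 'isShownAt')

# Step 2: flag matched records, leaving non-matches False
records = [{'oid': 'r1', 'dbdm': False}, {'oid': 'r2', 'dbdm': False}]
match_ids = {row['db_id'] for row in matches}
updated = [dict(r, dbdm=(r['oid'] in match_ids)) for r in records]
```

Note that `leftsemi` returns only columns from the left side, which is why the second join goes back to `records_df` to produce writable rows.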
Done.
Similar to field mapping and validations, it would be helpful to support the following "in-place" work for a Job, to avoid duplicating records:

- updating `record_id` with RITS
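As a sketch of what an "in-place" `record_id` update might look like, assuming a regex-style RITS (Record Identifier Transformation Scenario); the function name and rule below are hypothetical, not Combine's API:

```python
import re

# Hypothetical in-place record_id transformation, in the spirit of a
# regex-based RITS. Names and the rewrite rule are illustrative only.

def apply_rits(record_id, pattern, replacement):
    """Apply a regex find/replace to a record_id, returning the new id."""
    return re.sub(pattern, replacement, record_id)

records = [{'record_id': 'oai:repo:123'}, {'record_id': 'oai:repo:456'}]

# Rewrite ids in place rather than writing duplicate record copies
for r in records:
    r['record_id'] = apply_rits(r['record_id'], r'^oai:repo:', 'local:')
```

The key point is mutating the existing records rather than minting a new Job's worth of copies, mirroring how field mapping and validations already behave.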