MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

in-place job work, add: record_id transform, dpla bulk data match #272

Closed: ghukill closed this issue 5 years ago

ghukill commented 5 years ago

Similar to field mapping and validations, it would be helpful to support the following "in-place" work for a Job, to avoid duplicating Records: record_id transforms and DPLA Bulk Data matching.
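
For the record_id transform half, a minimal sketch of what an in-place update could look like in Spark (purely illustrative: it assumes the Job's Records are already loaded as records_df with a record_id column, and uses a hypothetical regex-based transform rather than anything Combine-specific):

from pyspark.sql import functions as pyspark_sql_functions

# hypothetical example: strip an OAI prefix from record_id in-place,
# rather than writing a new set of Records
records_df = records_df.withColumn(
    'record_id',
    pyspark_sql_functions.regexp_replace('record_id', r'^oai:[^:]+:', '')
)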

ghukill commented 5 years ago

For DPLA Bulk Data Match:

Possible approach:

from core.spark.console import *
from core.models import DPLABulkDataDownload
from pyspark.sql import functions as pyspark_sql_functions

# ids
job_id = ???
dbdd_id = ??

# get full dbdd es
dbdd = DPLABulkDataDownload.objects.get(pk=dbdd_id)
dpla_df = get_job_es(spark, indices=[dbdd.es_index], doc_type='item')

# get job mapped fields
es_df = get_job_es(spark, job_id=job_id)

# join on isShownAt
matches_df = es_df.join(dpla_df, es_df['dpla_isShownAt'] == dpla_df['isShownAt'], 'leftsemi')

# select from records_df for writing
# NOTE: records_df is assumed to be the Job's Records DataFrame,
# loaded elsewhere in the console session (it is not defined above)
update_dbdm_df = records_df.join(matches_df, records_df['_id']['oid'] == matches_df['db_id'], 'leftsemi')

# set dbdm column to True for matched records (withColumn returns a new DataFrame)
update_dbdm_df = update_dbdm_df.withColumn('dbdm', pyspark_sql_functions.lit(True))
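
As a rough sanity check on this approach (nothing Combine-specific, just counts over the DataFrames built above), the number of matched Records could be compared against the Job's total before anything is written back:

# count how many of the Job's Records matched the DPLA bulk data dump on isShownAt
total_records = es_df.count()
matched_records = update_dbdm_df.count()
print('DPLA bulk data matches: %s / %s' % (matched_records, total_records))
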
ghukill commented 5 years ago

Done.