MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Create Record subsets based on ES queries #233

Closed: ghukill closed 6 years ago

ghukill commented 6 years ago

Underway in the essubset branch. Some spike code for issuing an ES query and joining the results against DB records:

# Assumes sc, spark, and Combine's get_job_as_df are already in scope.
# Input format settings shared by both reads
es_input = dict(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable")

# Option 1: query passed as a URI-style q parameter
es_rdd = sc.newAPIHadoopRDD(conf={
    "es.resource": "j477/record",
    "es.query": '?q=mods_subject_topic:"Scaffolding"',
    "es.read.field.exclude": "*"}, **es_input)

# Option 2 (alternate): same query written as a JSON query DSL string
es_rdd = sc.newAPIHadoopRDD(conf={
    "es.resource": "j477/record",
    "es.query": '{"query":{"match":{"mods_subject_topic":"Scaffolding"}}}',
    "es.read.field.exclude": "*"}, **es_input)

# convert to DataFrame; the first tuple element becomes column _1
es_df = es_rdd.toDF()

# get job df
job_df = get_job_as_df(spark, 477)

# left semi join: keep only job records whose combine_id matches an ES hit in _1
subset_df = job_df.join(es_df, job_df['combine_id'] == es_df['_1'], 'leftsemi')
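
Both reads above hand-write the es.query string, which means escaping quotes manually. As a minimal sketch (not part of the spike), the conf could be built from a field/value pair with json.dumps so the quoting is handled by the JSON encoder. The helper name build_es_conf and its parameters are hypothetical; only the three conf keys come from the code above.

```python
import json


def build_es_conf(index, doc_type, field, value):
    """Build an elasticsearch-hadoop conf dict for a single-field match query.

    Hypothetical helper: the name and signature are assumptions for
    illustration, not Combine's API.
    """
    # Let the JSON encoder handle quoting/escaping of the search value
    query = {"query": {"match": {field: value}}}
    return {
        "es.resource": f"{index}/{doc_type}",
        "es.query": json.dumps(query),
        # Mirrors the spike: exclude every field from the results
        "es.read.field.exclude": "*",
    }


conf = build_es_conf("j477", "record", "mods_subject_topic", "Scaffolding")
print(conf["es.query"])
# {"query": {"match": {"mods_subject_topic": "Scaffolding"}}}
```

The resulting dict could then be passed as conf= to sc.newAPIHadoopRDD in place of the literal dicts above.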

ghukill commented 6 years ago

POC in place. Next to address:

ghukill commented 6 years ago

Needs other bits and pieces to be user-friendly, but it's functional. Closing.