MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform

Transform Jobs diffs #169

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

In discussing the role of Transform Jobs, and the potential for new, "ad-hoc" Jobs based on find/replace or regex manipulation, it became apparent that diffs between Jobs would be helpful.

One approach might be:

Possible diff algos and libraries:

ghukill commented 6 years ago

unified_diff from Python's difflib is looking like a great option:

import difflib

# get_r appears to be a local helper that fetches a Record by primary key
t1 = get_r(1593275)
h1 = get_r(1593025)

# diff the two Records' XML documents, line by line
diffs = difflib.unified_diff(h1.document.splitlines(), t1.document.splitlines())
for line in diffs:
    print(line)

e.g., a snippet of the resulting diff:

+<?xml version="1.0" encoding="UTF-8"?>
+<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
+   <mods:titleInfo>
+      <mods:title>Edmund Dulac's fairy-book :</mods:title>
+   </mods:titleInfo>
+   <mods:subject>
+      <mods:topic xmlns:xlink="http://www.w3.org/1999/xlink"
+                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+                  xmlns="http://www.openarchives.org/OAI/2.0/">Fairy tales</mods:topic>
+   </mods:subject>
+   <mods:abstract>"Edmund Dulac's fairy- book: fairy tales of the allied nations," was
                   published in 1916. It contains 'Snegorotchka: a Russian fairy tale,' 'The buried
                   moon: an English fairy tale,' 'White Caroline and black Caroline: a Flemish fairy
                   tale,' 'The seven conquerors of the Queen of the Mississippi: a Belgian fairy
@@ -165,75 +19,54 @@

                   friar and the boy: an English fairy tale,' 'The green serpent: a French fairy
                   tale,' 'Urashima Taro: a Japanese fairy tale,' and 'The fire bird: a Russian fairy
                   tale.'</mods:abstract>
-
-
-               <mods:subject authority="lcsh">
-
-
-                  <mods:topic>Fairy tales</mods:topic>
-
-
-               </mods:subject>
ghukill commented 6 years ago

Record diff done.

For detecting Records changed across an entire Job, document hashing is still needed. Consider: https://docs.python.org/2/library/binascii.html#binascii.crc32 (in testing, about twice as fast as md5)
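
A rough sketch of what that fingerprinting could look like (assuming a Record object with its XML serialized as a unicode string in .document; the hashlib line only shows the slower md5 alternative being compared against):

import binascii
import hashlib

doc = record.document  # hypothetical Record instance

# crc32 fingerprint; mask to get a consistent unsigned value across Python versions
fingerprint = binascii.crc32(doc.encode('utf-8')) & 0xffffffff

# md5 equivalent, for comparison
md5_fingerprint = hashlib.md5(doc.encode('utf-8')).hexdigest()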

ghukill commented 6 years ago

Use the PySpark SQL function crc32 to implement this: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=w#pyspark.sql.functions.crc32
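
Something like the following, assuming Records are already loaded into a DataFrame with a document column (the DataFrame name is illustrative):

from pyspark.sql import functions as f

# add a crc32 fingerprint column computed from each Record's document text
records_df = records_df.withColumn('fingerprint', f.crc32(f.col('document')))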

ghukill commented 6 years ago

Example Spark query for fingerprint mismatches (where j2 is the transformed Job):

changed = j2.join(j1, j2['fingerprint'] == j1['fingerprint'], 'leftanti')

In [10]: changed.count()
Out[10]: 1

The question is one of approach:

Or, if we have the fingerprint, is that enough? Could a table be populated that shows which Records were altered for a Job, just by querying the diff of fingerprints between the current Job and its input Job?

Though a new model for each mismatch is the simplest, and most performant for querying, if all Records were altered -- and that's common for Transforms -- it would effectively double the number of rows. For 100k, 500k, 1m Records, that's a lot.
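
For comparison, the query-only approach floated above would need nothing beyond the leftanti join -- a sketch, with j1/j2 as the input and transformed Jobs' DataFrames:

# Records in the transformed Job whose fingerprint has no match in the
# input Job, i.e. Records altered by the Transform; no extra rows stored
changed = j2.join(j1, j2['fingerprint'] == j1['fingerprint'], 'leftanti')
changed_ids = changed.select('record_id')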

ghukill commented 6 years ago

Getting closer in the jobdiff branch.

Looking to write the transformed column now that the fingerprint column is populated for current and input Jobs.

This code is close, but requires selecting each column individually (not the end of the world, but it would be nice to be dynamic):

In [86]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(f.col('df2.id'), f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))

In [87]: df3.columns
Out[87]: ['id', 'record_id', 'transformed']
ghukill commented 6 years ago

This would be a dynamic version of that (note that record_id appears twice in the result below, since the explicit df2.record_id select is a leftover from the previous version):

In [96]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(*['df2.%s' % c for c in df2.columns if c not in ['transformed']], f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))

In [97]: df3.columns
Out[97]: 
['id',
 'combine_id',
 'record_id',
 'document',
 'error',
 'unique',
 'unique_published',
 'job_id',
 'published',
 'oai_set',
 'success',
 'valid',
 'fingerprint',
 'record_id',
 'transformed']
ghukill commented 6 years ago

Moved fingerprinting and transformed column writing to TransformJob in core.spark.
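
Roughly, the shape of that logic (a sketch only -- variable names here are assumptions, not the actual TransformJob code in core.spark):

from pyspark.sql import functions as f

# fingerprint the transformed output and the input Job's Records
transformed_df = transformed_df.withColumn('fingerprint', f.crc32(f.col('document')))
input_df = input_df.withColumn('fingerprint', f.crc32(f.col('document')))

# left join on fingerprint: no match in the input Job means the document changed
joined = transformed_df.alias("t").join(
    input_df.alias("i"),
    f.col('t.fingerprint') == f.col('i.fingerprint'),
    'left'
)

# rebuild the columns dynamically, swapping in the computed transformed flag
cols = ['t.%s' % c for c in transformed_df.columns if c != 'transformed']
flag = f.when(f.isnull(f.col('i.fingerprint')), f.lit(True)) \
        .otherwise(f.col('t.transformed')) \
        .alias('transformed')
transformed_df = joined.select(cols + [flag])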

Now that it's working, need to look at a couple of things:

ghukill commented 6 years ago

Implemented.