Closed ghukill closed 6 years ago
In discussing the role of Transform Jobs, and the potential for new, "ad-hoc" Jobs based on find/replace or regex manipulation, it became apparent that diffs between Jobs would be helpful.
One approach might be:
Possible diff algos and libraries:
unified_diff from python's difflib is looking like a great option:

import difflib

# get_r is a local helper that returns a Record by id
t1 = get_r(1593275)
h1 = get_r(1593025)

# unified, line-by-line diff of the two Records' XML documents
diffs = difflib.unified_diff(h1.document.splitlines(), t1.document.splitlines())
for line in diffs:
    print(line)
e.g. snippet:
+<?xml version="1.0" encoding="UTF-8"?>
+<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
+ <mods:titleInfo>
+ <mods:title>Edmund Dulac's fairy-book :</mods:title>
+ </mods:titleInfo>
+ <mods:subject>
+ <mods:topic xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xmlns="http://www.openarchives.org/OAI/2.0/">Fairy tales</mods:topic>
+ </mods:subject>
+ <mods:abstract>"Edmund Dulac's fairy- book: fairy tales of the allied nations," was
published in 1916. It contains 'Snegorotchka: a Russian fairy tale,' 'The buried
moon: an English fairy tale,' 'White Caroline and black Caroline: a Flemish fairy
tale,' 'The seven conquerors of the Queen of the Mississippi: a Belgian fairy
@@ -165,75 +19,54 @@
friar and the boy: an English fairy tale,' 'The green serpent: a French fairy
tale,' 'Urashima Taro: a Japanese fairy tale,' and 'The fire bird: a Russian fairy
tale.'</mods:abstract>
-
-
- <mods:subject authority="lcsh">
-
-
- <mods:topic>Fairy tales</mods:topic>
-
-
- </mods:subject>
Record diff done.
For identifying Records changed across an entire Job, document hashing is still needed. Consider: https://docs.python.org/2/library/binascii.html#binascii.crc32 (in testing, about twice as fast as md5).
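A quick sketch of both hashing options, where doc_xml is a hypothetical stand-in for a Record's document string:

import binascii
import hashlib

doc_bytes = doc_xml.encode('utf-8')  # doc_xml: assumed Record document string

# crc32 returns an int; mask so the value is consistently unsigned across Python versions
crc_fingerprint = binascii.crc32(doc_bytes) & 0xffffffff

# md5 alternative (roughly half as fast in the testing noted above)
md5_fingerprint = hashlib.md5(doc_bytes).hexdigest()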
Use pyspark sql function to implement: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=w#pyspark.sql.functions.crc32
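A minimal sketch of writing that fingerprint column, assuming each Job's Records live in a DataFrame (j1 for the input Job, j2 for the transformed Job) with a document string column:

from pyspark.sql import functions as f

# compute a crc32 fingerprint over each Record's document
j1 = j1.withColumn('fingerprint', f.crc32(f.col('document')))
j2 = j2.withColumn('fingerprint', f.crc32(f.col('document')))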
Example Spark query for fingerprint mismatches, using a leftanti join to keep only j2 rows with no fingerprint match in j1 (where j2 is the transformed Job):
changed = j2.join(j1, j2['fingerprint'] == j1['fingerprint'], 'leftanti')
In [10]: changed.count()
Out[10]: 1
Question is, which approach: a new model instance for each altered Record?
Or, if we have the fingerprint, is that enough? Could a table be populated that shows which Records were altered for a Job just by querying the diff of fingerprints between the current Job and its input Job?
Though a new model for each mismatch is the simplest, and most performant for querying, if all Records were altered -- and that's common for Transform -- that would effectively double the number of rows. For 100k, 500k, 1m Records, that's a lot.
Getting closer in the jobdiff branch. Looking to write the transformed column now that the fingerprint column is populated for current and input Jobs.
This code is close, but would require selecting each column individually (not the end of the world, but would be nice to be dynamic):

In [86]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(f.col('df2.id'), f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))
In [87]: df3.columns
Out[87]: ['id', 'record_id', 'transformed']
This would be a dynamic version of that:

In [96]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(*['df2.%s' % c for c in df2.columns if c not in ['transformed']], f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))
In [97]: df3.columns
Out[97]:
['id',
'combine_id',
'record_id',
'document',
'error',
'unique',
'unique_published',
'job_id',
'published',
'oai_set',
'success',
'valid',
'fingerprint',
'record_id',
'transformed']
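Note that record_id shows up twice in Out[97]: the column comprehension already carries df2.record_id, and it is selected again explicitly. A slightly cleaner variant of the dynamic select (a sketch, not from the jobdiff branch) drops the duplicate:

from pyspark.sql import functions as f

# columns to carry over from df2 (everything except 'transformed')
carry_cols = [f.col('df2.%s' % c) for c in df2.columns if c != 'transformed']

# recompute 'transformed': True where no matching fingerprint exists in df1
# (the Record changed), otherwise keep df2's existing value
transformed_col = (
    f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True))
    .otherwise(f.col('df2.transformed'))
    .alias('transformed')
)

df3 = (
    df2.alias("df2")
    .join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left')
    .select(carry_cols + [transformed_col])
)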
Moved fingerprinting and transformed column writing to TransformJob in core.spark.
Now that it's working, need to look at a couple of things:
- fingerprint and transformed columns are written and carried over throughout

Implemented.