Closed ghukill closed 6 years ago
In discussing the role of Transform Jobs, and the potential for new, "ad-hoc" Jobs based on find/replace or regex manipulation, it became apparent that diffs between Jobs would be helpful.
One approach might be:
Possible diff algos and libraries:
unified_diff from python's difflib is looking like a great option:

import difflib

# get_r is a local helper that returns a Record by id
t1 = get_r(1593275)
h1 = get_r(1593025)

# unified, line-by-line diff of the two Records' XML documents
diffs = difflib.unified_diff(h1.document.splitlines(), t1.document.splitlines())
for line in diffs:
    print(line)
e.g. snippet:
+<?xml version="1.0" encoding="UTF-8"?>
+<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
+ <mods:titleInfo>
+ <mods:title>Edmund Dulac's fairy-book :</mods:title>
+ </mods:titleInfo>
+ <mods:subject>
+ <mods:topic xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xmlns="http://www.openarchives.org/OAI/2.0/">Fairy tales</mods:topic>
+ </mods:subject>
+ <mods:abstract>"Edmund Dulac's fairy- book: fairy tales of the allied nations," was
published in 1916. It contains 'Snegorotchka: a Russian fairy tale,' 'The buried
moon: an English fairy tale,' 'White Caroline and black Caroline: a Flemish fairy
tale,' 'The seven conquerors of the Queen of the Mississippi: a Belgian fairy
@@ -165,75 +19,54 @@
friar and the boy: an English fairy tale,' 'The green serpent: a French fairy
tale,' 'Urashima Taro: a Japanese fairy tale,' and 'The fire bird: a Russian fairy
tale.'</mods:abstract>
-
-
- <mods:subject authority="lcsh">
-
-
- <mods:topic>Fairy tales</mods:topic>
-
-
- </mods:subject>
Record diff done.
For identifying Records changed across an entire Job, document hashing is still needed. Consider: https://docs.python.org/2/library/binascii.html#binascii.crc32 (in testing, about twice as fast as md5).
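A quick sketch of both hashing options, where doc_xml is a hypothetical stand-in for a Record's document string:

import binascii
import hashlib

doc_bytes = doc_xml.encode('utf-8')  # doc_xml: assumed Record document string

# crc32 returns an int; mask so the value is consistently unsigned across Python versions
crc_fingerprint = binascii.crc32(doc_bytes) & 0xffffffff

# md5 alternative (roughly half as fast in the testing noted above)
md5_fingerprint = hashlib.md5(doc_bytes).hexdigest()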
Use pyspark sql function to implement: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=w#pyspark.sql.functions.crc32
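A minimal sketch of writing that fingerprint column, assuming each Job's Records live in a DataFrame (j1 for the input Job, j2 for the transformed Job) with a document string column:

from pyspark.sql import functions as f

# compute a crc32 fingerprint over each Record's document
j1 = j1.withColumn('fingerprint', f.crc32(f.col('document')))
j2 = j2.withColumn('fingerprint', f.crc32(f.col('document')))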
Example Spark query for fingerprint mismatches, using a leftanti join to keep only j2 rows with no fingerprint match in j1 (where j2 is the transformed Job):
changed = j2.join(j1, j2['fingerprint'] == j1['fingerprint'], 'leftanti')
In [10]: changed.count()
Out[10]: 1
Question is, which approach: a new model instance for each altered Record?
Or, if we have the fingerprint, is that enough? Could a table be populated that shows which Records were altered for a Job just by querying the diff of fingerprints between the current Job and its input Job?
Though a new model for each mismatch is the simplest, and most performant for querying, if all Records were altered -- and that's common for Transform -- that would effectively double the number of rows. For 100k, 500k, 1m Records, that's a lot.
Getting closer in the jobdiff branch. Looking to write the transformed column now that the fingerprint column is populated for current and input Jobs.
This code is close, but would require selecting each column individually (not the end of the world, but would be nice to be dynamic):

In [86]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(f.col('df2.id'), f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))
In [87]: df3.columns
Out[87]: ['id', 'record_id', 'transformed']
This would be a dynamic version of that:

In [96]: df3 = df2.alias("df2").join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left').select(*['df2.%s' % c for c in df2.columns if c not in ['transformed']], f.col('df2.record_id'), f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True)).otherwise(f.col('df2.transformed')).alias('transformed'))
In [97]: df3.columns
Out[97]:
['id',
'combine_id',
'record_id',
'document',
'error',
'unique',
'unique_published',
'job_id',
'published',
'oai_set',
'success',
'valid',
'fingerprint',
'record_id',
'transformed']
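Note that record_id shows up twice in Out[97]: the column comprehension already carries df2.record_id, and it is selected again explicitly. A slightly cleaner variant of the dynamic select (a sketch, not from the jobdiff branch) drops the duplicate:

from pyspark.sql import functions as f

# columns to carry over from df2 (everything except 'transformed')
carry_cols = [f.col('df2.%s' % c) for c in df2.columns if c != 'transformed']

# recompute 'transformed': True where no matching fingerprint exists in df1
# (the Record changed), otherwise keep df2's existing value
transformed_col = (
    f.when(f.isnull(f.col('df1.fingerprint')), f.lit(True))
    .otherwise(f.col('df2.transformed'))
    .alias('transformed')
)

df3 = (
    df2.alias("df2")
    .join(df1.alias("df1"), df1.fingerprint == df2.fingerprint, 'left')
    .select(carry_cols + [transformed_col])
)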
Moved fingerprinting and transformed column writing to TransformJob in core.spark.
Now that it's working, need to look at a couple of things:
- fingerprint and transformed columns are written and carried over throughout

Implemented.