Related: #180
Framework in place for background tasks / file downloads.
One approach to bridging the bg task <--> Livy chasm would be polling the Livy statement until it completes. It's not terribly expensive for Livy to check whether a Job is done, and it would be a surefire way of using the bg task as the primary means for determining a) how long the export takes and b) when it is done.
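A minimal sketch of what that polling might look like from the background task, assuming Livy's REST API at `http://localhost:8998` and that the session and statement ids are already known to the task (the function name and parameters here are hypothetical):

```python
import time
import requests

def poll_livy_statement(livy_base_url, session_id, statement_id, interval=5):
    """
    Poll a Livy statement until it reaches a terminal state.
    Returns the final state string, e.g. 'available' or 'error'.
    """
    url = '%s/sessions/%s/statements/%s' % (livy_base_url, session_id, statement_id)
    while True:
        resp = requests.get(url)
        resp.raise_for_status()
        state = resp.json().get('state')  # 'waiting', 'running', 'available', 'error', ...
        if state in ('available', 'error', 'cancelled'):
            return state
        time.sleep(interval)
```

Once the statement comes back `available`, the bg task could then kick off the post-export bunching and renaming described below.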
Additionally, and perhaps even more importantly, after the export is complete the process of "bunching" parts together and renaming files could take place.
Worth considering, too, that these methods may be useful as an additional option for "publishing" routes. Perhaps adding a new publish Job (or the first one) could trigger an export somewhere (as a bg task), which would then serve as a flat form of all published records.
Finis! Opening a separate ticket for cleanup of all export jobs.
Allow for the exporting of Records at various levels of hierarchy -- Jobs, Published, Record Group, etc. -- to XML files.
Some initial testing and thinking suggests that these exports will need to be "chunked" into smaller XML files. Thankfully, if exporting via Spark, this chunking would correlate with partitions.
This spike code demonstrates exporting all Records from a Job as `part-####` files:
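The spike itself isn't reproduced here, but a rough pyspark sketch of the idea, assuming the Job's Records are available as a DataFrame `records_df` with a `document` column holding each Record's XML (names and paths are illustrative), would be:

```python
# write one file per partition to the local filesystem via file://
export_path = 'file:///home/combine/data/exports/job_42'  # hypothetical path

records_df.rdd \
    .map(lambda row: row.document) \
    .saveAsTextFile(export_path)

# produces part-00000, part-00001, ... -- one file per partition
```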
Directory structure looks like this:
And the opening snippet of a file...
Some todos:

- records in each `part` file are not wrapped in a root element
- `part` files do not have file extensions

It would be fairly trivial to loop through and a) add a root element to "bunch" records in a single file, then b) rename files. But this would likely happen from a non-Spark context, as that pyspark space is using the HDFS layer, though Combine writes to `file://`.
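A rough sketch of that post-export loop, assuming the `part` files have already been written to a local export directory (the function and root element names are illustrative):

```python
import glob
import os

def bunch_and_rename(export_dir, root_element='records'):
    """
    For each extension-less part-#### file: wrap its Records in a root
    element so the file is well-formed XML, then rename it with a .xml
    extension.
    """
    for part in sorted(glob.glob(os.path.join(export_dir, 'part-*'))):
        if part.endswith('.xml'):
            continue
        with open(part) as f:
            body = f.read()
        wrapped = '<?xml version="1.0" encoding="UTF-8"?>\n<%s>\n%s\n</%s>\n' % (
            root_element, body, root_element)
        with open(part + '.xml', 'w') as f:
            f.write(wrapped)
        os.remove(part)  # drop the extension-less original
```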