Related: #180
Framework in place for background tasks / file downloads.
One approach to bridging the bg task <--> Livy chasm would be polling the Livy statement until it completes. It's not terribly expensive for Livy to check whether a Job is done, and it would be a surefire way of using the bg task as the primary means for determining a) how long the export takes and b) when it is done.
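A minimal sketch of what that polling might look like from the background task, assuming Livy's REST API at `http://localhost:8998` and that the session and statement ids are already known to the task (the function name and parameters here are hypothetical):

```python
import time
import requests

def poll_livy_statement(livy_base_url, session_id, statement_id, interval=5):
    """
    Poll a Livy statement until it reaches a terminal state.
    Returns the final state string, e.g. 'available' or 'error'.
    """
    url = '%s/sessions/%s/statements/%s' % (livy_base_url, session_id, statement_id)
    while True:
        resp = requests.get(url)
        resp.raise_for_status()
        state = resp.json().get('state')  # 'waiting', 'running', 'available', 'error', ...
        if state in ('available', 'error', 'cancelled'):
            return state
        time.sleep(interval)
```

Once the statement comes back `available`, the bg task could then kick off the post-export bunching and renaming described below.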
Additionally, and perhaps even more importantly, after the export is complete the process of "bunching" parts together and renaming files could take place.
Worth considering, too, that these methods may be useful as an additional option for "publishing" routes. Perhaps adding a new publish Job (or the first one) could trigger an export somewhere (as a bg task), which would then serve as a flat form of all published records.
Finis! Opening a separate ticket for cleanup of all export jobs.
Allow for the exporting of Records at various levels of hierarchy -- Jobs, Published, Record Group, etc. -- to XML files.
Some initial testing and thinking suggests that these exports will need to be "chunked" into smaller XML files. Thankfully, if exporting via Spark, this chunking would correlate with partitions.
This spike code demonstrates exporting all Records from a Job as `part-####` files:
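The spike itself isn't reproduced here, but a rough pyspark sketch of the idea, assuming the Job's Records are available as a DataFrame `records_df` with a `document` column holding each Record's XML (names and paths are illustrative), would be:

```python
# write one file per partition to the local filesystem via file://
export_path = 'file:///home/combine/data/exports/job_42'  # hypothetical path

records_df.rdd \
    .map(lambda row: row.document) \
    .saveAsTextFile(export_path)

# produces part-00000, part-00001, ... -- one file per partition
```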
Directory structure looks like this:
And the opening snippet of a file...
Some todos:

- records in each `part` file are not wrapped in a root element
- `part` files do not have file extensions

It would be fairly trivial to loop through and a) add a root element to "bunch" records in a single file, then b) rename files. But this would likely happen from a non-Spark context, as that pyspark space is using the HDFS layer, though Combine writes to `file://`.
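A rough sketch of that post-export loop, assuming the `part` files have already been written to a local export directory (the function and root element names are illustrative):

```python
import glob
import os

def bunch_and_rename(export_dir, root_element='records'):
    """
    For each extension-less part-#### file: wrap its Records in a root
    element so the file is well-formed XML, then rename it with a .xml
    extension.
    """
    for part in sorted(glob.glob(os.path.join(export_dir, 'part-*'))):
        if part.endswith('.xml'):
            continue
        with open(part) as f:
            body = f.read()
        wrapped = '<?xml version="1.0" encoding="UTF-8"?>\n<%s>\n%s\n</%s>\n' % (
            root_element, body, root_element)
        with open(part + '.xml', 'w') as f:
            f.write(wrapped)
        os.remove(part)  # drop the extension-less original
```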