If Combine is going to handle tabular data / spreadsheets, it will likely need to, at a minimum:
harvest spreadsheets as static uploads
support transformations that convert to XML (a rough sketch follows this list)
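As a very rough sketch only, a spreadsheet row could be converted to a flat XML document along these lines; the row_to_xml helper, the element names, and the flat structure are illustrative assumptions, not a settled design:

from lxml import etree

def row_to_xml(row_dict):
    # build a flat XML document, one child element per column
    root = etree.Element('record')
    for field, value in row_dict.items():
        el = etree.SubElement(root, field)
        el.text = value
    return etree.tostring(root, pretty_print=True).decode('utf-8')

print(row_to_xml({'dc_title': 'Cats Cradle', 'dc_creator': 'Kurt Vonneget'}))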
Spark natively reads .csv to dataframes:
# header=True treats the first row of the .csv as column names
dc_df = spark.read.csv('file:///home/combine/dc_spread.csv', header=True)
In [24]: dc_df.show()
+--------------------+--------------------+--------------------+
| dc_title| dc_description| dc_creator|
+--------------------+--------------------+--------------------+
|Breakfast of Cham...| A great book| Kurt Vonneget|
| Cats Cradle|Another gem of a ...| Kurt Vonneget|
| Moby Dick| A true classic| Herman Mellville|
|One Hundred Years...|Magical realism a...|Gabriel Garcia Ma...|
+--------------------+--------------------+--------------------+
The dataframe can then be serialized to JSON and stored under the document field in Mongo for each Record (a storage sketch follows the output below):
In [22]: dc_json_rdd.take(4)
Out[22]:
['{"dc_title":"Breakfast of Champions","dc_description":"A great book","dc_creator":"Kurt Vonneget"}',
...
'{"dc_title":"One Hundred Years of Solitude","dc_description":"Magical realism at its finest","dc_creator":"Gabriel Garcia Marquez"}']
Changes this will require:
Records are currently assumed to be XML; Combine would need to attempt parsing each document as XML to determine whether it is valid (see the detection sketch after this list)
Records will likely need an opinionated type field
xml, csv, json (looking forward)
display of non-XML in Record view
mapping Records
cannot use XML2kvp
will serialized JSON naturally deconstruct to fields for ElasticSearch?
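As a rough sketch only, the XML-validity check and the opinionated type field could look like the following; the detect_record_type helper, the type labels, and the fallback order are assumptions:

import json
from lxml import etree

def detect_record_type(document):
    # return 'xml', 'json', or 'unknown' for a raw document string
    try:
        etree.fromstring(document.encode('utf-8'))
        return 'xml'
    except etree.XMLSyntaxError:
        pass
    try:
        json.loads(document)
        return 'json'
    except ValueError:
        pass
    return 'unknown'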