MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License
Tabular Data / Spreadsheets as Record type #309

Open ghukill opened 5 years ago

ghukill commented 5 years ago

If Combine is going to handle tabular data / spreadsheets, it will likely need to do the following, at the very least:

Spark natively reads .csv to dataframes:

dc_df = spark.read.csv('file:///home/combine/dc_spread.csv', header=True)
In [24]: dc_df.show()
+--------------------+--------------------+--------------------+
|            dc_title|      dc_description|          dc_creator|
+--------------------+--------------------+--------------------+
|Breakfast of Cham...|        A great book|       Kurt Vonneget|
|         Cats Cradle|Another gem of a ...|       Kurt Vonneget|
|           Moby Dick|      A true classic|    Herman Mellville|
|One Hundred Years...|Magical realism a...|Gabriel Garcia Ma...|
+--------------------+--------------------+--------------------+

Each row can then be serialized to JSON and stored under the document field in Mongo for the Record:

In [22]: dc_json_rdd.take(4)
Out[22]:
['{"dc_title":"Breakfast of Champions","dc_description":"A great book","dc_creator":"Kurt Vonneget"}',
...
 '{"dc_title":"One Hundred Years of Solitude","dc_description":"Magical realism at its finest","dc_creator":"Gabriel Garcia Marquez"}']
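
The snippet above does not show how dc_json_rdd was created. In PySpark it would presumably come from something like dc_df.toJSON(), which returns an RDD of JSON strings, but that is an assumption and not stated in the issue. To illustrate the row-to-JSON shape without needing a Spark session, here is a minimal sketch using only the Python standard library. The helper name rows_to_json and the sample data are hypothetical, not part of Combine.

```python
import csv
import io
import json


def rows_to_json(csv_text):
    """Serialize each CSV row to a compact JSON string keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # separators=(",", ":") drops the default spaces so the output uses
    # the same compact style as the Spark output above
    return [json.dumps(row, separators=(",", ":")) for row in reader]


# Hypothetical one-row sample using the same column names as dc_spread.csv
sample = (
    "dc_title,dc_description,dc_creator\n"
    "Moby Dick,A true classic,Herman Mellville\n"
)

for doc in rows_to_json(sample):
    print(doc)
```

Each resulting string is one flat JSON object per spreadsheet row, which is the form that would be stored in the Record's document field. Note that every value comes through as a string, since CSV carries no type information.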

Changes this will require: