247-ai / FlashML

FlashML from [24]7.ai: A library for automated model training on Apache Spark
Apache License 2.0

Read input data from Object Storage Systems #7

Open Udhay247 opened 4 years ago

Udhay247 commented 4 years ago

Let’s try saving the Titanic dataset in IBM COS (Cloud Object Storage) and reading it back through Spark.

This URL might be useful: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/

Udhay247 commented 4 years ago

As discussed, we would like FlashML to support input data files located in an object storage system.

I researched a bit, and there are a few options for setting up an object store on Linux (e.g., on your VM), similar to how we install the Hadoop package to get HDFS:

We can install Ceph on a single node. This is fairly hard, though, and needs multiple tweaks, because Ceph assumes a cluster setup by default. Some links:

https://linoxide.com/linux-how-to/hwto-configure-single-node-ceph-cluster/

https://dzone.com/articles/single-node-ceph-cluster-sandbox

MinIO is another option, and it seems a bit easier to set up. Link:

https://computingforgeeks.com/how-to-setup-s3-compatible-object-storage-server-with-minio/
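
Since MinIO exposes an S3-compatible API, one way to reach it from Spark would be the Hadoop S3A connector. Below is a minimal sketch of the settings involved, written in PySpark for brevity. The helper names (`s3a_spark_conf`, `build_session`), the endpoint, and the credentials are placeholders of mine, not anything that exists in FlashML, and the `hadoop-aws` jar is assumed to be on the Spark classpath.

```python
from typing import Dict


def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Build Spark settings that point the Hadoop S3A connector at an S3-compatible server."""
    use_ssl = endpoint.lower().startswith("https://")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Self-hosted servers usually address buckets as URL paths, not subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def build_session(conf: Dict[str, str], app_name: str = "flashml-object-storage"):
    """Create a SparkSession with the given settings (needs pyspark and hadoop-aws)."""
    # Imported lazily so that s3a_spark_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Placeholder endpoint and credentials for a local MinIO instance (assumptions).
    conf = s3a_spark_conf("http://localhost:9000", "minioadmin", "minioadmin")
    for key in sorted(conf):
        print(key, "=", "****" if "secret" in key else conf[key])
```

With a session built this way, a file in a bucket would be addressed as `s3a://<bucket>/<key>`. The same settings should in principle work against any S3-compatible endpoint, but I have not verified that against each of the systems listed here.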

Some tasks:

  1. Set up object storage: install and test the chosen system on a VM

  2. Enhance FlashML to read a data file located in object storage

  3. Enhance our Docker container to include the system set up in task 1, so that we can run unit tests against it
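
To make the tasks above more concrete, here is a rough sketch of a smoke test that could later back the unit tests: it copies a local CSV (for example the Titanic file) to object storage through Spark, reads it back, and compares row counts. It is written in PySpark for brevity and assumes an already configured `SparkSession` is passed in. The function names and the list of URI schemes are my own assumptions, not existing FlashML code.

```python
from urllib.parse import urlparse

# URI schemes treated as object storage in this sketch (an assumption, not a FlashML list):
# s3a = Hadoop S3A connector, cos = IBM Stocator connector, o3fs/ofs = Apache Ozone.
OBJECT_STORE_SCHEMES = {"s3a", "cos", "o3fs", "ofs"}


def is_object_store_path(path: str) -> bool:
    """Return True when the path's URI scheme names an object store."""
    return urlparse(path).scheme.lower() in OBJECT_STORE_SCHEMES


def round_trip_csv(spark, local_csv: str, remote_uri: str) -> bool:
    """Write a local CSV to object storage via Spark, read it back, compare row counts.

    `spark` must already carry the connector settings for the target store.
    `local_csv` should use an explicit file:// URI so Spark does not resolve it
    against the default filesystem.
    """
    if not is_object_store_path(remote_uri):
        raise ValueError(f"not an object storage URI: {remote_uri}")
    original = spark.read.csv(local_csv, header=True, inferSchema=True)
    original.write.mode("overwrite").csv(remote_uri, header=True)
    restored = spark.read.csv(remote_uri, header=True, inferSchema=True)
    return original.count() == restored.count()


if __name__ == "__main__":
    # The scheme check needs no Spark and no running object store.
    for candidate in ("s3a://flashml/titanic", "hdfs://namenode:8020/titanic", "/tmp/titanic.csv"):
        print(candidate, is_object_store_path(candidate))
```

Comparing row counts is only a coarse check. A real unit test would probably compare schemas and contents as well.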

Udhay247 commented 4 years ago

Another option that I am seeing is Apache Ozone, documented here: https://hadoop.apache.org/ozone/. This might be the easiest for us.