Udhay247 opened this issue 4 years ago
As discussed, we would like to support a data file located in an object storage system to be used with FlashML.
I researched a bit, and there are a few options to replicate/create an object store on Linux (e.g., on your VM), similar to what we can do for HDFS by installing the Hadoop package:
We can install Ceph on a single node. This is fairly involved and needs multiple tweaks, since Ceph assumes a multi-node cluster by default. Some links:
https://linoxide.com/linux-how-to/hwto-configure-single-node-ceph-cluster/
https://dzone.com/articles/single-node-ceph-cluster-sandbox
MinIO is another option, which looks a bit easier to set up. Link:
https://computingforgeeks.com/how-to-setup-s3-compatible-object-storage-server-with-minio/
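For the MinIO route, a minimal single-node sketch using Docker (assuming Docker is available on the VM; the credentials and port below are placeholders, not anything we have provisioned):

```shell
# Start a single-node MinIO server in Docker (placeholder credentials).
docker run -d --name minio -p 9000:9000 \
  -e MINIO_ACCESS_KEY=testaccesskey \
  -e MINIO_SECRET_KEY=testsecretkey \
  minio/minio server /data
```

The server is then reachable at http://localhost:9000, and buckets can be created through the web console or the `mc` client.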
Some tasks:
1. Set up object storage: install and test the system on a VM.
2. Enhance FlashML to read a data file located in object storage.
3. Enhance our Docker container to include the system from task 1, so that we can run unit tests.
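For task 2, Spark can read from any S3-compatible store (MinIO included) through the Hadoop S3A connector, so the FlashML change may be mostly configuration. A sketch of the properties involved, assuming a MinIO endpoint at localhost:9000; the credentials, bucket, and file names are placeholders:

```python
# Hadoop S3A properties for reading from an S3-compatible store (e.g. MinIO).
# Endpoint, credentials, bucket, and object names are all placeholders.
def s3a_spark_conf(endpoint, access_key, secret_key):
    return {
        # Point S3A at the local MinIO endpoint instead of AWS S3.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets as path components, not subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }

def s3a_path(bucket, key):
    # S3A paths use the s3a:// scheme: s3a://<bucket>/<object-key>
    return f"s3a://{bucket}/{key}"

conf = s3a_spark_conf("http://localhost:9000", "testaccesskey", "testsecretkey")
path = s3a_path("flashml-test", "titanic.csv")
# These would be applied while building the SparkSession, e.g.:
#   builder = SparkSession.builder
#   for k, v in conf.items(): builder = builder.config(k, v)
#   df = builder.getOrCreate().read.csv(path, header=True)
print(path)  # s3a://flashml-test/titanic.csv
```

The `hadoop-aws` jar (matching our Hadoop version) would also need to be on the Spark classpath.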
Another option is Apache Ozone, documented here: https://hadoop.apache.org/ozone/. This might be the easiest for us.
Let’s try to save the Titanic dataset in IBM COS and read it back through Spark.
This URL might be useful: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/
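For IBM COS specifically, the blog post above goes through the Stocator connector, which exposes a cos:// scheme to Spark. A sketch of the configuration as I understand it from the Stocator docs (the property keys should be double-checked); "myCos" is an arbitrary service name, and the endpoint, credentials, and bucket are placeholders:

```python
# Stocator configuration for IBM Cloud Object Storage.
# "myCos" is an arbitrary service label; credentials/bucket are placeholders.
def cos_spark_conf(service, endpoint, access_key, secret_key):
    p = "spark.hadoop."
    return {
        p + "fs.stocator.scheme.list": "cos",
        p + "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        p + "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        p + "fs.stocator.cos.scheme": "cos",
        # Per-service settings are keyed by the chosen service name.
        p + f"fs.cos.{service}.endpoint": endpoint,
        p + f"fs.cos.{service}.access.key": access_key,
        p + f"fs.cos.{service}.secret.key": secret_key,
    }

def cos_path(bucket, service, key):
    # Stocator paths embed the service name: cos://<bucket>.<service>/<key>
    return f"cos://{bucket}.{service}/{key}"

conf = cos_spark_conf("myCos", "s3.us.cloud-object-storage.appdomain.cloud",
                      "ACCESS_KEY", "SECRET_KEY")
path = cos_path("flashml-test", "myCos", "titanic.csv")
print(path)  # cos://flashml-test.myCos/titanic.csv
```

As with S3A, the Stocator jar would need to be on the Spark classpath for this to work.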