GELOG / docker-ubuntu-hbase

Dockerfile for running HBase on Ubuntu
Apache License 2.0

Provide documentation in README.md for using Hadoop for storage instead of local mode #7

Open jar349 opened 7 years ago

jar349 commented 7 years ago

I'm using your docker images to create a hadoop cluster (defined in a docker-compose file). Now, I would like to add your hbase image, but it is configured to use local storage.

I could create my own image based on yours with a custom configuration file, or I could mount the config volume and place my own config file there for HBase to read. However, I think there's a simpler path: have the image take local or hdfs as an argument and do the "right thing" on the user's behalf.

I am imagining something like command: hbase master local start or command: hbase master hdfs start, where the values needed in hbase-site.xml to use Hadoop would come from environment variables (-e HDFS_MASTER=<hostname>).
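
Something like this rough entrypoint sketch is what I have in mind (the script, the paths, and the variable names are all just illustrative, not anything that exists in the image today):

#!/bin/sh
# Hypothetical wrapper, invoked as: entrypoint.sh master <local|hdfs> start
ROLE="$1"; MODE="$2"; ACTION="$3"

if [ "$MODE" = "hdfs" ]; then
    # Point HBase at the HDFS namenode named by the HDFS_MASTER env var
    cat > /opt/hbase/conf/hbase-site.xml <<EOF
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://${HDFS_MASTER}:8020/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
</configuration>
EOF
fi

exec hbase "$ROLE" "$ACTION"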

What do you think?

davidonlaptop commented 7 years ago

I agree with you that the documentation is not clear on how to use the Hadoop and HBase docker images together. Using environment variables is an interesting approach that fits well with the Docker philosophy.

One thing to consider is that you may lose data locality with this method. As far as I know, Hadoop is not yet Docker-aware, so if the datanode and regionserver run in separate containers, they will have different IP addresses and HBase will assume the two services are not on the same machine. Data access may therefore not be optimal.

However, many people use S3 in production, and Hadoop can't figure out data locality with S3 either.

Can you elaborate more on your use case?

jar349 commented 7 years ago

Use case:

Building a library of compose files that I can, ahem... compose together, a la: https://docs.docker.com/compose/extends/#/multiple-compose-files

I've already got a zookeeper quorum, and I've got a distributed hadoop cluster (using your hadoop image to provide a name node, data node, and secondary name node).

Now I want a set of files that I can compose on top of zookeeper/hadoop: hbase, spark, kylin, etc.

So, this would be for local development and testing. But my goal is to try to mimic a realistic setup, meaning: more than one zk instance, a hadoop secondary name node, more than one hbase region server, hbase actually using hadoop instead of the local file system, etc.
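
The hbase overlay file I'm picturing would look roughly like this (the image name, the zookeeper and namenode service names, and the HDFS_MASTER variable from my earlier comment are all assumptions, not things that exist yet):

version: "2"
services:
  hbase:
    image: gelog/hbase
    command: hbase master hdfs start
    environment:
      - HDFS_MASTER=namenode   # hostname of the hadoop name node service
    depends_on:
      - zookeeper
      - namenode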

dav-ell commented 4 years ago

I'd also appreciate this. This is the best hbase docker repo I can find (that works with Thrift), and having this clearly described in the README would make this repository immensely powerful. Starting with no knowledge of HBase or HDFS, I'd be able to spin up a near-production-ready HDFS-backed HBase DB in 10 minutes. You have to admit, that's pretty cool.

Don't forget all the students out there coming out of school, getting their feet wet with big data tools, and floundering because of their complexity. This would go a long way toward helping them.

davidonlaptop commented 4 years ago

Hi Dav and John Ruiz,

Sure! If you could please submit a merge request, I'll have it approved and deployed.

-D

dav-ell commented 4 years ago

Thanks! I'll see what I can do.

Do you happen to know how to do it already? My progress on Hadoop in Docker has been slow. sequenceiq's is super old, big-data-europe's was giving me errors, and harisekhon's seems to work perfectly, so I was using that. However, trying to connect HBase to it hasn't been straightforward.

I had to change the configuration file (hdfs-site.xml) from the default (which was writing to /tmp) to:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Single-node setup, so keep only one copy of each block -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Store datanode blocks under /data instead of the default under /tmp -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data</value>
    </property>
</configuration>

in order for it to write to a new directory (that's easier for me to mount). Then I run it using something like:

docker run -d --name hdfs \
    -p 8042:8042 -p 8088:8088 -p 19888:19888 -p 50070:50070 -p 50075:50075 \
    -v $HOME/hdfs-data:/data \
    -v $HOME/hdfs-site.xml:/hadoop/etc/hadoop/hdfs-site.xml \
    harisekhon/hadoop
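
A quick way to sanity-check that HDFS is actually up, assuming the hdfs binary is on the container's PATH:

docker exec hdfs hdfs dfsadmin -report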

After that, I feel pretty confident that HDFS is set up properly. However, to connect HBase to it, the best I've got so far is changing the HDFS URL to:

hdfs://ip-of-docker-container:8020/
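
In hbase-site.xml terms, I believe that means something like this (hbase.rootdir being, as far as I can tell, the relevant property):

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://ip-of-docker-container:8020/</value>
    </property>
</configuration>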

Does that look right?

dav-ell commented 4 years ago

Actually, that worked. Have any corrections before I add it to the readme?

dav-ell commented 4 years ago

Pull request #10 added.