big-data-europe / docker-hadoop

Apache Hadoop docker image

"New to Hadoop" use case #54

Open dav-ell opened 4 years ago

dav-ell commented 4 years ago

I'm completely new to Hadoop, and I found this repo because I had the thought, "Wow, installing Hadoop is hard, and all I want is HDFS. Surely there's got to be an easier way to do this. Maybe someone made a Docker container!"

Indeed, this repo does an amazing job of getting all the complicated details out of the way. But there are a number of questions that are left unanswered after getting this running. Thought it'd be useful to list them here:

These questions will probably be answered just by working with Hadoop more, but I thought they could help you guys if you're looking to address the new crowd. Lots of university students, especially those doing data science/engineering, are starting to feel the need to get familiar with tools like this.

dav-ell commented 4 years ago

How do I get started with uploading data to HDFS?

This question is answered by mounting your local directory into the datanode container, like this:

    volumes:
      - hadoop_datanode:/hadoop/dfs/data
      - /home/me:/home/me

Then use docker exec -it [datanode-id] hdfs dfs -put /home/me/file /hdfs/location/file.
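
Putting the pieces together, a minimal end-to-end sketch (assuming the datanode container is reachable as datanode, as in this repo's docker-compose.yml, and that /home/me/file exists on the host):

    # docker-compose.yml -- mount the host directory into the datanode service
    datanode:
      volumes:
        - hadoop_datanode:/hadoop/dfs/data
        - /home/me:/home/me

    # create the target directory and copy the file into HDFS
    docker exec -it datanode hdfs dfs -mkdir -p /hdfs/location
    docker exec -it datanode hdfs dfs -put /home/me/file /hdfs/location/file

    # verify the upload
    docker exec -it datanode hdfs dfs -ls /hdfs/location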

2qif49lt commented 4 years ago

How do I get started with uploading data to HDFS? I tried using "Upload" in http://localhost:9870 -> Utilities -> Browse the file system, and it failed.

I ran into the same trouble as you when I wanted to use WebHDFS to operate on files. Upload in Utilities -> Browse reports a failure that looks like WebHDFS is not working.

heeeeeeeeeeeelp

2qif49lt commented 4 years ago

I know now: the WebHDFS REST API returns the datanode's hostname, not its IP, and the browser cannot resolve that hostname.

FHamster commented 4 years ago

I know now: the WebHDFS REST API returns the datanode's hostname, not its IP, and the browser cannot resolve that hostname.

The WebHDFS REST API redirects the request to a datanode, but it uses the datanode's hostname. Docker's network is separate from the host machine's, which means the host machine cannot connect to a datanode inside Docker directly.
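
You can see the redirect for yourself with curl (hypothetical file path; 9870 is the Hadoop 3.x namenode HTTP port that serves WebHDFS):

    # step 1: the namenode answers with an HTTP 307 redirect...
    curl -i "http://localhost:9870/webhdfs/v1/some/file?op=OPEN"
    # ...whose Location header points at the datanode by hostname, e.g.
    # Location: http://<datanode-hostname>:9864/webhdfs/v1/some/file?op=OPEN&...
    # step 2: following that redirect fails, because the host cannot resolve the name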

So I added a forward-proxy service using nginx to the docker-compose file and set up the proxy server in my browser. It works, but not very well: I can download files from HDFS using WebHDFS, but I have to change the hostname to an IP address manually.
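
In case it helps anyone reproducing this, a minimal sketch of such a forward proxy, assuming nginx runs as a service on the same compose network (so it can resolve container hostnames through Docker's embedded DNS at 127.0.0.11) and the browser is pointed at it as an HTTP proxy on a hypothetical port 8888:

    # nginx.conf -- rudimentary plain-HTTP forward proxy
    server {
        listen 8888;
        resolver 127.0.0.11;   # Docker's embedded DNS, resolves datanode hostnames
        location / {
            # forward the request to whatever host the browser asked for,
            # e.g. the datanode hostname in the WebHDFS redirect
            proxy_pass http://$http_host$request_uri;
        }
    }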

I noticed that you had the same problem. How can I get WebHDFS to return an IP address instead of the datanode's hostname? @2qif49lt
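
One workaround that might avoid the proxy entirely (a sketch; I have not verified it against this repo's compose file): publish the datanode's HTTP port and map its container hostname to 127.0.0.1 on the host, so the browser can follow the WebHDFS redirect directly:

    # docker-compose.yml -- fix the datanode's hostname and publish its web port
    datanode:
      hostname: datanode
      ports:
        - "9864:9864"   # Hadoop 3.x datanode HTTP/WebHDFS port

    # /etc/hosts on the host machine
    127.0.0.1  datanode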

otosky commented 4 years ago

There is a great answer by @earthquakesan on how to access and copy files to the Hadoop FS here, btw: https://github.com/big-data-europe/docker-hadoop-spark-workbench/issues/28#issuecomment-315528621
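
For anyone who just wants the quick version, the general shape of that workflow (a sketch; container name namenode and a hypothetical file some-file.txt assumed) is to copy the file into the namenode container and put it into HDFS from there:

    # copy a local file into the namenode container, then into HDFS
    docker cp ./some-file.txt namenode:/tmp/
    docker exec -it namenode hdfs dfs -put /tmp/some-file.txt /
    docker exec -it namenode hdfs dfs -ls /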