big-data-europe / docker-hadoop-spark-workbench

[EXPERIMENTAL] This repo includes deployment instructions for running HDFS/Spark inside docker containers. Also includes spark-notebook and HDFS FileBrowser.

Copying files to HDFS #28

Closed ashemery closed 7 years ago

ashemery commented 7 years ago

Hello Ivan,

First, thanks for your response on Twitter, and for the whole project.

The issue I'm facing: I went through your blog post here: https://medium.com/@ivanermilov/scalable-spark-hdfs-setup-using-docker-2fd0ffa1d6bf

I've created the network and then used the commands from this repo to start my cluster. Commands used:

docker-compose -f docker-compose-hive.yml up -d namenode hive-metastore-postgresql
docker-compose -f docker-compose-hive.yml up -d datanode hive-metastore
docker-compose -f docker-compose-hive.yml up -d hive-server
docker-compose -f docker-compose-hive.yml up -d spark-master spark-worker spark-notebook hue

Now everything is up, and I can confirm that by checking the web interface of each service. The only thing I can't get to work is copying files. I've tried the following without success:

First try:

docker run -it --rm --env-file=../hadoop-hive.env --net hadoop uhopper/hadoop hadoop fs -mkdir -p /user/root

I noticed that uhopper/hadoop probably isn't from the same cluster, so I then tried:

docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root

and also:

docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root
docker run -it --rm --env-file=../hadoop-hive.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root

None of these worked; they all give me the same error message:

Configure host resolver to only use files
-mkdir: java.net.UnknownHostException: namenode
Usage: hadoop fs [generic options] -mkdir [-p] <path> ...

Is there something I'm missing here?

Thank you.

ashemery commented 7 years ago

Another problem I faced was this error: _COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation

earthquakesan commented 7 years ago

Hi @ashemery!

Thanks for your interest in the project! :-)

I wrote that blog post a while ago and a lot of it is outdated by now. For simplicity, let's use docker-compose.yml without Hive; you can switch to the Hive setup later if you need it.

After you clone the repo, run docker-compose:

docker-compose up

You will see a lot of log output in your terminal. Once it settles, go to http://localhost:50070 (the namenode web UI) and check that a datanode is available.
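If you prefer the command line, you can also check for live datanodes from inside the namenode container (assuming the container is named namenode, as in the compose file):

docker exec -it namenode hdfs dfsadmin -report

The report should show at least one live datanode; if it shows zero, uploads will fail with the "could only be replicated to 0 nodes" error you mentioned.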

Now, the simplest way to upload a file to HDFS is to use Hue at this address: http://localhost:8088/accounts/login/?next=/home I have blocked the index page for Hue because it was hanging the whole app, so if you get an error page you need to add /home to the URL in your browser. From there use the "File Browser"; you can find it in the upper right corner.

Your second option is to bash into the namenode container:

docker exec -it namenode bash
root@01d76b90e23f:/# touch emptyfile
root@01d76b90e23f:/# hadoop fs -moveFromLocal emptyfile /
root@01d76b90e23f:/# hadoop fs -ls /
Found 2 items
-rw-r--r--   3 root supergroup          0 2017-07-15 11:35 /emptyfile
drwxr-xr-x   - root supergroup          0 2017-07-15 11:14 /user

You can copy files from the local file system into the namenode container with the docker cp command.
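For example (a rough sketch, assuming the CSV from your first message is in your current directory):

docker cp vannbehandlingsanlegg.csv namenode:/tmp/
docker exec -it namenode hadoop fs -mkdir -p /user/root
docker exec -it namenode hadoop fs -put /tmp/vannbehandlingsanlegg.csv /user/root/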

The third option is to run another container with a mounted volume. First you need to figure out the correct Docker network (created automatically by docker-compose in this case):

➜  docker-hadoop-spark-workbench git:(master) docker network ls
NETWORK ID          NAME                                 DRIVER              SCOPE
...
5e4d3c08cc89        dockerhadoopsparkworkbench_default   bridge              local
...

Then you can mount a local folder into the container and copy to Hadoop from there (here I copy my syslog to HDFS):

➜  docker-hadoop-spark-workbench git:(master) docker run -it --rm --net dockerhadoopsparkworkbench_default --volume /var/log/syslog:/syslog --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -copyFromLocal /syslog /
Configuring core
 - Setting hadoop.proxyuser.hue.hosts=*
 - Setting fs.defaultFS=hdfs://namenode:8020
 - Setting hadoop.proxyuser.hue.groups=*
 - Setting hadoop.http.staticuser.user=root
Configuring hdfs
 - Setting dfs.namenode.name.dir=file:///hadoop/dfs/name
 - Setting dfs.permissions.enabled=false
 - Setting dfs.webhdfs.enabled=true
Configuring yarn
 - Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
 - Setting yarn.timeline-service.generic-application-history.enabled=true
 - Setting yarn.resourcemanager.recovery.enabled=true
 - Setting yarn.timeline-service.enabled=true
 - Setting yarn.log-aggregation-enable=true
 - Setting yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
 - Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
 - Setting yarn.nodemanager.remote-app-log-dir=/app-logs
 - Setting yarn.resourcemanager.resource.tracker.address=resourcemanager:8031
 - Setting yarn.resourcemanager.hostname=resourcemanager
 - Setting yarn.timeline-service.hostname=historyserver
 - Setting yarn.log.server.url=http://historyserver:8188/applicationhistory/logs/
 - Setting yarn.resourcemanager.scheduler.address=resourcemanager:8030
 - Setting yarn.resourcemanager.address=resourcemanager:8032
Configuring httpfs
Configuring kms
Configuring for multihomed network
➜  docker-hadoop-spark-workbench git:(master) docker run -it --rm --net dockerhadoopsparkworkbench_default --volume /var/log/syslog:/syslog --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -ls /                  
Configuring core
 - Setting hadoop.proxyuser.hue.hosts=*
 - Setting fs.defaultFS=hdfs://namenode:8020
 - Setting hadoop.proxyuser.hue.groups=*
 - Setting hadoop.http.staticuser.user=root
Configuring hdfs
 - Setting dfs.namenode.name.dir=file:///hadoop/dfs/name
 - Setting dfs.permissions.enabled=false
 - Setting dfs.webhdfs.enabled=true
Configuring yarn
 - Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
 - Setting yarn.timeline-service.generic-application-history.enabled=true
 - Setting yarn.resourcemanager.recovery.enabled=true
 - Setting yarn.timeline-service.enabled=true
 - Setting yarn.log-aggregation-enable=true
 - Setting yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
 - Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
 - Setting yarn.nodemanager.remote-app-log-dir=/app-logs
 - Setting yarn.resourcemanager.resource.tracker.address=resourcemanager:8031
 - Setting yarn.resourcemanager.hostname=resourcemanager
 - Setting yarn.timeline-service.hostname=historyserver
 - Setting yarn.log.server.url=http://historyserver:8188/applicationhistory/logs/
 - Setting yarn.resourcemanager.scheduler.address=resourcemanager:8030
 - Setting yarn.resourcemanager.address=resourcemanager:8032
Configuring httpfs
Configuring kms
Configuring for multihomed network
Found 3 items
-rw-r--r--   3 root supergroup          0 2017-07-15 11:35 /emptyfile
-rw-r--r--   3 root supergroup      96264 2017-07-15 11:40 /syslog
drwxr-xr-x   - root supergroup          0 2017-07-15 11:14 /user
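Applied to your original commands, the problem is most likely just the network name: the compose-created network is called dockerhadoopsparkworkbench_default (check with docker network ls, the name can differ depending on your project directory), not hadoop, which is why namenode could not be resolved. Something along these lines should work, run from the repository checkout:

docker run -it --rm --net dockerhadoopsparkworkbench_default --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -mkdir -p /user/root
docker run -it --rm --net dockerhadoopsparkworkbench_default --volume $(pwd):/data --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root
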
ashemery commented 7 years ago

Hello Ivan,

Thank you so much for taking the time to write this up for me. I tested all of the options, and they all work.

But now I've hit another issue: I tried to follow your steps to process the .csv file, but I'm running into errors. Do you want me to open another ticket for that?

BTW, is there a way to communicate with you (Email, Skype, DM via Twitter, etc)?

Thanks again for your help, totally appreciated.

earthquakesan commented 7 years ago

@ashemery please open another ticket. It's better to keep the communication public in case someone else has the same questions.