Closed by ashemery 7 years ago
Another problem that I faced was this:
_COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation
Hi @ashemery!
Thanks for your interest in the project! :-)
I wrote that blog post a while ago and a lot of things there are outdated. For simplicity, let's use the docker-compose.yml without Hive. You can switch to the Hive one later if you need it.
After you clone the repo, run docker-compose:
docker-compose up
You will see a lot of log output in your terminal. Once the output stops, go to http://localhost:50070 (the namenode Web UI) and check that a datanode is available.
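If you prefer the command line, a quick sanity check (a sketch, assuming the container is named namenode as in the docker exec example below) is to ask the namenode for a cluster report:
docker exec -it namenode hdfs dfsadmin -report
Look for a non-zero number of live datanodes in the output; if it reports zero, the datanode container is not running or has not registered with the namenode, and uploads will fail with the "replicated to 0 nodes" error above.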
Now, the simplest way to upload a file to HDFS is to use Hue at this address: http://localhost:8088/accounts/login/?next=/home. I have blocked the index page for Hue because it was hanging the whole app, so if you get an error page you need to add /home to the query string in your browser. From there, use the "File Browser"; you can find it in the upper right corner.
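Since dfs.webhdfs.enabled=true in this setup (you can see it in the startup output further down), you can also peek at HDFS from the host via the WebHDFS REST API on the namenode HTTP port, e.g. to list the root directory:
curl "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"
Note that writing files through WebHDFS from the host is less convenient, because the namenode redirects the upload to a datanode hostname that may not resolve outside the Docker network, so for uploads stick with the options below.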
Your second option is to bash into the namenode container:
docker exec -it namenode bash
root@01d76b90e23f:/# touch emptyfile
root@01d76b90e23f:/# hadoop fs -moveFromLocal emptyfile /
root@01d76b90e23f:/# hadoop fs -ls /
Found 2 items
-rw-r--r-- 3 root supergroup 0 2017-07-15 11:35 /emptyfile
drwxr-xr-x - root supergroup 0 2017-07-15 11:14 /user
You can copy files from the local file system into the namenode container with the docker cp command.
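For example (the file name here is just a placeholder), something like:
docker cp ./myfile.csv namenode:/tmp/myfile.csv
docker exec -it namenode hadoop fs -put /tmp/myfile.csv /
should land the file in HDFS under /.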
The third option is to run another container with a mounted volume. First, you need to figure out the correct Docker network (created automatically by docker-compose in this case):
➜ docker-hadoop-spark-workbench git:(master) docker network ls
NETWORK ID NAME DRIVER SCOPE
...
5e4d3c08cc89 dockerhadoopsparkworkbench_default bridge local
...
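If you are not sure which network the stack is attached to, you can also inspect it and check the list of containers in the output:
docker network inspect dockerhadoopsparkworkbench_default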
Then you can mount a local folder into a container on that network and copy it into Hadoop from there (here I copy my syslog to HDFS):
➜ docker-hadoop-spark-workbench git:(master) docker run -it --rm --net dockerhadoopsparkworkbench_default --volume /var/log/syslog:/syslog --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -copyFromLocal /syslog /
Configuring core
- Setting hadoop.proxyuser.hue.hosts=*
- Setting fs.defaultFS=hdfs://namenode:8020
- Setting hadoop.proxyuser.hue.groups=*
- Setting hadoop.http.staticuser.user=root
Configuring hdfs
- Setting dfs.namenode.name.dir=file:///hadoop/dfs/name
- Setting dfs.permissions.enabled=false
- Setting dfs.webhdfs.enabled=true
Configuring yarn
- Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
- Setting yarn.timeline-service.generic-application-history.enabled=true
- Setting yarn.resourcemanager.recovery.enabled=true
- Setting yarn.timeline-service.enabled=true
- Setting yarn.log-aggregation-enable=true
- Setting yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
- Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
- Setting yarn.nodemanager.remote-app-log-dir=/app-logs
- Setting yarn.resourcemanager.resource.tracker.address=resourcemanager:8031
- Setting yarn.resourcemanager.hostname=resourcemanager
- Setting yarn.timeline-service.hostname=historyserver
- Setting yarn.log.server.url=http://historyserver:8188/applicationhistory/logs/
- Setting yarn.resourcemanager.scheduler.address=resourcemanager:8030
- Setting yarn.resourcemanager.address=resourcemanager:8032
Configuring httpfs
Configuring kms
Configuring for multihomed network
➜ docker-hadoop-spark-workbench git:(master) docker run -it --rm --net dockerhadoopsparkworkbench_default --volume /var/log/syslog:/syslog --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -ls /
(same "Configuring ..." startup output as above, omitted)
Found 3 items
-rw-r--r-- 3 root supergroup 0 2017-07-15 11:35 /emptyfile
-rw-r--r-- 3 root supergroup 96264 2017-07-15 11:40 /syslog
drwxr-xr-x - root supergroup 0 2017-07-15 11:14 /user
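The same pattern works for any local file or directory; for instance (./data is just an example path on the host), something along these lines uploads a whole directory:
docker run -it --rm --net dockerhadoopsparkworkbench_default --volume $(pwd)/data:/data --env-file $(pwd)/hadoop.env bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -copyFromLocal /data /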
Hello Ivan,
Thank you so much for taking the time to write this up for me. I tested all of the options, and they all ran successfully.
But now I'm facing another issue. I tried to follow your steps to work with the .csv file, but I'm running into problems. Do you want me to open another ticket or something?
BTW, is there a way to communicate with you (Email, Skype, DM via Twitter, etc)?
Thanks again for your help, totally appreciated.
@ashemery please open another ticket. It's better to keep the discussion in the open in case someone else has the same questions.
Hello Ivan,
First, thanks for your response on Twitter, and for the whole project.
The issue I'm facing: I went through your blog post here: https://medium.com/@ivanermilov/scalable-spark-hdfs-setup-using-docker-2fd0ffa1d6bf
I created the network, and then used the commands in this repo to start my cluster. Commands used:
Now everything is working, and I can confirm that by checking the web interfaces for each service. The only thing I can't get to move forward is copying files. I've tried these with no luck:
First try:
docker run -it --rm --env-file=../hadoop-hive.env --net hadoop uhopper/hadoop hadoop fs -mkdir -p /user/root
I noticed that maybe this uhopper/hadoop image isn't from the same cluster, so I tried these:
docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root
AND
docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root
docker run -it --rm --env-file=../hadoop-hive.env --volume $(pwd):/data --net hadoop bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8 fs -put /data/vannbehandlingsanlegg.csv /user/root
None of these worked. All of them give me the same error message.
Notes: I set HOST_RESOLVER=files_only and added an entry in my /etc/hosts for the namenode, but still nothing! Is there something I'm missing here?
Thank you.