achintya-kumar closed this issue 7 years ago
Hi, @HorizonNet and @tantalus1984, I have a question. I am using the QuickStart VM. My MapReduce job, which imports Locations.csv into an HBase table, runs perfectly fine on my VM. However, when I packaged it and tried running it on a real cluster of 2 nodes, the job gets 'submitted' but remains 'unassigned'. Google and Stack Overflow did not help either. Do you know what the problem could be? Thanks! :)
As mentioned yesterday in the workshop, this problem can have multiple causes. The best starting point would be to go through the logs of the job or of the role; normally there should be an error or a warning there.
Hi! Thanks for the response. You were right. When we checked free memory, the node had less than 1 GB free (because we had installed all the Hadoop services). That was not enough to meet even the minimum container requirements. We stopped every service we didn't need for this particular task, including the Cloudera Management Service. Then we reduced the following parameter, and the job eventually ran:
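(The exact parameter is not preserved in this thread. Purely for illustration: the YARN setting that controls the minimum memory granted per container is yarn.scheduler.minimum-allocation-mb, which defaults to 1024 MB; lowering it in yarn-site.xml would let containers be scheduled on a node with less than 1 GB free. Whether this was the parameter changed here is an assumption.)

```xml
<!-- yarn-site.xml: minimum memory (MB) YARN allocates per container.
     Illustrative value only; the thread does not name the actual parameter. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>
```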
Thanks!
Below is a short review.
Tasks
Summary:
You're done with this one. Good work.
Greetings, @HorizonNet, @tantalus1984 :) I have implemented a BloomFilter with a bitArraySize=10000, as required.
However, upon checking with the size() method of the BitSet field, I get an output of 10048. Is it due to some kind of overhead in the BitSet data structure? I have spent some time looking through the code for any value that might lead to setting a bit position higher than 10k, but did not find anything. The code also mods the hashes by 10k before adding them to the bit set.
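For context, the 10048 can be reproduced with a plain BitSet, independent of the filter code: java.util.BitSet stores its bits in 64-bit long words, so a requested capacity of 10000 rounds up to ceil(10000/64) = 157 words, i.e. 157 × 64 = 10048 bits. A minimal sketch:

```java
import java.util.BitSet;

public class BitSetSizeDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet(10000);
        // size() reports the allocated capacity in bits. BitSet backs its
        // storage with 64-bit long words, so 10000 requested bits become
        // ceil(10000 / 64) = 157 words = 10048 bits.
        System.out.println(bits.size());    // 10048

        // length() is the logical size: highest set bit index + 1.
        bits.set(9999);
        System.out.println(bits.length());  // 10000
    }
}
```

In other words, size() reflects storage overhead, not set bits; as long as the code mods every hash by 10000 before calling set(), no bit above index 9999 can be set.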
What do you think could be the problem?
Here's the implementation for your reference: https://github.com/achintya-kumar/BD2017/blob/master/labs/3-hbase-spark/2-locations/src/main/java/bloomFilter/BloomFilter.java