KalyanHadoopRealTimeProjects-1 / project-batch1-team2

http://www.bigdatatraininghyderabad.com

Real Time Big Data Projects Team Discussions #1

Open kalyanhadooptraining opened 7 years ago

anjijava16 commented 7 years ago

How do we do dynamic partitioning based on the data?

sandeepbommakanti commented 7 years ago

How should we read PDF and image files through MapReduce?

anjijava16 commented 7 years ago

How do we take multiple files in the same directory as input? For example, suppose the directory /app/dev contains abc.txt, abc.csv, abc.json, abc.xml and abc.tsv.

sandeepbommakanti commented 7 years ago

MapReduceTask_1:
➢ Input can be any format like text, pdf, xml, json
➢ Partition the given data based on Country and Status
➢ Output can be any format like text, pdf, xml, json

Solution:

  1. Instead of using a Partitioner, we have to create multiple outputs based on Country and Status.
  2. Use a custom InputFormat and a custom OutputFormat.
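
The first point of the solution can be sketched in plain Java, without any Hadoop dependency, to show the routing logic only. The record layout `id,country,status`, the class name `PartitionSketch` and the output path pattern `country=.../status=...` are assumptions made for illustration; the thread does not specify them. In a real Hadoop job the same path string could be passed as the `baseOutputPath` argument of `MultipleOutputs.write(key, value, baseOutputPath)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Assumed record layout (hypothetical): "id,country,status".
    // Builds the output location for one record from its country and status.
    static String outputPathFor(String line) {
        String[] fields = line.split(",");
        String country = fields[1].trim();
        String status = fields[2].trim();
        return "country=" + country + "/status=" + status;
    }

    // Groups records by output location, one list per (country, status) pair.
    // This stands in for writing each record to its own named output.
    static Map<String, List<String>> partition(List<String> lines) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (String line : lines) {
            outputs.computeIfAbsent(outputPathFor(line), k -> new ArrayList<>()).add(line);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1,India,SUCCESS", "2,USA,FAILED", "3,India,SUCCESS");
        partition(lines).forEach((path, records) -> System.out.println(path + " -> " + records));
    }
}
```

Because the output location is computed per record, the number of outputs follows the data and does not have to be fixed in advance, which is what makes the partitioning dynamic.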

sandeepbommakanti commented 7 years ago

MapReduceTask_2:
➢ Input can be any format like text, pdf, xml, json
➢ Find the top 10 Countries by the number of records whose status is SUCCESS
➢ Output can be any format like text, pdf, xml, json

Solution:

  1. We have to use 2 jobs instead of 1.
  2. The first job should give output as Country and Count().
  3. The output of the first job will be the input for the 2nd job.
  4. The 2nd job will apply the order-by logic (sort by count descending and keep the top 10).

SQL Query: select country, count(1) as cnt from eventlog where status = 'SUCCESS' group by country order by cnt desc limit 10;
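
The two-job flow can be sketched in plain Java, with each job modelled as one method so the hand-off between them is visible. This is not Hadoop code. The record layout `id,country,status`, the class name `TopCountriesSketch` and the tie-break by country name are assumptions made for illustration; the SQL query does not define an order for equal counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopCountriesSketch {
    // Job 1: keep only SUCCESS records and count them per country.
    // Assumed record layout (hypothetical): "id,country,status".
    static Map<String, Long> countSuccessByCountry(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length < 3) {
                continue; // skip malformed records
            }
            if ("SUCCESS".equals(fields[2].trim())) {
                counts.merge(fields[1].trim(), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Job 2: takes the output of job 1, orders by count descending, keeps the first n.
    // Ties are broken by country name (an assumption, see the note above).
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1,India,SUCCESS", "2,USA,FAILED", "3,India,SUCCESS",
                "4,USA,SUCCESS", "5,UK,SUCCESS");
        Map<String, Long> counts = countSuccessByCountry(lines);
        for (Map.Entry<String, Long> e : topN(counts, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real cluster the second job is commonly run with a single reducer so that one task sees every country count before applying the limit.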