bingweiliu / Naive-Bayes-Classifier-Hadoop


regarding preprocessing dataset request #1

Open karthikeyana opened 9 years ago

karthikeyana commented 9 years ago

Can you post the data preprocessing program on our blog?

bingweiliu commented 9 years ago

Karthikeyana, what do you mean by posting the program to your blog? Where is your blog? The pre-processing program is a simple script that processes every interview into one line and removes unneeded items.

karthikeyana commented 9 years ago

# Python 2 script (uses raw_input and binary-mode csv files)
import csv
import glob
import os

directory = raw_input("INPUT Folder:")
output = raw_input("OUTPUT Folder:")

# Match every .txt file in the input folder
txt_files = os.path.join(directory, '*.txt')

for txt_file in glob.glob(txt_files):
    with open(txt_file, "rb") as input_file:
        # Split each input line on '=' and write the fields out as CSV
        in_txt = csv.reader(input_file, delimiter='=')
        filename = os.path.splitext(os.path.basename(txt_file))[0] + '.csv'

        with open(os.path.join(output, filename), 'wb') as output_file:
            out_csv = csv.writer(output_file)
            out_csv.writerows(in_txt)

Sir, I am using this code to convert all the txt files to csv, but I did not get this format. Please help me.

:POS: :41: i disagree with the reviewers who said the movie was predictable and drawn out it was a movie with heart and you could feel the main characters plight when he lost his companion being an animal lover i was pulling for the happy ending of course i am disney s biggest fan and i love this movie right along with the others p s i am a grandmother to eleven thank heavens for disney movies
:POS: :85: sit back and enjoy the interesting and exciting story of the count of monte cristo great rainy day movie
:POS: :95: a very well done film and an excellent cast i d put it right up with the three and four musketeers movies york reed chamberlain heston etc
:POS: :96: this is an excellent movie and i never read the book the acting and the plot was very nice done it is one of my favorite movies
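For reference, the sample above looks like one review per line, written as `:LABEL: :ID:` followed by lowercased text with punctuation replaced by spaces. The following is only a minimal Python 3 sketch of that layout, not the repository author's actual script: the function names `clean_text` and `format_review` are hypothetical, and keeping digits in the text is an assumption.

```python
import re


def clean_text(text):
    """Lowercase the text and replace each run of characters other than
    a-z and 0-9 (punctuation, newlines, etc.) with a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def format_review(label, doc_id, text):
    """Build one line in the ':LABEL: :ID: text' layout seen in the sample.
    The layout is inferred from the sample only (assumption)."""
    return ":{}: :{}: {}".format(label, doc_id, clean_text(text))


# A multi-line review collapses into a single training line.
print(format_review("POS", 85, "Sit back and enjoy!\nGreat rainy-day movie."))
```

Running it prints `:POS: :85: sit back and enjoy great rainy day movie`. Writing these lines directly to a text file, rather than through `csv.writer`, avoids any CSV quoting being added to the output.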

karthikeyana commented 9 years ago

Sir, can you post the script in the comment box?

karthikeyana commented 9 years ago

15/03/05 02:26:39 INFO input.FileInputFormat: Total input paths to process : 2
15/03/05 02:26:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/05 02:26:39 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/05 02:26:40 INFO mapred.JobClient: Running job: job_201503042232_0030
15/03/05 02:26:41 INFO mapred.JobClient: map 0% reduce 0%
15/03/05 02:26:59 INFO mapred.JobClient: map 100% reduce 0%
15/03/05 02:27:16 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_0, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:16 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_0, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:26 INFO mapred.JobClient: map 100% reduce 6%
15/03/05 02:27:28 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_1, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:29 INFO mapred.JobClient: map 100% reduce 0%
15/03/05 02:27:29 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_1, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:37 INFO mapred.JobClient: map 100% reduce 3%
15/03/05 02:27:38 INFO mapred.JobClient: map 100% reduce 6%
15/03/05 02:27:39 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_2, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:40 INFO mapred.JobClient: map 100% reduce 3%
15/03/05 02:27:40 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_2, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
    at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

karthikeyana commented 9 years ago

This is the error message I get when running on single-node Hadoop.
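The `NumberFormatException` in the log reports the input string `1"`, that is, a number with a stray double quote attached, which `Integer.valueOf` cannot parse. One possible source (an assumption, the thread does not confirm it) is the CSV conversion step: Python's `csv.writer` wraps any field that contains a comma or a quote in double quotes, so those quote characters end up inside the training file. The Python 3 sketch below illustrates that behaviour and a simple check for affected lines; the helper names `csv_encode` and `lines_with_quotes` are hypothetical.

```python
import csv
import io


def csv_encode(fields):
    """Return the text csv.writer produces for one row (default settings)."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


def lines_with_quotes(lines):
    """Return (line_number, line) pairs for lines containing a double quote."""
    return [(n, line) for n, line in enumerate(lines, start=1) if '"' in line]


# A field containing a comma is wrapped in double quotes by csv.writer.
encoded = csv_encode(["great movie, loved it 1"])
print(repr(encoded))

# Splitting on whitespace then leaves a quote glued to the last token.
print(encoded.split()[-1])

# Flag suspicious lines before submitting the job.
print(lines_with_quotes([":POS: :85: great movie", encoded]))
```

If `lines_with_quotes` reports anything for the training files, writing the preprocessed lines as plain text instead of going through `csv.writer` would be one way to rule this cause out.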