What steps will reproduce the problem?
1. Use the following URLs in seed.txt:
http://www.epa.gov/espanol/
http://en.wikipedia.org/wiki/Infection_in_childcare
http://www.italianinelmondo.com/
http://www.urdupoint.com/
http://www.bbc.com/news/
http://www.eeas.europa.eu/delegations/cameroon/index_fr.htm
2. Use Apache Nutch 1.8, custom built against Cloudera's Hadoop distribution
(CDH) 5.1.2, with Solr 4.9.
3. To allow only English-language websites to be indexed in Solr, keep only the
en profile in the profiles directory and modify the LanguageDetector class to
give priority only to English, following the steps described at:
https://code.google.com/p/language-detection/wiki/NutchPlugin
What is the expected output? What do you see instead?
Only English-language websites should be indexed in Solr.
What version of the product are you using? On what operating system?
OS: Linux, 64-bit
Apache Nutch 1.8
Apache Solr 4.9
Cloudera CDH 5.1.2
Please provide any additional information below.
The library does not filter languages consistently. In some runs, when seed.txt
had only 1 English website and 1 Spanish website, only the English page appeared
in the Solr admin and the Spanish page was filtered out, as expected.
When seed.txt has the 6 URLs listed above (2 English, 1 Spanish, 1 Italian,
1 French, 1 Urdu), the results are wrong.
Expected output: only the 2 English URLs in the Solr admin.
Actual output: the 2 English URLs are indexed, but the Italian, French, and Urdu
URLs are indexed as well.
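To make the expected behaviour concrete, here is a minimal sketch of the decision rule I am after. It is not the plugin's actual code: the class name EnglishOnlyFilter, the method shouldIndex, and the 0.9 threshold are all hypothetical, and the map of language codes to probabilities stands in for whatever the detector returns for a page.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch or the language-detection library.
class EnglishOnlyFilter {
    // Assumed confidence threshold; chosen for illustration only.
    static final double MIN_PROBABILITY = 0.9;

    // Returns true only when the most probable language is "en"
    // and its probability reaches the threshold.
    static boolean shouldIndex(Map<String, Double> langProbabilities) {
        String best = null;
        double bestProb = 0.0;
        for (Map.Entry<String, Double> e : langProbabilities.entrySet()) {
            if (e.getValue() > bestProb) {
                best = e.getKey();
                bestProb = e.getValue();
            }
        }
        // An empty map (no language detected) leaves best == null and is rejected.
        return "en".equals(best) && bestProb >= MIN_PROBABILITY;
    }
}
```

With this rule, a page scored as mostly Italian, French, or Urdu would be rejected, and a page for which no language could be detected would be rejected too, instead of being indexed by default.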
Any help or suggestions for getting the correct results would be appreciated.
Thanks,
Guru
Original issue reported on code.google.com by gururajf...@gmail.com on 10 Sep 2014 at 6:38