bejean / crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
www.crawl-anywhere.com
Apache License 2.0

Title not parsed correctly for some international sites. #83

Open aravinuthala opened 9 years ago

aravinuthala commented 9 years ago

Hi,

The latest version of the Tika wrapper seems to have fixed title parsing for some international sites, but on other sites the title is still not parsed correctly.

Example URL: http://www.houjin-keitai.com/guide/information/iphone/

The output file from the pipeline does not contain the correct title, and when it is indexed into Solr the title shows up as:

????????iPhone 5 ?????????????????????????????????
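
For anyone trying to reproduce this, here is a minimal Java sketch of one common way a title ends up as runs of `?` with the ASCII part intact: the text gets encoded into a charset that cannot represent the Japanese characters. This is only an illustration of the symptom, not a confirmed diagnosis of what the Tika wrapper or the pipeline does. The class name `TitleEncodingDemo`, the helper `roundTrip`, and the sample title are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TitleEncodingDemo {

    // Encode the text into the given charset and decode it back.
    // String.getBytes(Charset) replaces characters the charset cannot
    // represent with the charset's replacement byte ('?' for ISO-8859-1).
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical title: two Japanese characters followed by ASCII text.
        // Unicode escapes keep the example independent of source file encoding.
        String title = "\u643A\u5E2F iPhone 5";

        // UTF-8 can represent every character, so the title survives intact.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8).equals(title)); // prints "true"

        // ISO-8859-1 has no Japanese characters, so each one becomes '?'.
        System.out.println(roundTrip(title, StandardCharsets.ISO_8859_1)); // prints "?? iPhone 5"
    }
}
```

If the output file already contains literal `?` characters, the loss would have happened before Solr indexing; if the file looks correct in a UTF-8 viewer, the indexing step would be the place to look.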