bejean / crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
www.crawl-anywhere.com
Apache License 2.0

Title not parsed correctly for some international sites. #46

Closed: aravinuthala closed this issue 9 years ago

aravinuthala commented 10 years ago

For the page http://insurance.rakuten.co.jp/ the title is not parsed correctly. I checked the code, and this looks like a bug in the TikaWrapper being used.

Title on page: "楽天: 保険一括見積もり|自動車保険・ペット保険・海外旅行保険を比較"

Title reported by crawler: ??: ??????????????????????????????

Thanks, Aditya

bejean commented 10 years ago

Where do you see the "?????????" characters? In the log files, or in the search interface?

aravinuthala commented 10 years ago

Hi,

I wrote a test case, and when it prints the title, the output looks like that. I also ran the crawl and checked the document in Solr, where the title looks the same.

bejean commented 9 years ago

The title is correct with the latest tika-wrapper.

The pipeline produces:

url = http://insurance.rakuten.co.jp/
title = 楽天: 保険一括見積もり|自動車保険・ペット保険・海外旅行保険を比較