amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

[Sitemaps] SiteMapParser Tika detection doesn't work well in some cases #47

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
When using the parse method that takes only a sitemap URL, we use Tika to 
detect the MIME type.

In some cases, the detection is incorrect.

We need to use a more reliable Tika detection call.

Use:
new Tika().detect(URL)

Instead of the current:
new Tika().detect(bytes)

Original issue reported on code.google.com by avrah...@gmail.com on 12 Jul 2014 at 8:29

GoogleCodeExporter commented 8 years ago
Will submit the patch after issue40 is submitted (it touches the same file).

Original comment by avrah...@gmail.com on 12 Jul 2014 at 8:30

GoogleCodeExporter commented 8 years ago
Scenario where the bug can be reproduced:
Run the SitemapParser Tool on the following URL:
http://www.amazon.com/sitemap_video.xml

Original comment by avrah...@gmail.com on 12 Jul 2014 at 8:32

GoogleCodeExporter commented 8 years ago
A new parse(Url url) method was introduced in issue39.

Using that method in SitemapTool is tracked in issue43 (not yet committed to svn).

Original comment by avrah...@gmail.com on 14 Jul 2014 at 1:21

GoogleCodeExporter commented 8 years ago
new Tika().detect(URL) will solve the problem described above.

BUT it will cause our library to fetch the sitemap twice (once for detection, once for parsing).

We should look for a better solution.
Maybe use new Tika().detect(bytes, filename);

Original comment by avrah...@gmail.com on 16 Jul 2014 at 5:24
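
A minimal sketch of the detect(bytes, filename) idea, using only the JDK so it runs without Tika on the classpath. The class name SitemapNameHint, the helper fileNameOf, and the BiFunction stand-in for the detector are assumptions made for illustration and are not part of crawler-commons. With Tika available, the detector argument should be replaceable by a method reference to Tika's detect(byte[], String).

```java
import java.net.URI;
import java.util.function.BiFunction;

/** Hypothetical helper: derives a file-name hint from a sitemap URL. */
final class SitemapNameHint {
    private SitemapNameHint() {}

    /** Returns the last path segment of the URL, or "" if there is none. */
    static String fileNameOf(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            return "";
        }
        // The query string is not part of getPath(), so "?x=1" is already gone.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /**
     * Detects the MIME type from content that was already fetched,
     * so the URL is not downloaded a second time.
     * The detector is a stand-in for a (bytes, name) -> type call.
     */
    static String detect(BiFunction<byte[], String, String> detector,
                         byte[] content, String url) {
        return detector.apply(content, fileNameOf(url));
    }
}
```

The point of the design is that the caller fetches the sitemap once, keeps the bytes for parsing, and passes only the name derived from the URL as an extra hint to the detector.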

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Jul 2014 at 8:05

GoogleCodeExporter commented 8 years ago
Let's instantiate the Tika instance only once and reuse it. Otherwise we have 
to reload the Tika config every time, which is definitely not needed. (Julien)

Original comment by avrah...@gmail.com on 1 Aug 2014 at 3:51
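
A rough sketch of the "instantiate once and reuse" suggestion. CostlyDetector is a hypothetical stand-in for an object that is expensive to construct (as the Tika instance is said to be above), and SharedDetectorParser is an invented name; neither reflects the actual patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a detector that is expensive to build. */
final class CostlyDetector {
    /** Counts constructor calls so the reuse can be observed. */
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    CostlyDetector() {
        // A real detector would load its configuration here.
        CONSTRUCTIONS.incrementAndGet();
    }

    /** Toy detection based on the file name only. */
    String detect(byte[] content, String name) {
        return name.endsWith(".xml") ? "application/xml" : "application/octet-stream";
    }
}

final class SharedDetectorParser {
    // Built once when the class is initialized, then reused by every call.
    private static final CostlyDetector DETECTOR = new CostlyDetector();

    private SharedDetectorParser() {}

    static String mimeTypeOf(byte[] content, String name) {
        return DETECTOR.detect(content, name);
    }
}
```

Holding the detector in a static final field means the construction cost is paid once per class load instead of once per parse call.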

GoogleCodeExporter commented 8 years ago
I will begin working on this one

Original comment by avrah...@gmail.com on 6 Aug 2014 at 7:10

GoogleCodeExporter commented 8 years ago
Attached is a patch with the required optimization.

Tika detection is now called with the byte array plus the filename.

The Tika object is now instantiated only once.

Original comment by avrah...@gmail.com on 18 Aug 2014 at 8:10


GoogleCodeExporter commented 8 years ago
+1, ship it. Thanks!

Original comment by lewis.mc...@gmail.com on 18 Aug 2014 at 9:22

GoogleCodeExporter commented 8 years ago
Shipped in revision: r134

Original comment by avrah...@gmail.com on 19 Aug 2014 at 7:10