amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons
0 stars 0 forks source link

[Sitemaps] Fix the Tester Util's Logic #43

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
The scenario is running the SitemapTester on a sitemapIndex in a GZ file.

This is the current method stamp of the sitemapParser:
parse(URL url, String mt, boolean recursive)

As you see, the second argument is the MediaType, which never changes in the 
current tool.

So the problem occurred when one tried to parse a GZ file (so he sent 
"application/gzip" as media type) recursively, where inside the SitemapIndex 
were other sitemaps, for example plain/text, and our parser then tried to get 
the URLs from those sitemaps with the Media Type of "application/gz"...

The solution is to use the new method in issue39, where our library detects the 
MediaType on the fly using Tika.

Original issue reported on code.google.com by avrah...@gmail.com on 4 Jul 2014 at 9:26

GoogleCodeExporter commented 8 years ago
Will submit the patch after review of issue39 (touches the same file)

Original comment by avrah...@gmail.com on 7 Jul 2014 at 1:59

GoogleCodeExporter commented 8 years ago
Fixed.

Patch attached

Original comment by avrah...@gmail.com on 7 Jul 2014 at 8:28

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 12 Jul 2014 at 8:22

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 18 Jul 2014 at 8:05

GoogleCodeExporter commented 8 years ago
-1 there are cases where we genuinely want to force the mime-type (e.g. when 
the suffix used by a given site is misleading or when the server returns an 
incorrect one). Removing this functionality is not the right approach.

Relying on Tika for detecting the mimetype is a good idea. What could be done 
would be to keep the functionality as I implemented it but either : 
* make so that the mimetype passed by the user is used only for the first file, 
not the subsequent ones
* and/or give Tika a hint about the mimetype, which can be done via the 
metadata in  
[https://tika.apache.org/0.9/api/org/apache/tika/detect/Detector.html#detect(jav
a.io.InputStream, org.apache.tika.metadata.Metadata)]

Original comment by digitalpebble on 21 Nov 2014 at 10:07

GoogleCodeExporter commented 8 years ago
I agree that we need to let Tika determine what media type the url consists of.
I also like the idea of having our user, use your class with only a sitemap URL 
in hand and we take care of everything else.

That's why I wrote code for the tool to point at the new parsing method (in 
sitemapParser): public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl)

That method uses Tika's first bytes magic (+filename) to detect the MediaType:
String contentType = tika.detect(bytes, filename);

Having the above reasoning, does your objection still stand ?

Original comment by avrah...@gmail.com on 22 Nov 2014 at 4:10

GoogleCodeExporter commented 8 years ago
Avi, you haven't addressed my concerns at all. Could you please re-read my 
comments above and let me know what you think of my suggestions. The whole 
point was to be able to influence the mimetype detection, which Tika allows us 
to do.

Original comment by digitalpebble on 26 Nov 2014 at 2:26

GoogleCodeExporter commented 8 years ago
Just so I understand you, you think it best not to rely soley on Tika, but to 
enable the user to add the specific mime type for the first iteration or add a 
mime type hint, right ?

Original comment by avrah...@gmail.com on 26 Nov 2014 at 3:16

GoogleCodeExporter commented 8 years ago
Exactly. See 
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/util/MimeU
til.java#L191] for an example of how we pass hints to the Detector in Nutch. 

Original comment by digitalpebble on 1 Dec 2014 at 11:08

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 12 Jan 2015 at 10:59

GoogleCodeExporter commented 8 years ago

Original comment by avrah...@gmail.com on 23 Jan 2015 at 8:00