Open GoogleCodeExporter opened 8 years ago
Will submit the patch after review of issue39 (touches the same file)
Original comment by avrah...@gmail.com
on 7 Jul 2014 at 1:59
Fixed.
Patch attached
Original comment by avrah...@gmail.com
on 7 Jul 2014 at 8:28
Attachments:
Original comment by avrah...@gmail.com
on 12 Jul 2014 at 8:22
Original comment by avrah...@gmail.com
on 18 Jul 2014 at 8:05
-1 there are cases where we genuinely want to force the mime-type (e.g. when
the suffix used by a given site is misleading or when the server returns an
incorrect one). Removing this functionality is not the right approach.
Relying on Tika for detecting the mimetype is a good idea. What could be done
would be to keep the functionality as I implemented it but either :
* make so that the mimetype passed by the user is used only for the first file,
not the subsequent ones
* and/or give Tika a hint about the mimetype, which can be done via the
metadata in
[https://tika.apache.org/0.9/api/org/apache/tika/detect/Detector.html#detect(jav
a.io.InputStream, org.apache.tika.metadata.Metadata)]
Original comment by digitalpebble
on 21 Nov 2014 at 10:07
I agree that we need to let Tika determine what media type the url consists of.
I also like the idea of having our user, use your class with only a sitemap URL
in hand and we take care of everything else.
That's why I wrote code for the tool to point at the new parsing method (in
sitemapParser): public AbstractSiteMap parseSiteMap(URL onlineSitemapUrl)
That method uses Tika's first bytes magic (+filename) to detect the MediaType:
String contentType = tika.detect(bytes, filename);
Having the above reasoning, does your objection still stand ?
Original comment by avrah...@gmail.com
on 22 Nov 2014 at 4:10
Avi, you haven't addressed my concerns at all. Could you please re-read my
comments above and let me know what you think of my suggestions. The whole
point was to be able to influence the mimetype detection, which Tika allows us
to do.
Original comment by digitalpebble
on 26 Nov 2014 at 2:26
Just so I understand you, you think it best not to rely soley on Tika, but to
enable the user to add the specific mime type for the first iteration or add a
mime type hint, right ?
Original comment by avrah...@gmail.com
on 26 Nov 2014 at 3:16
Exactly. See
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/util/MimeU
til.java#L191] for an example of how we pass hints to the Detector in Nutch.
Original comment by digitalpebble
on 1 Dec 2014 at 11:08
Original comment by avrah...@gmail.com
on 12 Jan 2015 at 10:59
Original comment by avrah...@gmail.com
on 23 Jan 2015 at 8:00
Original issue reported on code.google.com by
avrah...@gmail.com
on 4 Jul 2014 at 9:26