amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

[Sitemaps] Add Tika Support #38

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Sitemap parsing should use Tika instead of the current implementation in two 
places:

1. The parser currently has two public methods to activate it, and both take 
the media type (content type) as an argument. I suggest adding two new parsing 
methods that use Tika to detect the media type instead. The parsing methods 
would be as follows:

public AbstractSiteMap parseSiteMap(URL url);
public AbstractSiteMap parseSiteMap(File file);

The content of these methods will be something like:
byte[] bytes = IOUtils.toByteArray(onlineSitemapUrl);
String contentType = new Tika().detect(bytes);

return parseSiteMap(contentType, bytes, onlineSitemapUrl);

The new methods I suggest above will be very convenient for the light user who 
only wants to parse a simple sitemap without getting into any nitty gritty - I 
believe many people will appreciate it.
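
To make the idea behind these convenience methods concrete, here is a minimal, dependency-free sketch of content-based detection. It is not Tika and not the crawler-commons API: `SitemapTypeSniffer`, `detect` and `readAndDetect` are hypothetical names, and the magic-byte check is only a rough stand-in for what `new Tika().detect(bytes)` would do in the proposal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (assumption): a crude stand-in for Tika's byte-based detector. */
final class SitemapTypeSniffer {

    private SitemapTypeSniffer() {}

    /** Returns a best-guess media type based on the first bytes of the content. */
    static String detect(byte[] bytes) {
        // gzip streams begin with the magic bytes 0x1f 0x8b
        if (bytes.length >= 2 && (bytes[0] & 0xff) == 0x1f && (bytes[1] & 0xff) == 0x8b) {
            return "application/gzip";
        }
        // Inspect only a short prefix, decoded as UTF-8
        int n = Math.min(bytes.length, 256);
        String head = new String(bytes, 0, n, StandardCharsets.UTF_8);
        if (head.startsWith("\uFEFF")) {
            head = head.substring(1); // drop a leading byte order mark
        }
        head = head.trim();
        // An XML declaration or a sitemap root element suggests XML
        if (head.startsWith("<?xml") || head.startsWith("<urlset") || head.startsWith("<sitemapindex")) {
            return "application/xml";
        }
        // Anything else is treated as a plain-text URL list
        return "text/plain";
    }

    /** Reads a local file and detects its type, mirroring the proposed File overload. */
    static String readAndDetect(Path file) throws IOException {
        return detect(Files.readAllBytes(file));
    }
}
```

A `parseSiteMap(File file)` overload could then pass the detected type and the bytes on to the existing method that takes a content type, as in the snippet above.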

2. Change the MIME type parsing to use Tika's MediaType.
So instead of this code:
if (url.getPath().endsWith(".xml") || contentType.contains("text/xml")
        || contentType.contains("application/xml") || contentType.contains("application/x-xml")
        || contentType.contains("application/atom+xml") || contentType.contains("application/rss+xml")) {
    // Try parsing the XML which could be in a number of formats
    return processXml(url, content);
} else if (url.getPath().endsWith(".txt") || contentType.contains("text/plain")) {
    // plain text
    return (AbstractSiteMap) processText(content, url.toString());
} else if (url.getPath().endsWith(".gz") || contentType.contains("application/gzip")
        || contentType.contains("application/x-gzip") || contentType.contains("application/x-gunzip")
        || contentType.contains("application/gzipped") || contentType.contains("application/gzip-compressed")
        || contentType.contains("application/x-compress") || contentType.contains("gzip/document")
        || contentType.contains("application/octet-stream")) {
    return processGzip(url, content);
}

I want to use something like the following:
String mediaType = MediaType.parse(contentType).toString();
if (mediaType.contains(MediaType.APPLICATION_XML.getSubtype())) {
    return processXml(url, content);
} else if (mediaType.contains(MediaType.APPLICATION_ZIP.getSubtype())) {
    return processGzip(url, content);
} else if (mediaType.contains(MediaType.TEXT_PLAIN.getType())) {
    return (AbstractSiteMap) processText(content, url.toString());
}
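
One caveat with substring matching as proposed: `MediaType.APPLICATION_ZIP.getSubtype()` is "zip", so a plain `application/zip` archive would also be routed to the gzip branch, and `MediaType.TEXT_PLAIN.getType()` is "text", which matches any `text/*` type such as `text/html`. The sketch below shows one possible alternative that compares the normalized base type exactly. It does not use Tika so that it stays self-contained; `SitemapDispatchSketch`, `Kind`, `baseType` and `classify` are hypothetical names, and the type lists are simply copied from the existing code quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch (assumption): exact matching on the normalized base media type. */
final class SitemapDispatchSketch {

    /** Which handler a sitemap would be routed to. */
    enum Kind { XML, TEXT, GZIP, UNKNOWN }

    // Type lists taken from the existing if/else chain
    private static final Set<String> XML_TYPES = new HashSet<>(Arrays.asList(
            "text/xml", "application/xml", "application/x-xml",
            "application/atom+xml", "application/rss+xml"));

    private static final Set<String> GZIP_TYPES = new HashSet<>(Arrays.asList(
            "application/gzip", "application/x-gzip", "application/x-gunzip",
            "application/gzipped", "application/gzip-compressed",
            "application/x-compress", "gzip/document", "application/octet-stream"));

    private SitemapDispatchSketch() {}

    /** Strips parameters such as "; charset=UTF-8" and lowercases the type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Keeps the URL-suffix checks of the existing code, but compares types exactly. */
    static Kind classify(String contentType, String path) {
        String base = baseType(contentType);
        if (path.endsWith(".xml") || XML_TYPES.contains(base)) {
            return Kind.XML;
        } else if (path.endsWith(".txt") || base.equals("text/plain")) {
            return Kind.TEXT;
        } else if (path.endsWith(".gz") || GZIP_TYPES.contains(base)) {
            return Kind.GZIP;
        }
        return Kind.UNKNOWN;
    }
}
```

With Tika on the classpath, `MediaType.parse(contentType)` could presumably replace `baseType`, with the comparison done on the parsed type rather than on its string form.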

Original issue reported on code.google.com by avrah...@gmail.com on 19 Apr 2014 at 8:20

GoogleCodeExporter commented 8 years ago
This issue has been superseded by 39 & 40, and we are therefore closing it.

Original comment by lewis.mc...@gmail.com on 26 Apr 2014 at 8:01