asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Use Tika's MediaTypes instead of self parsing strings #280

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
In Utils.java we should use Tika to parse the MIME type instead of guessing
whether the content is binary by string-matching on the content type.
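
To illustrate why substring matching on a content type is fragile, here is a minimal sketch. `looksBinaryBySubstring` is a hypothetical stand-in for the kind of check described above, not the actual code in Utils.java, and the sample content types are chosen only for illustration.

```java
import java.util.List;
import java.util.Locale;

final class ContentTypeGuessDemo {

    // Hypothetical substring-based check (an assumption for illustration,
    // NOT the real Utils.java implementation).
    static boolean looksBinaryBySubstring(String contentType) {
        if (contentType == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        return type.contains("image") || type.contains("audio")
                || type.contains("video") || type.contains("application");
    }

    public static void main(String[] args) {
        // application/json, application/xhtml+xml and image/svg+xml are all
        // text-based formats, yet the substring check flags them as binary.
        List<String> samples = List.of(
                "text/html; charset=UTF-8",
                "image/png",
                "application/json",
                "application/xhtml+xml",
                "image/svg+xml");
        for (String contentType : samples) {
            System.out.println(contentType + " -> binary? "
                    + looksBinaryBySubstring(contentType));
        }
    }
}
```

A check based on parsed media types and their declared relationships would not depend on which words happen to appear in the type name.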

Original issue reported on code.google.com by avrah...@gmail.com on 17 Aug 2014 at 4:30

GoogleCodeExporter commented 9 years ago
This is Jukka's answer about this subject:
Hi,

On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <avraham2@gmail.com> wrote:
> How do I identify content types which can't be read as text (in notepad for
> example) because they have some binary content in them.

You can use the media type relationship information stored in
Tika's type registry, like this:

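    import org.apache.tika.Tika;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MediaTypeRegistry;

    Tika tika = new Tika();
    MediaType type = MediaType.parse(tika.detect(...));

    MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
    // isInstanceOf(a, b): true if a equals b or is a specialization of b
    if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
        // process text
    } else {
        // process binary
    }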

> [...] if it finds text-parsable content, I want it to take the content as it is

Note that consuming text data can be surprisingly difficult given all
the different character encodings out there. Tika's parser classes
contain quite a bit of logic for automatically figuring out the
correct character encoding and other details needed for correctly
consuming text data.
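
As a small illustration of the encoding problem, the sketch below decodes the same bytes with two different charsets using only the Java standard library. The class and method names are hypothetical, and the sample string is an assumption chosen to contain one non-ASCII character; it does not show how Tika detects encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Decode raw bytes using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the last letter;
        // in UTF-8 that last character is encoded as two bytes (0xC3 0xA9).
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String correct = decode(utf8Bytes, StandardCharsets.UTF_8);
        String garbled = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The wrong charset turns the two-byte sequence into two separate
        // characters, so the decoded strings differ in length and content.
        System.out.println("UTF-8 length:      " + correct.length());
        System.out.println("ISO-8859-1 length: " + garbled.length());
        System.out.println("Equal? " + correct.equals(garbled));
    }
}
```

Nothing in the bytes themselves says which decoding is right, which is why a crawler that reads text "as is" still needs some way to determine the encoding.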

What's your reason for wanting to process text data separately? Is
there some missing feature in Tika that would help achieve your use
case without the need for custom processing of text data?

For example the HTML parser supports the IdentityHtmlMapper feature
for skipping the HTML simplification that Tika does by default. To
activate that feature, you can pass an IdentityHtmlMapper instance in
the parse context:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());

--
Jukka Zitting

Original comment by avrah...@gmail.com on 17 Aug 2014 at 5:00