arimus / jmimemagic

jMimeMagic is a Java library for determining the MIME type of files or streams.
http://sourceforge.net/projects/jmimemagic/
Apache License 2.0

Html Sgml confusion #27

Closed ekremucar closed 9 years ago

ekremucar commented 9 years ago

I tried to match an HTML file, but its MIME type was detected as SGML. Both start with 'doctype', but the HTML file continues with 'html'. Maybe the matchers need to be ordered.

aurelien-baudet commented 9 years ago

Same issue for me. Is there a way to make it work anyway?

arimus commented 9 years ago

Currently, the following matchers already exist with a higher precedence than the sgml matcher:

<match>
    <mimetype>text/html</mimetype>
    <extension>html</extension>
    <description>HTML document text</description>
    <test offset="0" type="string" comparator="=">&lt;!DOCTYPE HTML</test>
</match>
<match>
    <mimetype>text/html</mimetype>
    <extension>html</extension>
    <description>HTML document text</description>
    <test offset="0" type="string" comparator="=">&lt;!doctype html</test>
</match>

Does your document have something other than exactly the following at position 0 in the file? Note that the default matchers are exact matches and don't ignore whitespace, etc.

<!DOCTYPE HTML or <!doctype html

aurelien-baudet commented 9 years ago

I found why the detection is not working. The match is case sensitive, and the file starts with: <!DOCTYPE html

There is no entry for this case. There is also no entry for the case <!doctype HTML.

Is there a way to indicate that the match is not case sensitive (maybe a comparator other than =)? If it doesn't exist, it could be a good feature to add.

For the moment, I have added two entries to my custom magic.xml file (but I also have to copy the dtd...).

arimus commented 9 years ago

You can, just not with the string matcher. You'll need to use the regex matcher type. See the magic.xml for a couple of examples. Sorry for the bad paste above. There are existing matchers for this that are already regex; they just aren't using the /i flag.

<match>
    <mimetype>text/html</mimetype>
    <extension>html</extension>
    <description>HTML Document</description>
    <test offset="0" type="regex" comparator="=">/^\s*&lt;!DOCTYPE HTML PUBLIC/</test>
</match>
<match>
    <mimetype>text/html</mimetype>
    <extension>html</extension>
    <description>HTML Document</description>
    <test offset="0" type="regex" comparator="=">/^\s*&lt;html&gt;/</test>
</match>
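
To illustrate why the exact string matchers miss `<!DOCTYPE html` while a case-insensitive regex would catch it, here is a minimal sketch using plain `java.util.regex`. It is not jMimeMagic's internal code: the class `DoctypeSniffer` and its methods are hypothetical names made up for this example, and `Pattern.CASE_INSENSITIVE` stands in for the `/i` flag mentioned above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jMimeMagic.
final class DoctypeSniffer {

    // Same idea as the regex matchers above, with case-insensitivity added.
    // Pattern.CASE_INSENSITIVE plays the role of a Perl-style /i flag.
    private static final Pattern HTML_DOCTYPE =
            Pattern.compile("^\\s*<!doctype\\s+html", Pattern.CASE_INSENSITIVE);

    // Exact, case-sensitive prefix test, comparable to a "string" test at offset 0.
    static boolean exactMatch(String head, String expected) {
        return head.startsWith(expected);
    }

    // Case-insensitive test that also tolerates leading whitespace.
    static boolean regexMatch(String head) {
        return HTML_DOCTYPE.matcher(head).find();
    }

    public static void main(String[] args) {
        String head = "<!DOCTYPE html>\n<html>";
        // Neither of the two exact variants from the thread matches this mixed-case header.
        System.out.println(exactMatch(head, "<!DOCTYPE HTML"));
        System.out.println(exactMatch(head, "<!doctype html"));
        // The case-insensitive regex does.
        System.out.println(regexMatch(head));
    }
}
```

A single case-insensitive pattern covers all four spellings discussed in this thread (`<!DOCTYPE HTML`, `<!doctype html`, `<!DOCTYPE html`, `<!doctype HTML`), so no extra entries per case variant would be needed.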