ebandal / H2Orestart

An extension that lets LibreOffice read Hancom files.
GNU General Public License v3.0
75 stars, 6 forks

Consider detecting file type by mime type instead of extension #7

Closed. deeplow closed this issue 1 year ago

deeplow commented 1 year ago

As it stands, the plugin decides the file type based on the file extension, which I believe happens in the following lines:

https://github.com/ebandal/H2Orestart/blob/df06ca5a9f931ae395bd379a8bae4e3b7d32e84f/source/soffice/WriterContext.java#L86-L92

However, on Linux systems, generally the file extension doesn't matter much and ideally the file can still be detected based on the mime type.

Would it be possible to use mime types instead of the extensions?

deeplow commented 1 year ago

A contributor of dangerzone broke down some of this nuance here:

MIME types

HWP and HWPX use custom MIME types that are not registered with IANA. Each format also has multiple MIME types, so all of them need to be added. Some recommend application/vnd.hancom.*, but a wildcard may not be supported in this code base and may lead to security problems.

  • hwp: application/x-hwp, application/haansofthwp, application/vnd.hancom.hwp
  • hwpx: application/haansofthwpx, application/vnd.hancom.hwpx

Reference (in Korean)
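
The explicit list quoted above could be expressed as a small allowlist lookup rather than a wildcard match. This is a minimal sketch only: the class name `HancomMimeTypes` and the method `formatFor` are hypothetical and do not exist in H2Orestart or dangerzone, and the only MIME types it knows are the five listed in the quote.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of H2Orestart or dangerzone. */
final class HancomMimeTypes {
    // Explicit allowlist copied from the list above.
    // Deliberately no "application/vnd.hancom.*" wildcard.
    private static final Map<String, String> FORMAT_BY_MIME = Map.of(
        "application/x-hwp", "hwp",
        "application/haansofthwp", "hwp",
        "application/vnd.hancom.hwp", "hwp",
        "application/haansofthwpx", "hwpx",
        "application/vnd.hancom.hwpx", "hwpx");

    private HancomMimeTypes() {}

    /** Returns "hwp" or "hwpx" for a listed MIME type, or empty if it is not listed. */
    static Optional<String> formatFor(String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=..." and normalise case before the lookup.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(FORMAT_BY_MIME.get(base));
    }
}
```

With an exact-match table, an unlisted type such as application/vnd.hancom.hcdt simply returns empty instead of being accepted by a pattern.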

ebandal commented 1 year ago

@deeplow thank you for the advice. You're right. Until now, H2Orestart has used the file extension to decide whether a file is hwp or hwpx. That lets it open Hancom files quickly, because it doesn't need to parse the file structure, but I agree it is not a safe approach. As for MIME types, LibreOffice never passes a MIME type when it asks a plugin whether it can open a file. It passes only the file path. So I will take a compromise approach.
The plugin will detect the file type by parsing the actual file structure as far as the hwpx header or the hwp header, trying them one by one, even if that adds a few seconds before the file opens. It will no longer look at the file extension. This change will be included in v0.5.6. Thank you.
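
As a rough illustration of content-based detection, here is a minimal sketch that looks only at the container signature. It is not the code that went into v0.5.6, and the names `HancomSniffer`, `Kind` and `sniff` are hypothetical. It assumes HWP 5.x files are OLE2 compound files and HWPX files are ZIP packages. Other formats share these containers (for example .doc and .docx), so the result is only a candidate. A real check still has to read on to the hwp or hwpx header, as described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical first-stage sniffer; not H2Orestart's actual implementation. */
final class HancomSniffer {
    enum Kind { HWP_CANDIDATE, HWPX_CANDIDATE, UNKNOWN }

    // OLE2 Compound File Binary signature (container assumed for HWP 5.x).
    private static final byte[] CFB_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature (container assumed for HWPX).
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    private HancomSniffer() {}

    /** Classifies the leading bytes of a file by container signature only. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, CFB_MAGIC)) {
            return Kind.HWP_CANDIDATE;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.HWPX_CANDIDATE;
        }
        return Kind.UNKNOWN;
    }

    /** Reads the first 8 bytes of the file and classifies them; the file name is ignored. */
    static Kind sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(CFB_MAGIC.length));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

Because only a few bytes are read, a check like this is cheap enough to run before the slower structural parse and lets clearly unrelated files be rejected early.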

OctopusET commented 1 year ago

It's just an idea, but what about using another library, such as Apache Tika?

deeplow commented 1 year ago

Thanks a lot @ebandal this worked!