datatractor / schema

Schemas for Metadata Extractors
https://datatractor.github.io/schema/
MIT License

Implement hints for filetype detection #1

Open PeterKraus opened 4 months ago

PeterKraus commented 4 months ago

This issue is a follow-up of https://github.com/marda-alliance/metadata_extractors_schema/issues/45.

In https://github.com/marda-alliance/metadata_extractors_schema/pull/48, we have implemented the associated_file_extensions slot in the FileType schema, to specify some metadata that can be used to match files to FileTypes.

However, further hints could be included, such as common MIME types or magic bytes (the fixed byte signature at the start of a file). This idea needs a bit of planning work.
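As a rough sketch of how such hints might be used for matching, the Python below scores a file against a list of FileType entries. Only `associated_file_extensions` is an existing slot; `mime_types`, `magic_bytes`, the `FileTypeHints` class, the scoring weights and the example ids are all assumptions made for illustration, not part of the schema.

```python
from __future__ import annotations

import mimetypes
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FileTypeHints:
    """Hypothetical detection hints attached to one FileType entry."""

    id: str
    # Existing slot in the FileType schema.
    associated_file_extensions: list[str] = field(default_factory=list)
    # Assumed slots, NOT part of the schema today.
    mime_types: list[str] = field(default_factory=list)
    magic_bytes: list[bytes] = field(default_factory=list)


def score(path: Path, hints: FileTypeHints) -> int:
    """Count how many hints agree with the file (arbitrary weights)."""
    points = 0

    # 1. File extension, compared without the dot and case-insensitively.
    ext = path.suffix.lower().lstrip(".")
    known = {e.lower().lstrip(".") for e in hints.associated_file_extensions}
    if ext and ext in known:
        points += 1

    # 2. MIME type guessed from the file name by the standard library.
    guessed, _encoding = mimetypes.guess_type(path.name)
    if guessed is not None and guessed in hints.mime_types:
        points += 1

    # 3. Magic bytes: compare the start of the file to known signatures.
    if hints.magic_bytes:
        longest = max(len(m) for m in hints.magic_bytes)
        with path.open("rb") as f:
            head = f.read(longest)
        if any(head.startswith(m) for m in hints.magic_bytes):
            points += 2  # content-based, so weighted above name-based hints

    return points


def candidate_filetypes(path: Path, registry: list[FileTypeHints]) -> list[str]:
    """Return ids of FileTypes with at least one matching hint, best first."""
    scored = [(score(path, h), h.id) for h in registry]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [filetype_id for points, filetype_id in ranked if points > 0]


# Illustrative registry; the ids and MIME strings are placeholders.
REGISTRY = [
    FileTypeHints(
        id="example-hdf5",
        associated_file_extensions=["h5", "hdf5"],
        mime_types=["application/x-hdf5"],
        magic_bytes=[b"\x89HDF\r\n\x1a\n"],  # HDF5 format signature
    ),
    FileTypeHints(
        id="example-csv",
        associated_file_extensions=["csv"],
        mime_types=["text/csv"],
    ),
]
```

Magic bytes are weighted higher here because they are read from the file content, whereas the extension and the guessed MIME type both depend on the file name alone. Generic containers (JSON, HDF5, CSV) would still match many FileTypes, so a real implementation would probably return several candidates and leave the final choice to the caller.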

See also here:

This is my use case. If someone uploads an arbitrary file to my ELN and I have a whole registry of tools to process it, the ELN still needs to figure out which tool to use. Identifying the FileType would give you the connection. Otherwise, I need to rely on the source (e.g. the user) to tell me the type.

To apply a tool, the ELN needs to figure out the FileType one way or another. This is why you ask for a FileType identifier, right? Maybe it is difficult, but if you agree that it is a valid use case, why wait for the next MaRDA WG to figure it out? I am not sure how additional information would reduce the usefulness.

Let's say we are not using the registry to identify FileTypes. The tools in the registry still need to somehow tell what their intended input FileType is. And it ought to be more specific than JSON, HDF5, CSV, etc. Why not describe the FileType by characteristics that would help to identify a file's type?

_Originally posted by @markus1978 in https://github.com/marda-alliance/metadata_extractors_schema/issues/9#issuecomment-1403190937_