TYPO3-Solr / ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality
GNU General Public License v3.0

Make supportedFileTypes in extractors extension configuration #42

Closed thomashohn closed 1 year ago

thomashohn commented 7 years ago

It would be very nice if the supportedFileTypes were not hard-coded in the extractors but were a list in the extension configuration, since some sites need to be able to configure this. I can provide a pull request for this; at the moment I have to XClass the extractors in order to control it.

timohund commented 7 years ago

@thomashohn Thanks, I think this is a good idea. It would be nice to allow a configuration that limits the file extensions that are sent to Tika. If you can provide a patch, that would be great!

thomashohn commented 7 years ago

@timohund There seem to be some TODOs about fetching this from the Tika server, but I think I would prefer to have the extension configure what to extract and what not?

irnnr commented 7 years ago

I can't quite follow here what you want to achieve. With the last release we're querying Tika for supported file types already instead of having a hard-coded list. Am I missing something? Can you point to the concrete code you're referring to?

thomashohn commented 7 years ago

The files in https://github.com/TYPO3-Solr/ext-tika/tree/master/Classes/Service/Extractor seem pretty hard-coded to me, or am I wrong?

irnnr commented 7 years ago

I still don't know what you mean

thomashohn commented 7 years ago

I see the MetaDataExtractor changed to fetch data from Tika :-) So if I would like to exclude files, do I need to configure that on the Tika server? For instance, what if I don't want it to extract meta data for images?

irnnr commented 7 years ago

It seems it's a really slow Saturday morning for me since I really can't follow you.

Of course the EXT:tika fetches meta data from Tika, what else would you expect? It's been like that since forever.

Why wouldn't you want image meta data? That's data such as width, height, exposure, camera, geo location, description...

Please describe it to me in easy language^^ :)

thomashohn commented 7 years ago

Before the new release:

```php
public function canProcess(File $file)
{
    // TODO use MIME type instead of extension
    // tika.jar --list-supported-types -> cache supported types
    // compare to file's MIME type

    return in_array($file->getProperty('extension'),
        $this->supportedFileTypes);
}
```

Here $this->supportedFileTypes was a hard-coded array. Now it's:

```php
public function canProcess(File $file)
{
    $tikaService = $this->getExtractor();
    $mimeTypes = $tikaService->getSupportedMimeTypes();

    return in_array($file->getMimeType(), $mimeTypes);
}
```

If I don't want to process, say, GIF files, would I need to configure that on the Tika server? Before, I only had to take gif out of the $this->supportedFileTypes array. So with the new version, do I have to make sure my Tika server is configured to report only the MIME types I want to process?
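
For illustration, a minimal sketch of the kind of filter being asked for here. It is not the extension's actual code: the MIME types reported by Tika are narrowed by an exclude list that would come from the extension configuration. The class name ConfigurableMimeTypeFilter, the comma-separated exclude setting, and the wildcard patterns are all assumptions made for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of EXT:tika: narrows the MIME types
 * reported by Tika with an exclude list taken from configuration.
 */
final class ConfigurableMimeTypeFilter
{
    /** @var string[] MIME types the Tika service reports as supported */
    private array $supportedMimeTypes;

    /** @var string[] shell-style patterns of MIME types to skip */
    private array $excludePatterns;

    /**
     * @param string[] $supportedMimeTypes e.g. the result of getSupportedMimeTypes()
     * @param string   $excludeList        comma-separated patterns, e.g. "image/*, application/zip"
     */
    public function __construct(array $supportedMimeTypes, string $excludeList)
    {
        $this->supportedMimeTypes = $supportedMimeTypes;
        // Split the configured string, trim whitespace, drop empty entries.
        $this->excludePatterns = array_values(array_filter(
            array_map('trim', explode(',', $excludeList)),
            static fn (string $pattern): bool => $pattern !== ''
        ));
    }

    public function canProcess(string $mimeType): bool
    {
        // Tika must support the type at all.
        if (!in_array($mimeType, $this->supportedMimeTypes, true)) {
            return false;
        }
        // Any matching exclude pattern vetoes processing.
        foreach ($this->excludePatterns as $pattern) {
            if (fnmatch($pattern, $mimeType)) {
                return false;
            }
        }
        return true;
    }
}

$filter = new ConfigurableMimeTypeFilter(
    ['image/gif', 'image/jpeg', 'application/pdf'],
    'image/gif'
);
var_dump($filter->canProcess('image/gif'));       // excluded by configuration
var_dump($filter->canProcess('application/pdf')); // still processed
```

With an empty exclude list the filter behaves like the current canProcess(), so sites that do not need the setting would see no change.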

irnnr commented 7 years ago

Ok, clear now, thanks! :)

However, it's still not clear to me why anyone would want to do that. Also, as you noticed, it had a TODO comment before :) It was a missing feature. As you mentioned, you modified the extension before; it was never something we supported. I'm not even sure Tika supports selectively enabling meta data extraction. If it does, though, that's where I'd look.

I don't think this should be, or needs to be, something EXT:tika does (for 95% of use cases).

thomashohn commented 7 years ago

If you buy images from iStock and other companies, the images contain a lot of additional meta information you don't want to extract because it will confuse your users when searching. I'll make a PR anyway since I am fixing it in my own code; then you can decide whether it should be merged into EXT:tika or not ;-)

irnnr commented 7 years ago

Hmm, IMO that's usually pretty valuable meta data. Maybe you can provide an example?

thomashohn commented 7 years ago

Hi, yes I can.

1) You have a lot of files with meta data and start to use Solr and Tika: your "old", valuable meta data will be overwritten, which is kind of annoying.
2) The meta data in the files does not match the kind of meta data you want. For an iStock photo that could be the title: you might want another title, or want to add info to the title, and this is not possible.
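
The first scenario can be sketched as a merge rule, assuming the decision is made wherever the extracted values are written back to the file's record: existing non-empty values win, and extracted values only fill gaps. The function name, field names, and sample values below are hypothetical and only serve the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical merge rule: keep curated values, let extracted values fill gaps.
 *
 * @param array<string, mixed> $existing  meta data already stored for the file
 * @param array<string, mixed> $extracted meta data returned by the extractor
 * @return array<string, mixed>
 */
function mergeKeepingExisting(array $existing, array $extracted): array
{
    $merged = $existing;
    foreach ($extracted as $field => $value) {
        $current = $existing[$field] ?? null;
        // Only fill fields that are missing or empty in the stored record.
        if ($current === null || $current === '') {
            $merged[$field] = $value;
        }
    }
    return $merged;
}

// Made-up sample data: a curated title and an empty description.
$stored   = ['title' => 'Team photo 2017', 'description' => ''];
$fromTika = ['title' => 'Stock photo title', 'description' => 'Group of people', 'width' => 4000];

print_r(mergeKeepingExisting($stored, $fromTika));
```

Under this rule the curated title survives, while the empty description and the missing width are taken from the extracted data.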

I find the PR quite realistic and it comes from a real-world scenario :-)

dkd-dobberkau commented 7 years ago

A short side note from me: I see the use case, but I'd rather discuss this with you in the new year. TYPO3 is missing a meta data manager, and therefore curation of meta data could be something that an add-on could offer.

thomashohn commented 7 years ago

Fine with me. As I said earlier in the thread, I need to make a "fix" in my own code no matter what, since we can't retrieve meta data from image files currently :-)

irnnr commented 7 years ago

Ok, I can see your use case (and your pain stemming from it) now, too.

Now here's how I see the situation: IMO EXT:tika is a pure utility to extract meta data from files, a utility that is called/used by the TYPO3 core. The tika extension does not know about any existing meta data for a file that you might want to keep. Neither does the extension offer any custom mapping.

The mapping issue can be seen as a missing feature; I believe EXT:extractor offers something like that.

However, the extension's job is simply to provide meta data to the core. On that end I agree with Olivier that what you describe is rather an issue that falls into the responsibility of the TYPO3 core.

So my suggestion would be: feel free to open another issue for meta data property mapping; that would actually be useful to have. However, knowing when to overwrite data, and in which cases, is not (currently) in the domain of EXT:tika.

Advice for filing future issues:
I had to ask multiple times to understand your issue. The easier you make it for us to understand your situation, the easier it will be for us to help you and/or agree with your issue. You should always provide as much information as possible; that saves us both a lot of time. Read through this whole convo again and I hope you will see it was not easy to understand what issue you had and why.

dkd-kaehm commented 1 year ago

Fixed in #48