TYPO3-Solr / ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality
GNU General Public License v3.0

[DISCUSSION] Drop Tika app and Solr Cell support? #14

Open irnnr opened 8 years ago

irnnr commented 8 years ago

There might be some reasons to drop support for Tika app and Solr Cell:

If we were to decide to do that, it would also result in a new major version as it is a breaking change. Nothing is set in stone or even decided yet. We're just looking for opinions for now.

timohund commented 8 years ago

@irnnr I would at least propose dropping Solr Cell support, because a lot of the features are not supported by Solr Cell.

dkd-friedrich commented 8 years ago

Though Solr Cell doesn't support all features and the local Tika app/server performs better, I think we should keep Solr Cell support. There are a lot of TYPO3 installations that have no local Java installation and therefore depend on Solr Cell (e.g. solrfal & text extraction).

LeoniePhiline commented 7 years ago

Tika App is very important for me, since I need to extract metadata of lots and lots of mp3 files (and pdfs, but these are smaller) in a TYPO3 installation. Sending these gigabytes to the solr server for metadata extraction creates much more overhead, timeouts, headaches and delay than firing up the tika app.

Therefore, please keep Tika App support! :)

By the way, I also had to add another memory-expanding argument to the tika command: -Xmx512M, to avoid the Java VM error "Could not reserve enough space for object heap".

irnnr commented 7 years ago

@LeoniePhiline thanks for your input! Can you please open a separate issue for the memory flag so that it can be taken care of?

AndreasA commented 7 years ago

I also think Solr extraction should be kept, because that way one can use solrfal if the TYPO3 server itself has no Java (or one cannot install the Tika server, etc.). Also, for most cases Solr will be enough, and one doesn't have to set up and maintain a Tika server.

EDIT: However, maybe one could document the advantages and disadvantages of the various extractor types (e.g. what doesn't work when using Solr Cell) in a .md file.

LeoniePhiline commented 7 years ago

Right now I have a nice case where with the same site on some environments I can use the tika app (jar), but on other environments I need to switch extconf['extractor'] from 'jar' to 'solr'. Works quite well.

I only had to add $GLOBALS['TYPO3_CONF_VARS']['SYS']['FileInfo']['fileExtensionToMimeType']['mp3'] = 'audio/mpeg3'; for \ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::getSupportedMimeTypes() to match. (Solr then returns the 'audio/mpeg' mimetype for the extracted mp3 file, though, so getSupportedMimeTypes() should rather be extended by adding 'audio/mpeg'.)

I also had to add, in \ApacheSolrForTypo3\Tika\Service\Extractor\MetaDataExtractor::normalizeMetaData(), a mapping of xmpDM:duration to (int)($value / 1000).
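
For illustration, the duration mapping mentioned above could look roughly like the following. This is only a minimal standalone sketch, not the actual code of normalizeMetaData(): the function name normalizeDuration() and the target key 'duration' are hypothetical, and it assumes the extractor reports xmpDM:duration in milliseconds, as the division by 1000 suggests.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of EXT:tika.
 *
 * Copies the xmpDM:duration value from the extracted metadata into a
 * 'duration' key (assumed name), converted to whole seconds.
 * Assumption: the source value is given in milliseconds.
 */
function normalizeDuration(array $metaData): array
{
    $sourceKey = 'xmpDM:duration';

    // Only map values that are present and numeric; leave everything else untouched.
    if (isset($metaData[$sourceKey]) && is_numeric($metaData[$sourceKey])) {
        // Same conversion as in the comment above: (int)($value / 1000)
        $metaData['duration'] = (int)((float)$metaData[$sourceKey] / 1000);
    }

    return $metaData;
}
```

With this sketch, normalizeDuration(['xmpDM:duration' => '215040']) would add 'duration' => 215, and metadata without the xmpDM:duration key passes through unchanged.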

EXT:extractor has nicely configurable metadata mapping (normalization) handling. There, no code change would be necessary, but EXT:extractor does not support Solr Cell, only a local Tika app or a Tika server.