Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

An Error Parsing MP4 Files #1

Closed: kalhomoud closed this issue 9 years ago

kalhomoud commented 10 years ago

It seems like a library is missing for MP4 parsing:

Exception in thread "pool-1-thread-1" INFO [FilesystemCrawler] Projects: Re-processing orphan Files (if any)...
java.lang.NoClassDefFoundError: org/aspectj/lang/Signature
    at org.apache.tika.parser.mp4.MP4Parser.parse(MP4Parser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:143)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:99)
    at com.norconex.importer.Importer.parseDocument(Importer.java:379)
    at com.norconex.importer.Importer.importDocument(Importer.java:266)
    at com.norconex.collector.fs.crawler.DocumentProcessor$ImportModuleStep.processDocument(DocumentProcessor.java:146)
    at com.norconex.collector.fs.crawler.DocumentProcessor.processURL(DocumentProcessor.java:91)
    at com.norconex.collector.fs.crawler.FilesystemCrawler.processNextQueuedFile(FilesystemCrawler.java:384)
    at com.norconex.collector.fs.crawler.FilesystemCrawler.processNextFile(FilesystemCrawler.java:310)
    at com.norconex.collector.fs.crawler.FilesystemCrawler.access$100(FilesystemCrawler.java:62)
    at com.norconex.collector.fs.crawler.FilesystemCrawler$ProcessFilesRunnable.run(FilesystemCrawler.java:545)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.aspectj.lang.Signature
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 44 more
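
For anyone hitting the same trace: the "Caused by" line says the class org.aspectj.lang.Signature could not be loaded, which usually means the jar that provides it (normally the AspectJ runtime) is not on the classpath. A quick way to confirm this might look like the sketch below. The class name ClasspathCheck and the method isClassAvailable are hypothetical helpers made up for illustration; they are not part of Importer or Tika.

```java
public class ClasspathCheck {

    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is looked up without being initialized.
     * (Hypothetical helper, not part of Importer or Tika.)
     */
    public static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but unloadable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace above.
        String name = args.length > 0 ? args[0] : "org.aspectj.lang.Signature";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

Run it with the same classpath as the crawler (for example, the same lib folder passed to java -cp). If it reports the class as unavailable, the missing jar is the likely cause rather than the MP4 file itself.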

essiembre commented 9 years ago

There is a snapshot release of Importer (2.1.0-SNAPSHOT) available for download that uses an updated version of Tika. It would be nice if you could try to reproduce with that version to confirm whether that's an issue Apache fixed already.

kalhomoud commented 9 years ago

Sorry, with all the GitHub emails I'm receiving, I must have missed this one. I will try to reproduce it now.

kalhomoud commented 9 years ago

I can confirm that this issue is no longer there with MP4 files.

INFO [CrawlerEventManager] DOCUMENT_METADATA_FETCHED: file:///Users/alhomoud/Music/small.mp4 (Subject: com.norconex.collector.fs.pipeline.importer.FileImporterPipeline$FileMetadataFetcherStage@1aba0e25)
INFO [CrawlerEventManager] DOCUMENT_FETCHED: file:///Users/alhomoud/Music/small.mp4 (Subject: com.norconex.collector.fs.pipeline.importer.FileImporterPipeline$DocumentFetchStage@29e9dad8)

essiembre commented 9 years ago

Thanks for the feedback.

essiembre commented 9 years ago

Part of 2.1.0 release.