Closed kristencheung closed 5 years ago
I'll look into this and get back to you, folks. Thank you for posting.
Hi folks, OK, so if you print the help for the crawler_launcher
tool you will see the options that let you overcome this. The new command you will be using is as follows:
```shell
./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/coal-sds-deploy/data/staging \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig /usr/local/coal-sds-deploy/data/met/tika.conf \
  --failureDir /usr/local/coal-sds-deploy/data/failure/ \
  --daemonPort 9003 \
  --daemonWait 2 \
  --successDir /usr/local/coal-sds-deploy/data/archive/ \
  --actionIds DeleteDataFile
```
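To picture what the directory-related flags configure, here is an illustrative sketch (plain shell, not OODT code) of the semantics behind `--productPath`, `--successDir`, and `--failureDir`: files are swept out of staging to the archive on a successful ingest, or set aside on failure. The `ingest_one` function is a hypothetical stand-in for the real file manager ingest call.

```shell
#!/bin/sh
# Illustrative sketch only -- the real work is done by crawler_launcher.
STAGING=./staging
ARCHIVE=./archive
FAILURE=./failure
mkdir -p "$STAGING" "$ARCHIVE" "$FAILURE"

ingest_one() {
  # placeholder: treat any non-empty file as a successful ingest
  [ -s "$1" ]
}

sweep_staging() {
  for f in "$STAGING"/*; do
    [ -e "$f" ] || continue
    if ingest_one "$f"; then
      mv "$f" "$ARCHIVE"/   # --successDir: ingested files leave staging
    else
      mv "$f" "$FAILURE"/   # --failureDir: failed files are set aside
    fi
  done
}
```

Because ingested files no longer sit in staging, the crawler never sees them again on its next pass.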
As you can see above, the final flag, --actionIds DeleteDataFile,
tells the crawler to delete the product upon successful ingest. This solves the issue for you.
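The effect of the DeleteDataFile post-ingest action can be sketched as follows (illustrative shell only, not the OODT implementation; `ingest_ok` is a hypothetical placeholder for a successful file manager ingest):

```shell
#!/bin/sh
# Illustrative only: what --actionIds DeleteDataFile amounts to.
ingest_ok() {
  [ -s "$1" ]   # pretend any non-empty file ingests successfully
}

delete_on_success() {
  if ingest_ok "$1"; then
    rm -f "$1"  # product removed from staging, so it is never re-crawled
    return 0
  fi
  return 1      # on failure the staged file is left in place
}
```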
Note that I've also added a few other flags. We can discuss more tomorrow.
I would like someone to write up a document which covers the crawler component configuration and execution.
We have managed to configure crawler_launcher to monitor data/staging; however, it cannot distinguish files that have already been ingested.
We have tried using the --successDir option to move already-ingested files out of data/staging, but it does not seem to work. Any suggestions?