capstone-coal / coal-sds

An Apache OODT-powered Science Data System for COAL
Apache License 2.0

Crawl Controller Daemon repeatedly ingests already-ingested files #25

Closed: kristencheung closed this issue 5 years ago

kristencheung commented 5 years ago

We have managed to configure crawler_launcher to monitor data/staging; however, it is unable to distinguish files that have already been ingested.

./crawler_launcher --filemgrUrl http://localhost:9000 --operation --launchMetCrawler --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory --productPath /usr/local/coal-sds-deploy/data/staging --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor --metExtractorConfig /usr/local/coal-sds-deploy/data/met/tika.conf --daemonPort 8000 -dw 5

We have tried using the --successDir option to move already-ingested files out of data/staging, but it does not seem to have any effect. Any suggestions?
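
For anyone reproducing this, the launcher's full option set can be printed from the tool itself; a minimal sketch, assuming the standard help flag of the OODT CLI (the exact spelling, and whether running with no arguments also prints usage, may vary between OODT releases):

./crawler_launcher --help   # assumption: standard help flag; lists the supported operations and options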

lewismc commented 5 years ago

I'll look into this and get back to you folks. Thank you for posting.

lewismc commented 5 years ago

Hi folks, if you print the help for the crawler_launcher tool you will see the options that let you overcome this. The new command you will be using is as follows:

./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation \
  --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/coal-sds-deploy/data/staging \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig /usr/local/coal-sds-deploy/data/met/tika.conf \
  --failureDir /usr/local/coal-sds-deploy/data/failure/ \
  --daemonPort 9003 \
  --daemonWait 2 \
  --successDir /usr/local/coal-sds-deploy/data/archive/ \
  --actionIds DeleteDataFile

As you can see above, the final --actionIds DeleteDataFile option states that the data file will be deleted upon successful ingest, which solves the issue for you. Note that I've also added a few other flags (for example --failureDir and --successDir, plus a different --clientTransferer and --daemonPort); we can discuss those further tomorrow. I would like someone to write up a document covering crawler component configuration and execution.
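
As a starting point for that write-up, here is a quick sanity check of the new command; a minimal sketch, where the test file name is hypothetical, the GenericFile product type is an assumption taken from the default File Manager policy, and filemgr-client is the stock File Manager command-line tool:

# drop a hypothetical test file into the monitored staging area
cp /tmp/test-granule.tif /usr/local/coal-sds-deploy/data/staging/

# wait for at least one crawl cycle (--daemonWait 2 above)
sleep 5

# the test file should no longer be sitting in staging
ls /usr/local/coal-sds-deploy/data/staging/

# assumption: GenericFile is the product type in use; the reported count
# should have increased by exactly one after the crawl
./filemgr-client --url http://localhost:9000 --operation --getNumProducts --productTypeName GenericFile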