WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

wct 3.0.3 does not propagate the operatorContactUrl to Heritrix 3 #38

Closed: vitezg closed this issue 2 years ago

vitezg commented 3 years ago

We've set up WCT and set the operator contact URL in the profile, but this value does not seem to propagate to the Heritrix job configuration. I've attached four screenshots. Any idea what the problem could be?

[Screenshots: screenshot, screenshot1, screenshot2, screenshot3]

obrienben commented 3 years ago

Hi @vitezg, you don't need to explicitly define the OPERATOR_CONTACT_URL placeholder in the User Agent Prefix field. It is automatically appended to the user agent string. For example: [image]

vitezg commented 3 years ago

WCT seems to set up crawler-beans.cxml properly (first screenshot), but Heritrix seems to get a different configuration (second screenshot). I was able to set the variables as you described, thanks a lot for that, but the issue seems to lie somewhere else.

[Screenshots: screenshot3, screenshot2]

obrienben commented 3 years ago

Hi @vitezg, which version are you running where you see this issue?

vitezg commented 3 years ago

It's 3.0.3 (wct-binary-3.0.3.tar.gz, as downloaded from https://github.com/WebCuratorTool/webcurator/releases). Our Heritrix is heritrix-3.4.0-20200518; I will try to update it and get back to you.

vitezg commented 3 years ago

It's the same with Heritrix 3.4.0-20210621.

hannakoppelaar commented 2 years ago

Hi @vitezg, my guess is that the profile as stored by WCT is not being sent to Heritrix at all at job creation time. Can you check the logs from webcurator-webapp and webcurator-harvest-agent-h3 right after a WCT target instance is run? Maybe that will give us a clue as to what's going on.

vitezg commented 2 years ago

Hello @hannakoppelaar, thanks a lot for asking for the logs, you have guided me to the solution :) The problem was that WCT was running under a different user ID than Heritrix, so it could not write to the Heritrix job directory. Running both programs under the Heritrix account solved all the issues, and the first harvest job just finished!
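
For anyone hitting the same symptom, one quick way to check for this kind of ownership mismatch is to ask, from the account that runs WCT, whether the Heritrix job directory is writable. The following is only a minimal sketch in Python: `describe_write_access` is a hypothetical helper name, not part of WCT or Heritrix, and the directory you pass in is whatever job directory your own Heritrix installation uses.

```python
import os
import pwd


def _user_name(uid: int) -> str:
    """Return the account name for a uid, or the numeric uid if it has no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def describe_write_access(directory: str) -> dict:
    """Report who owns `directory` and whether the current user can create files in it.

    Hypothetical helper for illustration only. Run it as the same account
    that runs the process which must write to the directory.
    """
    st = os.stat(directory)  # raises FileNotFoundError if the directory is missing
    uid = os.getuid()        # os.access() checks against the real uid, so report that one
    return {
        "directory": directory,
        "owner_uid": st.st_uid,
        "owner": _user_name(st.st_uid),
        "running_as": _user_name(uid),
        # Creating files in a directory needs both write and search (execute) permission.
        "writable": os.access(directory, os.W_OK | os.X_OK),
    }
```

Calling `describe_write_access(...)` with the path of your job directory and getting `"writable": False`, with `owner` different from `running_as`, would point to the same kind of mismatch described above.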

hannakoppelaar commented 2 years ago

That's great news @vitezg! :)