WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

Bug fixed #36

Closed: leefrank9527 closed this 3 years ago

leefrank9527 commented 3 years ago

1. Fixed the issue: too many open files.
2. Fixed the issue: the WARC files were not copied to wayback (a sketch of the general pattern behind fixes 1 and 2 follows this list).
3. Adjusted the harvest heartbeat process to handle jobs that no longer exist on the crawler.
4. Added a shell script to build and start all components.
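For context, here is a minimal sketch of the kind of change fixes 1 and 2 describe. The class and method names are hypothetical and not taken from the actual patch; it only illustrates copying a WARC into the wayback store with try-with-resources so streams are always closed, which is the usual cure for "Too many open files".

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Illustrative sketch only; not the actual WebCurator patch.
 * Copies a harvested WARC file into the wayback store directory,
 * ensuring every stream is closed even if the copy fails, so that
 * repeated copies cannot exhaust file descriptors.
 */
public final class WarcCopySketch {

    public static void copyToWayback(Path warcFile, Path waybackStoreDir) throws IOException {
        Path target = waybackStoreDir.resolve(warcFile.getFileName());
        // try-with-resources guarantees both streams are closed on every exit path.
        try (InputStream in = Files.newInputStream(warcFile);
             OutputStream out = Files.newOutputStream(target)) {
            in.transferTo(out); // streams the file in chunks (Java 9+)
        }
    }
}
```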

hannakoppelaar commented 3 years ago

"Review in access tool" works again, that's great!

I did notice that on the queue page the Run Time, Data Downloaded and URLs fields now remain empty while a crawl is running. This may be an obscure Hibernate issue; I remember having dealt with it before. The problem does not occur for me on master.

obrienben commented 3 years ago

I'm also seeing the empty crawl stats for running harvests.

leefrank9527 commented 3 years ago

@hannakoppelaar @obrienben For the empty "Data Downloaded" and "URLs" fields, I found the cause is the Heritrix crawler. Sometimes Heritrix accepts a job but does nothing; when it runs a job successfully, "Data Downloaded" and "URLs" are reported normally. I'm trying to figure out what is going wrong in Heritrix.

obrienben commented 3 years ago

@leefrank9527 Here is the behavior I saw when testing the problem: the crawls I started had blank stats, but when I left them for a while and then stopped them, they completed successfully with downloaded data. Once they moved into the harvested state, the stats (Data Downloaded, URLs, etc.) were displayed.

obrienben commented 3 years ago

Replaced by PR #39