WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

V3.2.0: screenshot merged #78

Closed leefrank9527 closed 4 months ago

leefrank9527 commented 1 year ago

I have rebased the screenshot feature onto 3.1.3 and run an integration test with pywb for screenshots of harvested content.

leefrank9527 commented 1 year ago

@hannakoppelaar @obrienben I updated the screenshot feature as follows: 1) Migrated the Java version of the screenshot tool into the Store component as a built-in screenshot tool. 2) Refactored some of the configuration items in application.properties of the Store component. 3) Added or changed the test cases related to the screenshot feature.

hannakoppelaar commented 1 year ago

@leefrank9527 @obrienben

The pywb banner is visible in the archive screenshot. Shouldn't that be removed or is this how it's supposed to work?

Screenshot 2023-03-15 at 12-30-55 WEB CURATOR TOOL Target Instances

Update @leefrank9527, @obrienben: I was using pywb version 2.6.7 when I tested this.

leefrank9527 commented 1 year ago

@hannakoppelaar @obrienben : I've integrated the screenshot with multiple *waybacks, and updated the v3.1.3/screenshot-merged branch. The changes are:

  1. Integrated with pywb and openwayback. Tested the index and screenshot features with pywb 2.7.3, pywb 2.6.7, pywb 2.3.0 and openwayback 2.4.0.
  2. Removed the banner from the pywb replay using a new approach. A template HTML page is used to request the pure payload of the harvested site, so the approach doesn't depend on the structure of the full HTML returned by pywb.
  3. For openwayback, the screenshot event waits until the index entries for the seed URLs are available.
  4. Simplified the timestamp extractor. The timestamps of URLs are now extracted from the WARC files instead of from CDX files.
  5. application.properties of store component.
  6. Fixed some tiny issues and added .
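To illustrate the timestamp extraction mentioned in the list above, here is a minimal sketch rather than the code in this branch. It assumes the timestamp is taken from the `WARC-Date` header of a record, which the WARC format stores as a UTC ISO 8601 value, and converts it to the 14-digit form used in wayback replay URLs. The class and method names (`WarcTimestamp`, `toWaybackTimestamp`) are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the WCT code base.
final class WarcTimestamp {
    // 14-digit timestamp (yyyyMMddHHmmss) in UTC, as used in wayback replay URLs.
    private static final DateTimeFormatter WAYBACK_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    // Assumption: warcDate is the value of a WARC-Date header,
    // e.g. "2023-03-15T12:30:55Z".
    static String toWaybackTimestamp(String warcDate) {
        Instant captured = Instant.parse(warcDate.trim());
        return WAYBACK_FORMAT.format(captured);
    }
}
```

Reading the value straight from the WARC record means the screenshot step would not need to wait for, or parse, a CDX file just to learn the capture time.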

One point worth mentioning is how WARC files are put into pywb. Unlike with OpenWayback, the wb-manager command is used to deposit the WARC files into pywb. This is the approach recommended by the pywb group, and the indexing is processed and loaded synchronously. It differs from what is described in the documentation: https://webcuratortool.readthedocs.io/en/latest/guides/wayback-integration-guide.html
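As a rough sketch of what such a deposit could look like from Java (not necessarily how the Store component does it), the snippet below runs `wb-manager add <collection> <warc>` as an external process. The class name `PywbDeposit`, its methods, and the assumption that `wb-manager` is on the PATH and is run from the directory holding pywb's `collections` folder are illustrative only.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the WCT code base.
final class PywbDeposit {
    // Builds the command line: wb-manager add <collection> <warc file>
    static List<String> buildAddCommand(String collection, Path warcFile) {
        return List.of("wb-manager", "add", collection, warcFile.toString());
    }

    // Runs wb-manager in pywbHome (assumed to contain the "collections" directory)
    // and blocks until it exits, returning the process exit code.
    static int deposit(Path pywbHome, String collection, Path warcFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildAddCommand(collection, warcFile))
                .directory(pywbHome.toFile())
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Because the call blocks until wb-manager exits, the caller can treat a zero exit code as the signal that the WARC file has been copied and indexed before a screenshot is requested.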

hannakoppelaar commented 1 year ago

Maybe it's a good idea to add the necessary changes to https://webcuratortool.readthedocs.io/en/latest/guides/wayback-integration-guide.html#pywb-configuration and https://webcuratortool.readthedocs.io/en/latest/guides/system-administrator-guide.html#digital-asset-store-application-properties in this PR?

hannakoppelaar commented 1 year ago

It looks okay, I did find a minor issue: even though the application manages to create screenshots, in the log file it complains that it has an "Unrecognised argument 'native'. Cannot generate screenshot." It seems that the switch statement in SeleniumScreenshotCapture.java is missing a 'native' case?
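To make the suspected cause concrete, here is a simplified sketch of a switch that would produce this kind of message. It is not the actual code of SeleniumScreenshotCapture.java: the class name, the method, and the `external` case are hypothetical placeholders, and only the `native` value comes from the log message.

```java
// Hypothetical illustration, not the real SeleniumScreenshotCapture code.
final class ScreenshotArgCheck {
    // Returns true when the argument is handled by an explicit case.
    static boolean isRecognised(String arg) {
        switch (arg) {
            case "native":   // the case the log message suggests is missing
            case "external": // placeholder standing in for the existing cases
                return true;
            default:
                // Without a "native" case, that value would fall through to here
                // and trigger the "Unrecognised argument" log line.
                return false;
        }
    }
}
```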

leefrank9527 commented 1 year ago

@hannakoppelaar @obrienben I have fixed the issue and updated the docs related to the screenshot feature.