VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu

Force run filter-media to recreate new thumbnails for all items #685

Closed alawvt closed 4 years ago

alawvt commented 4 years ago

Continues Issues #663 and #683

Force run filter-media to recreate new thumbnails (but not new text index files) for all items. This will standardize the thumbnail size at the new maximum dimensions, 100px x 100px, and replace those thumbnails that were badly created with the old thumbnail filter (PDFs in landscape format and those created with 'Microsoft Print to PDF').

alawvt commented 4 years ago

I have started running

sudo -u vtechworks /dspace/bin/dspace filter-media -p "ImageMagick Image Thumbnail","ImageMagick PDF Thumbnail" -f -i 10919/5524

or with logging

sudo -u vtechworks /dspace/bin/dspace filter-media -p "ImageMagick Image Thumbnail","ImageMagick PDF Thumbnail" -f -i 10919/23913 | tee thumbnails/10919_23913.txt

to force thumbnail creation, one top-level community at a time. It generates thumbnails at roughly 25 items/minute for items with one file. (ETDs will be slower, since they often have multiple files per item.) Once a community is done, I check it to spot any missing thumbnails or quality concerns.
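The per-community runs above could be wrapped in a small script. This is only a sketch: the function names, the `DRY_RUN` switch, and the `LOG_DIR` variable are assumptions made for illustration, while the `filter-media` command, its flags, and the log naming pattern are the ones used in the commands above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: force thumbnail regeneration for several handles, one log file per handle.
# DRY_RUN, LOG_DIR and the function names are illustrative assumptions.

DSPACE_BIN="/dspace/bin/dspace"
PLUGINS="ImageMagick Image Thumbnail,ImageMagick PDF Thumbnail"
LOG_DIR="thumbnails"

# Turn a handle such as 10919/23913 into thumbnails/10919_23913.txt
log_path_for_handle() {
  printf '%s/%s.txt\n' "$LOG_DIR" "$(printf '%s' "$1" | tr '/' '_')"
}

# Run (or, with DRY_RUN=1, just print) the forced thumbnail filter for one handle.
run_thumbnails() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'sudo -u vtechworks %s filter-media -p "%s" -f -i %s | tee %s\n' \
      "$DSPACE_BIN" "$PLUGINS" "$1" "$(log_path_for_handle "$1")"
  else
    mkdir -p "$LOG_DIR"
    sudo -u vtechworks "$DSPACE_BIN" filter-media -p "$PLUGINS" -f -i "$1" \
      | tee "$(log_path_for_handle "$1")"
  fi
}

# Handles taken from the commands above; extend the list as needed.
for handle in 10919/5524 10919/23913; do
  run_thumbnails "$handle"
done
```

Setting `DRY_RUN=0` in the environment would execute the commands instead of printing them.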

View in ascending title order, 100 per page, https://vtechworks.lib.vt.edu/discover?rpp=100&etal=0&scope=10919/72294&group_by=none&sort_by=dc.title_sort&order=asc
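To check each community after its run, the same listing URL can be generated for any handle. A minimal sketch, assuming the query parameters in the URL above work unchanged for other scopes; `discover_url` is a hypothetical helper name.

```shell
# Build the "ascending title order, 100 per page" discovery URL for a handle.
# discover_url is an illustrative helper; the parameters are copied from the URL above.
discover_url() {
  printf 'https://vtechworks.lib.vt.edu/discover?rpp=100&etal=0&scope=%s&group_by=none&sort_by=dc.title_sort&order=asc\n' "$1"
}

discover_url 10919/72294
```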

Done

College of Agriculture and Life Sciences (CALS) [8455] # done, checked for duplicate items
College of Architecture and Urban Studies (CAUS) [1625] # done, checked for duplicate items
College of Engineering (COE) [3697] # done, checked for duplicate items
College of Liberal Arts and Human Sciences (CLAHS) [716] # done, checked for duplicate items
College of Natural Resources and Environment (CNRE) [1323] # done, checked for duplicate items
College of Science (COS) [2363] # done, checked for duplicate items
Destination Areas (DAs) and Strategic Growth Areas (SGAs) [1376] # done, checked for duplicate items
ETDs: Networked Digital Library of Theses and Dissertations [176] # done, checked
ETDs: Virginia Tech Electronic Theses and Dissertations [33753] # done, checked
Fralin Life Sciences Institute [548] # done, checked for duplicate items
Honors College [12] # done, checked
Institute for Creativity, Arts, and Technology (ICAT) [45] # done, checked
Institute for Critical Technology and Applied Science (ICTAS) [83] # done, checked
Institute for Society, Culture and Environment (ISCE) [49] # done, checked
Pamplin College of Business [704] # done, checked for duplicate items
Research Centers [502] # done, checked for duplicate items
Student Works [675] # done, checked for duplicate items
University Administration and Governance [6362] # done, checked for duplicate items
University Libraries [1330] # done, checked for duplicate items
Virginia Cooperative Extension (VCE) [7337] # done, checked for duplicate items
Virginia-Maryland College of Veterinary Medicine (VMCVM) [508] # done, checked for duplicate items
Virginia Tech Carilion (VTC) [486] # done, checked
Virginia Tech Patents [619] # done, checked
Virginia Tech Transportation Institute (VTTI) [495] # done, checked for duplicate items
VTechWorks Archives [12541] # done, checked for duplicate items

I just spoke with @soumikgh about running the thumbnail generation on production. I will probably run larger collections overnight.

Speed measurements
10919/11041 (dissertation collection)
2020-01-16 ~5:30 pm: started.
2020-01-17 9:30 am (16 hours): 12,000 bitstreams processed = 750 bitstreams/hour.
2020-01-18 5:30 pm (48 hours): ~21,000 bitstreams processed = ~440 bitstreams/hour.
2020-01-19 5:00 pm: finished, 22,645 bitstreams processed.

10919/9291 (theses collection)
2020-01-19 5:30 pm: started.
2020-01-21 9:30 am (40 hours): 20,000 bitstreams processed = 500 bitstreams/hour.
2020-01-22 9:30 am (64 hours): ~26,000 bitstreams processed = ~410 bitstreams/hour.
2020-01-24 8:30 am (87 hours): finished, 27,381 bitstreams processed.
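The rates above are simple averages (bitstreams processed divided by hours elapsed). A small sketch of that arithmetic in shell, using integer division that rounds down; `rate_per_hour` and `hours_remaining` are hypothetical helper names, and the remaining-time estimate assumes the total bitstream count is known and the rate stays constant.

```shell
# Average throughput in bitstreams/hour: processed / elapsed hours (rounds down).
rate_per_hour() {
  echo $(( $1 / $2 ))
}

# Rough hours left: (total - processed) / rate. Assumes a constant rate.
hours_remaining() {
  echo $(( ($1 - $2) / $3 ))
}

rate_per_hour 12000 16   # dissertation collection after 16 hours
rate_per_hour 20000 40   # theses collection after 40 hours
rate_per_hour 26000 64   # theses collection after 64 hours
```

Because the average rate fell as both runs progressed, a constant-rate estimate is optimistic. For example, `hours_remaining 27381 20000 500` gives 14 for the theses collection at the 40-hour mark, yet the job was still running 24 hours later.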

alawvt commented 4 years ago

This work continues in 12 follow-up issues that investigate 352 suspected duplicates. There are also issues to investigate corrupt and missing PDFs discovered during the thumbnail creation.

I am impressed that none of the thumbnail jobs quit, even those processing >25,000 thumbnails.