alawvt closed this issue 4 years ago
I have started running
sudo -u vtechworks /dspace/bin/dspace filter-media -p "ImageMagick Image Thumbnail","ImageMagick PDF Thumbnail" -f -i 10919/5524
or with logging
sudo -u vtechworks /dspace/bin/dspace filter-media -p "ImageMagick Image Thumbnail","ImageMagick PDF Thumbnail" -f -i 10919/23913 | tee thumbnails/10919_23913.txt
to force thumbnail creation, one top-level community at a time. It seems to generate thumbnails for ~25 items/minute for items with one file. (ETDs will be slower, since they often have multiple files per item.) Once done, I am checking the community to spot any missing thumbnails or quality concerns.
To check, I view the community in ascending title order, 100 per page:
https://vtechworks.lib.vt.edu/discover?rpp=100&etal=0&scope=10919/72294&group_by=none&sort_by=dc.title_sort&order=asc
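The one-community-at-a-time runs above could be wrapped in a small loop. This is only a sketch: the variables DSPACE_BIN, DSPACE_USER, LOG_DIR and DRY_RUN and the two function names are assumptions for illustration, and the log naming simply follows the 10919/23913 -> thumbnails/10919_23913.txt pattern from the logging command. It defaults to a dry run that prints what it would do instead of invoking DSpace.

```shell
# Assumed settings (hypothetical names); override them in the environment as needed.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"
DSPACE_USER="${DSPACE_USER:-vtechworks}"
LOG_DIR="${LOG_DIR:-thumbnails}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the planned run, 0 = actually run it

# Turn a handle such as 10919/23913 into a log path such as thumbnails/10919_23913.txt.
log_path_for_handle() {
  printf '%s/%s.txt\n' "$LOG_DIR" "$(printf '%s' "$1" | tr '/' '_')"
}

# Force thumbnail regeneration for one community or collection handle.
run_filter_media() {
  log="$(log_path_for_handle "$1")"
  if [ "$DRY_RUN" = "1" ]; then
    printf 'would run: filter-media -f -i %s | tee %s\n' "$1" "$log"
  else
    mkdir -p "$LOG_DIR"
    # Same plugins and flags as the commands quoted above.
    sudo -u "$DSPACE_USER" "$DSPACE_BIN" filter-media \
      -p "ImageMagick Image Thumbnail","ImageMagick PDF Thumbnail" \
      -f -i "$1" | tee "$log"
  fi
}

# The two handles that appear in the commands above; add others as needed.
for handle in 10919/5524 10919/23913; do
  run_filter_media "$handle"
done
```

Setting DRY_RUN=0 would run the real command for each handle in turn, so each community can still be checked before adding the next handle to the list.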
Done
College of Agriculture and Life Sciences (CALS) [8455] # done, checked for duplicate items
College of Architecture and Urban Studies (CAUS) [1625] # done, checked for duplicate items
College of Engineering (COE) [3697] # done, checked for duplicate items
College of Liberal Arts and Human Sciences (CLAHS) [716] # done, checked for duplicate items
College of Natural Resources and Environment (CNRE) [1323] # done, checked for duplicate items
College of Science (COS) [2363] # done, checked for duplicate items
Destination Areas (DAs) and Strategic Growth Areas (SGAs) [1376] # done, checked for duplicate items
ETDs: Networked Digital Library of Theses and Dissertations [176] # done, checked
ETDs: Virginia Tech Electronic Theses and Dissertations [33753] # done, checked
Fralin Life Sciences Institute [548] # done, checked for duplicate items
Honors College [12] # done, checked
Institute for Creativity, Arts, and Technology (ICAT) [45] # done, checked
Institute for Critical Technology and Applied Science (ICTAS) [83] # done, checked
Institute for Society, Culture and Environment (ISCE) [49] # done, checked
Pamplin College of Business [704] # done, checked for duplicate items
Research Centers [502] # done, checked for duplicate items
Student Works [675] # done, checked for duplicate items
University Administration and Governance [6362] # done, checked for duplicate items
University Libraries [1330] # done, checked for duplicate items
Virginia Cooperative Extension (VCE) [7337] # done, checked for duplicate items
Virginia-Maryland College of Veterinary Medicine (VMCVM) [508] # done, checked for duplicate items
Virginia Tech Carilion (VTC) [486] # done, checked
Virginia Tech Patents [619] # done, checked
Virginia Tech Transportation Institute (VTTI) [495] # done, checked for duplicate items
VTechWorks Archives [12541] # done, checked for duplicate items
I just spoke with @soumikgh about running the thumbnail generation on production. I will probably run larger collections overnight.
Speed measurements
2020-01-16 running 10919/11041 (dissertation collection)
From ~5:30 pm to 9:30 am (16 hours) it processed 12,000 bitstreams = 750 bitstreams/hour.
2020-01-18 5:30 pm had processed ~21,000 bitstreams / 48 hours = ~440 bitstreams/hour.
2020-01-19 5:00 pm finished, had processed 22,645 bitstreams
2020-01-19 5:30 pm started running 10919/9291 (theses collection)
2020-01-21 9:30 am (40 hours) it had processed 20,000 bitstreams = 500 bitstreams/hour.
2020-01-22 9:30 am (64 hours) it had processed ~26,000 bitstreams = ~410 bitstreams/hour.
2020-01-24 8:30 am (87 hours) it finished, had processed 27,381 bitstreams
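The hourly rates above are just bitstreams divided by elapsed hours. A minimal sketch of that arithmetic follows; the function name bitstreams_per_hour is made up for illustration, and shell integer division rounds down where the figures in the notes are rounded to the nearest ten.

```shell
# Integer bitstreams/hour from a bitstream count ($1) and elapsed hours ($2).
# Shell arithmetic truncates, so 437.5 is reported as 437.
bitstreams_per_hour() {
  echo $(( $1 / $2 ))
}

bitstreams_per_hour 12000 16   # dissertations after 16 hours: 750
bitstreams_per_hour 21000 48   # dissertations after 48 hours: 437
bitstreams_per_hour 20000 40   # theses after 40 hours: 500
bitstreams_per_hour 26000 64   # theses after 64 hours: 406
```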
This work continues in 12 issues that investigate 352 suspected duplicates. There are also issues there to investigate corrupt and missing PDFs discovered during the thumbnail creation.
I am impressed that none of the thumbnail jobs quit, even those processing >25,000 thumbnails.
Continues Issues #663 and #683
Force-run filter-media to recreate thumbnails (but not text index files) for all items. This will standardize the thumbnail size at the new maximum dimensions, 100px x 100px, and replace thumbnails that the old thumbnail filter created badly (PDFs in landscape format and those created with 'Microsoft Print to PDF').