VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu

Enable text indexing of WebVTT caption files #719

Closed: keithgee closed this 4 years ago

keithgee commented 4 years ago

This feature was previously enabled, but it seems to have been inadvertently lost during an upgrade.

alawvt commented 4 years ago

@keithgee, thank you very much for working on this. I have built this branch and can see the changed files in the VM at /dspace/config/local.cfg and /dspace/config/bitstream-formats.xml. I see the VTT format listed in the format registry, http://192.168.60.4/admin/format-registry.

I added https://vtechworks.lib.vt.edu/bitstream/handle/10919/88014/PESTEL.webm, https://vtechworks.lib.vt.edu/bitstream/handle/10919/88014/PESTEL.webm, a random text file, and a random HTML file to the item, http://192.168.60.4/handle/10919/88014. When I run sudo ./dspace filter-media -i 10919/88014 -v -f, it creates extracted-text files for the HTML and text files, but it makes no attempt to process the VTT file.

root@vtechworksvm:/dspace/bin# sudo ./dspace filter-media -i 10919/88014 -v -f
The following MediaFilters are enabled: 
Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
org.dspace.app.mediafilter.PoiWordFilter
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Full Filter Name: org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter
org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter
Full Filter Name: org.dspace.app.mediafilter.PowerPointFilter
org.dspace.app.mediafilter.PowerPointFilter
Full Filter Name: org.dspace.app.mediafilter.HTMLFilter
org.dspace.app.mediafilter.HTMLFilter
Full Filter Name: org.dspace.app.mediafilter.ExcelFilter
org.dspace.app.mediafilter.ExcelFilter
Full Filter Name: org.dspace.app.mediafilter.PDFFilter
org.dspace.app.mediafilter.PDFFilter
PROCESSING: bitstream 3373b294-a3ff-4263-ae8e-a5b7fc0b976e (item: 10919/88014)
File: ElementsMeeting20180117.txt.txt
FILTERED: bitstream 3373b294-a3ff-4263-ae8e-a5b7fc0b976e (item: 10919/88014) and created 'ElementsMeeting20180117.txt.txt'
PROCESSING: bitstream 29e7ead6-4137-4c4a-9231-11ec36d4835f (item: 10919/88014)
File: Example_Domain.html.txt
FILTERED: bitstream 29e7ead6-4137-4c4a-9231-11ec36d4835f (item: 10919/88014) and created 'Example_Domain.html.txt'

keithgee commented 4 years ago

Thank you for testing, Anne. I'm trying to sort this out now.

alawvt commented 4 years ago

Thank you, Keith.

Anne

keithgee commented 4 years ago

An update: while testing this afternoon, I made the mistake of uninstalling VirtualBox before installing the new version. My goal was to start from scratch so that I'd have the latest versions of all the software and could reproduce the issue with a new VM instead of the old one I was using.

I ran into a problem installing VirtualBox. After some investigation, the problem appears to be a system policy on permissions to install kernel extensions. I met with Desktop Services. They are rolling out JAMF in about two weeks, and this is expected to fix the issue for me. They also mentioned that this may be related to Catalina: they had been recommending that people not upgrade to Catalina (though I believe this computer may have shipped with Catalina).

So this is a temporary blocker. I may switch to working on the automated subtitles problem for a bit instead, unless there's another workaround that's easy before JAMF rolls out on my machine.

alawvt commented 4 years ago

@keithgee, thanks for the update. My computers are on High Sierra, so I can't help you with Catalina. I have not encountered this error with VirtualBox.

pmather commented 4 years ago

I'm on Mojave right now. New kernel extensions have to be explicitly allowed in System Preferences during installation or when they are first loaded. I understand that Catalina has made this even stricter: kernel extensions must be approved, and only notarised extensions may be approved. This is all part and parcel of the System Integrity Protection/Gatekeeper features in macOS.

keithgee commented 4 years ago

I received a computer yesterday with JAMF installed and was able to install VirtualBox without issue, and I reproduced the problem this morning. The latest commit on this branch fixes the issue by using the Name (HTML, Text, WebVTT caption file) from the media format registry instead of the MIME type. VTT files are now indexed.

PROCESSING: bitstream ef0b6b87-668e-4b5e-9eb2-c6f448587b32 (item: 10919/4)
File: PESTEL.vtt.txt
FILTERED: bitstream ef0b6b87-668e-4b5e-9eb2-c6f448587b32 (item: 10919/4) and created 'PESTEL.vtt.txt'
PROCESSING: bitstream cf7218cc-9d49-4e68-9e14-158cf6d8b55b (item: 10919/4)
File: junk.txt.txt
FILTERED: bitstream cf7218cc-9d49-4e68-9e14-158cf6d8b55b (item: 10919/4) and created 'junk.txt.txt'
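
The difference between matching on Name and matching on MIME type can be illustrated with a minimal Python sketch. It is not DSpace's code: FormatEntry, parse_input_formats, and accepts are hypothetical names, the text/vtt MIME type is assumed for the registry entry, and the matching rule (compare the configured list against the registry Name only) is an assumption based on the fix described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatEntry:
    """Hypothetical stand-in for one entry in the bitstream format registry."""
    name: str       # Name shown in the registry, e.g. "WebVTT caption file"
    mime_type: str  # assumed MIME type for the entry, e.g. "text/vtt"


def parse_input_formats(value: str) -> set[str]:
    """Split a comma-separated configuration value into trimmed entries."""
    return {part.strip() for part in value.split(",") if part.strip()}


def accepts(configured: set[str], fmt: FormatEntry) -> bool:
    """Assumed rule: a filter picks up a file only if its registry Name is listed."""
    return fmt.name in configured


vtt = FormatEntry(name="WebVTT caption file", mime_type="text/vtt")

# Listing the MIME type never equals the registry Name, so the file is skipped.
by_mime = parse_input_formats("HTML, Text, text/vtt")
# Listing the registry Name matches, so the file is processed.
by_name = parse_input_formats("HTML, Text, WebVTT caption file")

print(accepts(by_mime, vtt))
print(accepts(by_name, vtt))
```

Under these assumptions, the first configuration silently skips the VTT bitstream, which is consistent with the earlier run that processed only the text and HTML files.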

Sorry, I should have tested better initially to make sure that the old configuration change worked with the newer version of the software. It didn't; this does. @alawvt Please squash my changes to hide my shameful mistakes before merging.

alawvt commented 4 years ago

I recreated this branch as webvtt with one commit and merged it. @keithgee, thank you very much for enabling WebVTT indexing. It's great that your new computer is up and running.