Editing VTT breaks INDEXTRANSCRIPT

Islandora-Labs / islandora_solution_pack_oralhistories

Adds all required Fedora objects to allow users to ingest and retrieve Oral Histories (video/audio) files through the Islandora interface

GNU General Public License v3.0


Editing VTT breaks INDEXTRANSCRIPT #151

Open DonRichards opened 4 years ago

DonRichards commented 4 years ago

When first uploaded, the transcripts were correct. After editing the file, the module creates empty INDEXTRANSCRIPT files. Regenerating INDEXTRANSCRIPT also results in an empty (47 B) file.

Here is the transcript. I've tried to remove any special characters but it still seems broken.

MarcusBarnes commented 4 years ago

@DonRichards Thanks for reporting. Next step is for me to reproduce. Thanks for your patience while I try to work this task into my work schedule.

DonRichards commented 4 years ago

@MarcusBarnes Are you able to reproduce the error?

DonRichards commented 4 years ago

Ping @MarcusBarnes

MarcusBarnes commented 4 years ago

@DonRichards I've been away. I'll look into this this week. Would you please clarify how the VTT was edited?

DonRichards commented 4 years ago

Here's an example of a transcript that is failing. It works upon ingest but it fails when the editor is used. [Attachments: test.vtt.txt; Screen Shot 2019-10-22 at 11 25 11 AM]

DonRichards commented 4 years ago

One discrepancy you might have noticed between the screenshot and the VTT file is the closing </v> tag. I've tried it both ways with no luck.

DonRichards commented 4 years ago

I tried regenerating the INDEXTRANSCRIPT file but it creates a blank (47 B) file.

MarcusBarnes commented 4 years ago

@DonRichards Would you please confirm that WEBVTT was used for the transcript datastream when creating the initial oral history object? That is, you did not use transcript XML for the transcript datastream and then have WebVTT generated from the transcript XML?

MarcusBarnes commented 4 years ago

@DonRichards I was able to reproduce the behaviour you reported. I've labeled this as a bug. I'll note that the screenshot you shared in https://github.com/Islandora-Labs/islandora_solution_pack_oralhistories/issues/151#issuecomment-545018452 is not the default that ships with the solution pack, but that the issue is not related to that customization.

DonRichards commented 4 years ago

I uploaded a WebVTT file as the transcript when I ingested the object. [Attachment: Screen Shot 2019-10-22 at 1 03 14 PM] Sorry, I wrote this and didn't click the green button. >:-|

MarcusBarnes commented 4 years ago

@DonRichards Thank you for confirming.

MarcusBarnes commented 4 years ago

@DonRichards For the example object above, please grab the text file below, remove the .txt extension (so that the file name is unixlf.vtt), and then replace the TRANSCRIPT datastream with this file via the manage datastreams interface. Please do not otherwise open or edit the file.

unixlf.vtt.txt

After the TRANSCRIPT datastream has been replaced, click the regenerate operation for the INDEXTRANSCRIPT datastream.

Please let me know if you get the 47 B file (as per https://github.com/Islandora-Labs/islandora_solution_pack_oralhistories/issues/151#issuecomment-545022028) for the INDEXTRANSCRIPT datastream or not.

DonRichards commented 4 years ago

Doing those steps does fix the issue.

DonRichards commented 4 years ago

Is this to identify whether any \r or \n characters are the issue?

MarcusBarnes commented 4 years ago

@DonRichards Correct. It seems that the parse_vtt function is breaking on the CR \r characters.
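One way to check whether a VTT file carries carriage returns before re-uploading it is to count its line-ending styles. The following is a minimal sketch in Python and is not part of the solution pack; `count_line_endings` is a hypothetical helper name, and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR, and lone LF line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Subtract the CRLF pairs so each line terminator is counted once.
    return {
        "crlf": crlf,
        "cr": data.count(b"\r") - crlf,
        "lf": data.count(b"\n") - crlf,
    }


# Illustrative sample with Windows (CRLF) line endings.
sample = b"WEBVTT\r\n\r\n00:00:01.000 --> 00:00:04.000\r\n<v Speaker>Hello\r\n"
print(count_line_endings(sample))
```

For a real file, read it in binary mode (for example `open("view.vtt", "rb").read()`) so that Python does not translate the line endings before they are counted. A non-zero "crlf" or "cr" count suggests the file would hit the behaviour described in this issue.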

DonRichards commented 4 years ago

@MarcusBarnes I wonder why the module is generating a \r character instead of the typical \n. It should be easy enough to sanitize this.
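A sanitizing step of that kind could look like the sketch below. It is written in Python purely for illustration: the module itself is not written in Python, and `split_cues` is a hypothetical toy splitter, not the real parse_vtt. The sketch assumes a parser that separates cues on a blank LF-only line, which is one plausible way CRLF input could leave a parser with nothing usable.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so the lone-CR pass does not double the newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_cues(text: str) -> list:
    """Toy splitter (hypothetical): blocks are separated by a blank LF line."""
    return [block for block in text.split("\n\n") if block.strip()]


# Made-up transcript with Windows (CRLF) line endings.
crlf_vtt = (
    "WEBVTT\r\n\r\n"
    "00:00:01.000 --> 00:00:04.000\r\n<v Speaker>Hello</v>\r\n\r\n"
    "00:00:05.000 --> 00:00:08.000\r\n<v Speaker>Goodbye</v>\r\n"
)

raw_blocks = split_cues(crlf_vtt)  # "\r\n\r\n" never contains "\n\n"
clean_blocks = split_cues(normalize_newlines(crlf_vtt))
print(len(raw_blocks), len(clean_blocks))
```

With the CRLF input the toy splitter sees one undivided block, while the normalized text splits into the header and two cues. Whether parse_vtt fails for the same reason is an assumption that the thread does not confirm.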

DonRichards commented 4 years ago

@MarcusBarnes What were the steps you took to strip out those characters? I've run a few tests (replacing \n with \r, and trying \r\n) with no luck.

MarcusBarnes commented 4 years ago

@DonRichards I opened the sample VTT you provided in my text editor. My text editor (currently BBEdit) has the option of changing line ending characters from Windows (CRLF) to Unix (LF). If you're working on Windows, Notepad++ provides similar functionality. After changing the line ending settings, I saved.

DonRichards commented 4 years ago

I got it. Thanks. For the sake of posterity, in case others come across this issue before it gets resolved, I think the fix is easy enough. Steps to work around this issue:

1. Download the VTT file (for example, let's call it view.vtt).
2. Run this command against it: $ dos2unix -ic view.vtt | xargs dos2unix
3. Replace the datastream (manage > datastreams > TRANSCRIPT > Replace > Upload).
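Where dos2unix is not available, the same conversion can be sketched in a few lines of Python. This is an illustrative equivalent under stated assumptions rather than a drop-in replacement for the tool: `convert_to_unix` is a hypothetical helper, it also converts lone CR characters, and it rewrites the file in place, so keep a copy of the original.

```python
from pathlib import Path


def convert_to_unix(path: str) -> bool:
    """Rewrite the file with LF line endings; return True if it changed."""
    target = Path(path)
    original = target.read_bytes()
    # Work on bytes so Python's newline translation cannot interfere.
    converted = original.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if converted != original:
        target.write_bytes(converted)
        return True
    return False
```

Calling `convert_to_unix("view.vtt")` on the downloaded file and then replacing the TRANSCRIPT datastream follows the same workflow as the steps above.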

DonRichards commented 4 years ago

The IDE solution works as well. Sorry, I should have mentioned that too. Command-line solutions avoid the IDE configuration craziness (like working with Atom vs. Notepad++). I hope this helps.