Islandora-Labs / islandora_solution_pack_oralhistories

Adds all required Fedora objects to allow users to ingest and retrieve Oral Histories (video/audio) files through the Islandora interface
GNU General Public License v3.0

Cannot parse WebVTT file produced by InqScribe on a Mac #92

Closed McFateM closed 5 years ago

McFateM commented 7 years ago

We have produced a few WebVTT files exported from InqScribe running on a Mac, and the line endings (typically \r\n in my case) don't appear to be compatible with the module's parse_vtt() function. I am actively debugging this and trying to make that parser more generic, so I wanted to get this issue into the queue to have something to document changes against.
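One way to confirm which line-ending style a given export actually uses is to count the terminators in the raw bytes. The following is a minimal diagnostic sketch in Python, independent of the module's PHP code; the function name count_line_endings and the sample bytes are hypothetical and only illustrate the check.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR, and lone LF terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand alone.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "cr": cr, "lf": lf}


# Hypothetical sample: a tiny WebVTT file saved with Windows-style endings.
sample = b"WEBVTT\r\n\r\n00:00:01.000 --> 00:00:04.000\r\nHello\r\n"
print(count_line_endings(sample))
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function. A non-zero "crlf" or "cr" count would point to the line-ending mismatch described here.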

Natkeeran commented 7 years ago

@McFateM We have updated the parser recently. Can you please provide the last commit of your local repo?

Are the vtt files passing the validation here: https://quuz.org/webvtt/?

If you can attach some sample VTT files, that will help narrow down the issue as well.

McFateM commented 7 years ago

A ‘git log’ on my local repo reports this as the last change...

commit 88e90224821243b4da5d0bf169c9c6fd4e10e08c

Merge: 732701f fbc37b4

Author: kim pham kimpham54@users.noreply.github.com

Date: Mon Jun 5 15:19:11 2017 -0400

Merge pull request #77 from digitalutsc/issue_75

ensure to put CDATA for escape characters for INDEXMEDIATRACK

The latest .vtt I have does validate to the standard but I can’t share it here without first obfuscating some names (it’s not public yet) and I fear those edits might also alter the line endings.

I’m going to pull the latest 7.x code and see what it might do. I see there have been VTT-related changes committed lately. I’ll let you know how it goes.

As always, thanks for the quick response!

-Mark M.

McFateM commented 7 years ago

Correction: The last (first) entry in my git log is:

commit d04f0a78ad0889ae1201af7e51e9c720f27c978d

Merge: d1f61a3 70a768d

Author: Marcus Emmanuel Barnes MarcusBarnes@users.noreply.github.com

Date: Wed Jun 7 09:54:40 2017 -0400

Merge pull request #76 from digitalutsc/issue_34

Redirects to the OH object after editing transcript XML through the manage datastream interface.

McFateM commented 7 years ago

So I pulled the latest code and I see the new VTT parser, but it appears to suffer from the same issues as before. Specifically I get this back when ingesting one of my 'valid' VTTs...

Notice: Undefined offset: 1 in VttConverter->fileContentToInternalFormat() (line 22 of /var/www/drupal7/sites/default/modules/contrib/islandora_solution_pack_oralhistories/includes/lib/VttConverter.php).
Notice: Undefined offset: 1 in VttConverter->fileContentToInternalFormat() (line 22 of /var/www/drupal7/sites/default/modules/contrib/islandora_solution_pack_oralhistories/includes/lib/VttConverter.php).

The problem, again, appears to be with line endings and perhaps a few other things that are 'optional', but still valid, in the VTT specification.

I'm going to introduce my stashed changes to what is now public function fileContentToInternalFormat($file_content) and see if I can work past this.
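The WebVTT specification allows CRLF, a lone CR, or a lone LF as a line terminator, and it makes cue identifiers and a leading byte order mark optional. A parser can therefore normalize its input before splitting it into blocks. The following Python sketch illustrates that general approach; it is not the module's fileContentToInternalFormat() implementation, and the name split_vtt_blocks is hypothetical.

```python
import re

# Matches any single WebVTT line terminator: CRLF, lone CR, or lone LF.
# CRLF is listed first so a pair is never treated as two terminators.
_LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def split_vtt_blocks(text: str) -> list:
    """Split WebVTT text into blocks of lines, whatever the line-ending style."""
    # Drop an optional byte order mark at the start of the file.
    text = text.lstrip("\ufeff")
    # Normalize so the rest of the parser only ever sees "\n".
    normalized = _LINE_TERMINATOR.sub("\n", text)
    # Blocks (header, cues) are separated by one or more blank lines.
    blocks = re.split(r"\n{2,}", normalized.strip("\n"))
    return [block.split("\n") for block in blocks if block]


# Hypothetical sample: one cue with an identifier ("1") and one without.
windows_style = (
    "WEBVTT\r\n\r\n"
    "1\r\n00:00:01.000 --> 00:00:04.000\r\nHello\r\n\r\n"
    "00:00:05.000 --> 00:00:06.000\r\nBye\r\n"
)
for block in split_vtt_blocks(windows_style):
    print(block)
```

Because normalization happens first, the same input saved with Unix, Windows, or classic Mac line endings yields identical blocks. Later steps can then treat a block's first line as an identifier only when it does not contain "-->".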

Thanks.

kstapelfeldt commented 6 years ago

@McFateM - we're exploring this issue in preparation for the 7x-1.10 compatible release of this module. Do you mind creating a pull request for us to review? Thank you!

MarcusBarnes commented 6 years ago

@McFateM Were you able to get this working? Are you able to provide your solution and possibly a sample of the valid VTT file that was not being handled adequately? Thank you in advance.

MarcusBarnes commented 5 years ago

@McFateM Some changes have been made to the VTT parser. Would you please confirm whether this issue still exists for you as of commit https://github.com/Islandora-Labs/islandora_solution_pack_oralhistories/commit/65812f4f5067d9ed927bbe78d9fc01902293fef4? Thanks in advance.

McFateM commented 5 years ago

Sorry @MarcusBarnes, I can't easily test this change because we stopped using the VTTs and found a way around this shortly after this issue was posted. So I don't have any VTT files to check this with, and both of our InqScribe licenses are in use by others for the foreseeable future.

MarcusBarnes commented 5 years ago

Thanks @McFateM for the update. I'll close the issue and we can reopen it if others encounter similar challenges going forward. I suspect that the change to the VTT parser may address the issue you previously reported (but this would need to be tested and confirmed).

timtomch commented 5 years ago

We ran into a similar issue when ingesting VTT files that had been created on Windows. After investigating, it looks like the parser only accepts transcript files with Unix style line endings.

Maybe it would be useful to update the parser so that it's more tolerant of non-Unix line endings?

timtomch commented 5 years ago

@MarcusBarnes do you want to reopen this, or should I create another enhancement request?

MarcusBarnes commented 5 years ago

@timtomch, regarding your comment https://github.com/Islandora-Labs/islandora_solution_pack_oralhistories/issues/92#issuecomment-477769399, I'm inclined to make this a documentation issue, explicitly stating that VTT files should have Unix-style line endings. Do you know what program was used to make the VTT files on Windows? For example, Notepad++ allows you to set the line endings. Would you be able to attach or send me an example VTT that failed for you? After confirming the behaviour (in a *nix environment), I can create an enhancement issue.
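For files already on hand, the Unix-line-ending requirement can also be met before ingest with a small conversion step. The following Python sketch is one possible workaround under that assumption and is not part of the module; the function name to_unix_line_endings and the example path are hypothetical.

```python
from pathlib import Path


def to_unix_line_endings(path: str) -> bool:
    """Rewrite the file at `path` with LF line endings; return True if it changed."""
    file_path = Path(path)
    original = file_path.read_bytes()
    # Replace CRLF first so a CRLF pair does not turn into two LFs,
    # then convert any remaining lone CRs.
    converted = original.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if converted == original:
        return False
    file_path.write_bytes(converted)
    return True


# Example call with a hypothetical path:
# to_unix_line_endings("transcript.vtt")
```

Working on raw bytes avoids having to decode the file. This is safe for UTF-8, the encoding WebVTT requires, because the CR and LF byte values never occur inside a multi-byte character.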

timtomch commented 5 years ago

Hi @MarcusBarnes. That's fine with me. You can use this file for testing. It's the "flying farmer" sample VTT file from the OH testing objects repo with the line endings converted to Windows style.