Natkeeran commented 7 years ago

What does this Pull Request do?

This PR addresses this issue: https://github.com/digitalutsc/islandora_solution_pack_oralhistories/issues/58. It saves the values as CDATA, thus ensuring that a valid xml element is formed.

How should this be tested?

Get the PR
Edit an xml datastream with content as described in the issue
Verify that datastream is saved and rendered back as expected
Verify the derivatives are created as expected

kimpham54 commented 7 years ago

@Natkeeran this is a workable solution, however I'm concerned that if an administrator views the xml datastream that they would find the CDATA elements confusing. How about using something like an html entity encoder? http://php.net/manual/en/function.html-entity-decode.php. Just a suggestion.

MarcusBarnes commented 7 years ago

@kimpham54 I'd say the approach that @Natkeeran has taken is preferred given that we can't make too many assumptions about what the kind of characters will be added to the transcript cues. For example, I could imagine the module being used for preserving technical lectures in computer science. I would hope that only more advanced users would be editing the datastream directly and thus would recognize CDATA sections. We could document this in order to make it clearer if confusion is a major concern.

kimpham54 commented 7 years ago

@MarcusBarnes @Natkeeran I think documenting is fine, CDATA is a sound approach.

However I am a little concerned because I can imagine a scenario where you may have multiple transcript editors. Some editors may want to use software to create transcripts, some may be editing the file through the Transcript UI interface. If this is happening simultaneously, or if this requires someone to export the transcript, edit it, then replace the datastream, we need to make sure the transcript being accessed will be consistent across the board.

When I tested the following scenario:

Upload an oral history with XML transcript
Edit the transcript through the interface
Download the TRANSCRIPT datastream
Edit the TRANSCRIPT, adding the following cues:

<cue>
    <speaker><![CDATA[Test Speaker one ]]></speaker>
    <start><![CDATA[2]]></start>
    <end><![CDATA[4]]></end>
    <transcript><![CDATA[ and there are potatoes to be looked after.]]></transcript>
    <translation_fr><![CDATA[et il y a des pommes de terre à soigner.]]></translation_fr>
    <translation_ch><![CDATA[有土豆要照顾。]]></translation_ch>
    <translation_hi><![CDATA[और आलू की देखभाल करने के बाद।]]></translation_hi>
  </cue>

  <cue>
    <speaker>Test Speaker 6</speaker>
    <start>6</start>
    <end>8</end>
    <transcript>and there are potatoes to be looked after.</transcript>
    <translation_fr>t il y a des pommes de terre à soigner.</translation_fr>
    <translation_ch>有土豆要照</translation_ch>
    <translation_hi>और आलू की देखभाल करने के बाद।</translation_hi>
  </cue>

Upload it back
Edit transcript through interface

Result: Before step 5, the cue without the CDATA remained without CDATA, so what I gather is that transcripts will only have CDATA elements applied if they're edited through the interface. Should we be concerned that not all transcripts will have CDATA elements? One option is to apply CDATA elements on ingest, not just if they were edited through the interface, then we'd make sure all transcripts ingested in the repository will look consistent. Either way, I'm fine if this behaviour is documented and am fine with merging this PR.

Natkeeran commented 7 years ago

@kimpham54 CDATA is a standard way to handle special characters in XML files. We should recommend that users use CDATA to put in values. Generally, XML libraries and tools will know how to handle CDATA.

If the user edits a transcript via the interface, all the xml tier values will have CDATA, i.e below:

<?xml version="1.0" encoding="utf-8"?>
<cues>
  <cue>
    <speaker><![CDATA[Different Speaker]]></speaker>
    <start><![CDATA[0.000]]></start>
    <end><![CDATA[30.0]]></end>
    <transcript><![CDATA[aasd&&sss]]></transcript>
    <translation><![CDATA[This is the annotation content.]]></translation>
  </cue>
  <cue>
    <speaker><![CDATA[Speaker2]]></speaker>
    <start><![CDATA[30.0]]></start>
    <end><![CDATA[60]]></end>
    <transcript><![CDATA[speaker 2 transcript]]></transcript>
    <translation><![CDATA[speakder 2 transcript translation]]></translation>
  </cue>
  <cue>
    <speaker><![CDATA[Speaker 3]]></speaker>
    <start><![CDATA[60.0]]></start>
    <end><![CDATA[105]]></end>
    <transcript><![CDATA[speaker 3 transcript]]></transcript>
    <translation><![CDATA[speadker 3 transcript translation]]></translation>
  </cue>
</cues>

Yes, if some files are edited via interface and some are not. There will be some inconsistency. But, I don't think this is a major issue, because the xml would still parsable in a standard compliant way.

@MarcusBarnes noted that we can document our approach, and if users pose specific issues, we can refine the solution.

MarcusBarnes commented 7 years ago

I suggest we merge. @kimpham54 Thumbs up or down?

kimpham54 commented 7 years ago

The solution works for me. Thanks

Islandora-Labs / islandora_solution_pack_oralhistories

Place values of the transcript xml as CDATA #60

What does this Pull Request do?

How should this be tested?