TheLanguageArchive / Arbil

Encoding issue in CMDI export of IMDI file on Windows if original contains non-ASCII characters #1

Closed twagoo closed 8 years ago

twagoo commented 8 years ago

On Windows 7 with Arbil 2.6.1089-stable: import the attached IMDI file and export it as CMDI. Validation fails because of encoding issues in the generated CMDI file, and opening the file in an XML editor such as Oxygen produces an error or warning.

From Alex:

alekoe@M11404319:~/projects/being_corpman/ludy/arbilcmdi$ xxd 20110519b_Drama_RCE13.cmdi | grep Christian
0000980: 2043 6872 6973 7469 616e 2042 f66b 2c20   Christian B.k, 
alekoe@M11404319:~/projects/being_corpman/ludy/arbilcmdi$ xxd 20110519b_Drama_RCE13.imdi | grep Christian
0000280: 7920 4368 7269 7374 6961 6e20 42c3 b66b  y Christian B..k
alekoe@M11404319:~/projects/being_corpman/ludy/arbilcmdi$ xxd 20160606145949.imdi  | grep Christian
0000280: 7920 4368 7269 7374 6961 6e20 42c3 b66b  y Christian B..k
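
The dumps above differ in one character: the exported CMDI stores the ö of "Bök" as the single byte f6, while both IMDI files store it as the two-byte UTF-8 sequence c3 b6. The following Java snippet is a minimal illustration that reproduces both byte patterns. It is not Arbil code: the class name EncodingBytes is made up for this example, and ISO-8859-1 is only a stand-in for the single-byte platform encoding on the Windows machine, which the issue does not name.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Render bytes as space-separated lowercase hex, roughly like xxd does.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Bök", written with an escape so this source file's own encoding does not matter.
        String name = "B\u00f6k";
        // UTF-8 encodes U+00F6 as two bytes.
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8)));      // 42 c3 b6 6b
        // ISO-8859-1 (assumed stand-in for the platform encoding) uses one byte.
        System.out.println(hex(name.getBytes(StandardCharsets.ISO_8859_1))); // 42 f6 6b
    }
}
```

A lone f6 byte is not valid UTF-8, so a file that declares encoding="UTF-8" but contains it will be rejected by a conforming XML parser.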

twagoo commented 8 years ago

<?xml version="1.0" encoding="UTF-8"?>
<METATRANSCRIPT xmlns="http://www.mpi.nl/IMDI/Schema/IMDI"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                Date="2016-04-07"
                FormatId="IMDI 3.03"
                Originator="Arbil.2.6.1075:"
                Type="SESSION"
                Version="0"
                xsi:schemaLocation="http://www.mpi.nl/IMDI/Schema/IMDI ./IMDI_3.0.xsd">
  <Session>
      <Name>20110519b_Drama_RCE13</Name>
      <Title>20110519b_Drama_RCE13</Title>
      <Date>2011-05-19</Date>
      <Description LanguageId="" Link="">Participants just attended a performance by Christian Bök, an artist doing experimental writing and reading of sound poetry. The event was organized by the Humanities Research Centre and followed by a reception. The recording takes place after the reception is finished and very few people are left in the room. The researcher goes up to a small group of people while they are already engaged in discussing.</Description>
      <MDGroup>
         <Location>
            <Continent Link="http://www.mpi.nl/IMDI/Schema/Continents.xml"
                       Type="ClosedVocabulary">Europe</Continent>
            <Country Link="http://www.mpi.nl/IMDI/Schema/Countries.xml"
                     Type="OpenVocabulary">United Kingdom</Country>
            <Region/>
            <Address>York campus</Address>
         </Location>
         <Project>
            <Name>Human Sociality and Systems of Language Use</Name>
            <Title>Human Sociality and Systems of Language Use</Title>
            <Id>HSSLU</Id>
            <Contact>
               <Name>N.J. Enfield</Name>
               <Address/>
               <Email/>
               <Organisation>Max Planck Institute for Psycholinguistics</Organisation>
            </Contact>
            <Description LanguageId="" Link="">HSSLU: a project funded by a European Research Council Starting Independent Researcher Grant from January 2010 to December 2014, awarded to Nick J. Enfield (at the Language &amp; Cognition department of the Max Planck Institute for Psycholinguistics).</Description>
         </Project>
         <Keys>
      </Keys>
         <Content>
            <Genre Link="http://www.mpi.nl/IMDI/Schema/Content-Genre.xml"
                   Type="OpenVocabulary">Discourse</Genre>
            <SubGenre Link="http://www.mpi.nl/IMDI/Schema/Content-SubGenre.xml"
                      Type="OpenVocabularyList">Conversation</SubGenre>
            <Task Link="http://www.mpi.nl/IMDI/Schema/Content-Task.xml"
                  Type="OpenVocabulary"/>
            <Modalities Link="http://www.mpi.nl/IMDI/Schema/Content-Modalities.xml"
                        Type="OpenVocabularyList"/>
            <Subject Link="http://www.mpi.nl/IMDI/Schema/Content-Subject.xml"
                     Type="OpenVocabularyList"/>
            <CommunicationContext>
               <Interactivity Link="http://www.mpi.nl/IMDI/Schema/Content-Interactivity.xml"
                              Type="ClosedVocabulary">interactive</Interactivity>
               <PlanningType Link="http://www.mpi.nl/IMDI/Schema/Content-PlanningType.xml"
                             Type="ClosedVocabulary">spontaneous</PlanningType>
               <Involvement Link="http://www.mpi.nl/IMDI/Schema/Content-Involvement.xml"
                            Type="ClosedVocabulary">no-observer</Involvement>
               <SocialContext Link="http://www.mpi.nl/IMDI/Schema/Content-SocialContext.xml"
                              Type="ClosedVocabulary"/>
               <EventStructure Link="http://www.mpi.nl/IMDI/Schema/Content-EventStructure.xml"
                               Type="ClosedVocabulary">Conversation</EventStructure>
               <Channel Link="http://www.mpi.nl/IMDI/Schema/Content-Channel.xml"
                        Type="ClosedVocabulary">Face to Face</Channel>
            </CommunicationContext>
            <Languages>
               <Description LanguageId="ISO639-3:eng" Link=""/>
               <Language>
                  <Id>ISO639-3:eng</Id>
                  <Name Link="http://www.mpi.nl/IMDI/Schema/MPI-Languages.xml"
                        Type="OpenVocabulary">English</Name>
                  <Dominant Link="http://www.mpi.nl/IMDI/Schema/Boolean.xml"
                            Type="ClosedVocabulary">true</Dominant>
                  <SourceLanguage Link="http://www.mpi.nl/IMDI/Schema/Boolean.xml"
                                  Type="ClosedVocabulary">Unspecified</SourceLanguage>
                  <TargetLanguage Link="http://www.mpi.nl/IMDI/Schema/Boolean.xml"
                                  Type="ClosedVocabulary">Unspecified</TargetLanguage>
                  <Description LanguageId="" Link=""/>
               </Language>
            </Languages>
            <Keys>
        </Keys>
            <Description LanguageId="" Link=""/>
         </Content>
         <Actors>

            <Actor>
               <Role Link="http://www.mpi.nl/IMDI/Schema/Actor-Role.xml"
                     Type="OpenVocabularyList">Researcher,Recorder</Role>
               <Name>Giovanni Rossi</Name>
               <FullName>Giovanni Rossi</FullName>
               <Code/>
               <FamilySocialRole Link="http://www.mpi.nl/IMDI/Schema/Actor-FamilySocialRole.xml"
                                 Type="OpenVocabularyList"/>
               <Languages>
                  <Description LanguageId="" Link=""/>
               </Languages>
               <EthnicGroup/>
               <Age>Unspecified</Age>
               <BirthDate>Unspecified</BirthDate>
               <Sex Link="http://www.mpi.nl/IMDI/Schema/Actor-Sex.xml"
                    Type="ClosedVocabulary"/>
               <Education>Unspecified</Education>
               <Anonymized Link="http://www.mpi.nl/IMDI/Schema/Boolean.xml"
                           Type="ClosedVocabulary">Unspecified</Anonymized>
               <Contact>
                  <Name>Giovanni Rossi</Name>
                  <Address/>
                  <Email>giorossimail@gmail.com</Email>
                  <Organisation/>
               </Contact>
               <Keys>
        </Keys>
               <Description LanguageId="" Link=""/>
            </Actor>
         </Actors>
      </MDGroup>
      <Resources>

         <MediaFile>
            <ResourceLink>file:/X:/digiteam/lamus/20110519b_Drama_RCE13.mpeg</ResourceLink>
            <Type Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Type.xml"
                  Type="ClosedVocabulary">video</Type>
            <Format Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Format.xml"
                    Type="OpenVocabulary">video/x-mpeg2</Format>
            <Size>5409355KB</Size>
            <Quality Link="http://www.mpi.nl/IMDI/Schema/Quality.xml"
                     Type="ClosedVocabulary">Unspecified</Quality>
            <RecordingConditions/>
            <TimePosition>
               <Start>Unspecified</Start>
               <End>Unspecified</End>
            </TimePosition>
            <Access>
               <Availability/>
               <Date/>
               <Owner/>
               <Publisher/>
               <Contact>
                  <Name/>
                  <Address/>
                  <Email/>
                  <Organisation/>
               </Contact>
               <Description LanguageId="" Link=""/>
            </Access>
            <Description LanguageId="" Link=""/>
            <Keys>
        </Keys>
         </MediaFile>
         <MediaFile>
            <ResourceLink>file:/X:/digiteam/lamus/20110519b_Drama_RCE13.mpg</ResourceLink>
            <Type Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Type.xml"
                  Type="ClosedVocabulary">video</Type>
            <Format Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Format.xml"
                    Type="OpenVocabulary">video/x-mpeg1</Format>
            <Size>286438KB</Size>
            <Quality Link="http://www.mpi.nl/IMDI/Schema/Quality.xml"
                     Type="ClosedVocabulary">Unspecified</Quality>
            <RecordingConditions/>
            <TimePosition>
               <Start>Unspecified</Start>
               <End>Unspecified</End>
            </TimePosition>
            <Access>
               <Availability/>
               <Date/>
               <Owner/>
               <Publisher/>
               <Contact>
                  <Name/>
                  <Address/>
                  <Email/>
                  <Organisation/>
               </Contact>
               <Description LanguageId="" Link=""/>
            </Access>
            <Description LanguageId="" Link=""/>
            <Keys>
        </Keys>
         </MediaFile>
         <MediaFile>
            <ResourceLink>file:/X:/digiteam/lamus/20110519b_Drama_RCE13.wav</ResourceLink>
            <Type Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Type.xml"
                  Type="ClosedVocabulary">audio</Type>
            <Format Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Format.xml"
                    Type="OpenVocabulary">audio/x-wav</Format>
            <Size>230220KB</Size>
            <Quality Link="http://www.mpi.nl/IMDI/Schema/Quality.xml"
                     Type="ClosedVocabulary">Unspecified</Quality>
            <RecordingConditions/>
            <TimePosition>
               <Start>Unspecified</Start>
               <End>Unspecified</End>
            </TimePosition>
            <Access>
               <Availability/>
               <Date/>
               <Owner/>
               <Publisher/>
               <Contact>
                  <Name/>
                  <Address/>
                  <Email/>
                  <Organisation/>
               </Contact>
               <Description LanguageId="" Link=""/>
            </Access>
            <Description LanguageId="" Link=""/>
            <Keys>
        </Keys>
         </MediaFile>
         <MediaFile>
            <ResourceLink>file:/X:/digiteam/lamus/20110519b_Drama_RCE13_720.mp4</ResourceLink>
            <Type Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Type.xml"
                  Type="ClosedVocabulary">video</Type>
            <Format Link="http://www.mpi.nl/IMDI/Schema/MediaFile-Format.xml"
                    Type="OpenVocabulary">video/mp4</Format>
            <Size>318865KB</Size>
            <Quality Link="http://www.mpi.nl/IMDI/Schema/Quality.xml"
                     Type="ClosedVocabulary">Unspecified</Quality>
            <RecordingConditions/>
            <TimePosition>
               <Start>Unspecified</Start>
               <End>Unspecified</End>
            </TimePosition>
            <Access>
               <Availability/>
               <Date/>
               <Owner/>
               <Publisher/>
               <Contact>
                  <Name/>
                  <Address/>
                  <Email/>
                  <Organisation/>
               </Contact>
               <Description LanguageId="" Link=""/>
            </Access>
            <Description LanguageId="" Link=""/>
            <Keys>
        </Keys>
         </MediaFile>
      </Resources>
      <References>
    </References>
  </Session>
</METATRANSCRIPT>

twagoo commented 8 years ago

Updating the Saxon dependency to the latest version may fix it (e62f1c2a85d0f43fb8ee4d1b1e9cdac57bdb78f3).

However, in a first test on Windows, exporting with CMDI conversion enabled seems to hang.

twagoo commented 8 years ago

The hang was due to a bug in the updated translation service. Updating saxon-he to 9.7.0-6 has not fixed the encoding issue.

twagoo commented 8 years ago

Fixed in 3acfece48ab5be337cf325045372671019240c36 by using an OutputStreamWriter with explicit UTF-8 encoding.
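
The commit itself is not reproduced in this thread, so the following is only a sketch of the general technique it names, not the actual Arbil change. The class name Utf8WriterSketch and the helper writeUtf8 are hypothetical. The writer is created with an explicit charset so the platform default never decides how characters are encoded.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriterSketch {
    // Hypothetical helper: writes text through an OutputStreamWriter whose
    // charset is fixed to UTF-8, so the platform default is never consulted.
    public static void writeUtf8(Path target, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("export", ".cmdi");
        String text = "<Name>Christian B\u00f6k</Name>";
        writeUtf8(out, text);
        // Decoding the file as UTF-8 gives back exactly the text that was written.
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(text)); // true
        Files.delete(out);
    }
}
```

By contrast, a writer created without a charset (for example new FileWriter(file) on Java versions before 18) falls back to the platform default encoding, which on Windows is typically a single-byte code page. That would be consistent with the f6 byte seen in the exported file.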