calvez / xcoaitoolkit

Automatically exported from code.google.com/p/xcoaitoolkit

Diacritics not showing correctly #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Please see issue 31 for the history. Back then we were getting "U" plus
four digits in place of the diacritic. Now we are getting "?".

Copied here is the content of an email chain.

I don’t know what is going on here, then. Months ago we used a record
that had Dvořák in it, and I treated that as my gold standard. When Shrey
was able to process that record and showed me proof, I felt pretty
good and let him close the issue. Now that name is coming through as
Dvo??k, and I have no idea why. Either the parameters Shrey used for that
successful test were different, or something else is going on.
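
One common way a literal "?" ends up in place of an accented letter is an encoding step that targets a character set lacking that letter and substitutes a replacement character. The Python sketch below only illustrates that mechanism. It is an assumption about the kind of failure involved, not a diagnosis of the OAI Toolkit's actual code path.

```python
# Illustration only: encode to a charset that lacks the accented letters,
# replacing anything that cannot be mapped. This is an assumed mechanism,
# not the toolkit's actual conversion code.
name = "Dvořák"

lossy = name.encode("ascii", errors="replace").decode("ascii")
print(lossy)  # Dvo??k

# Using UTF-8 on both the encoding and decoding side preserves the name.
round_trip = name.encode("utf-8").decode("utf-8")
print(round_trip == name)  # True
```

Each unmappable character becomes exactly one "?", which matches the Dvo??k pattern reported here.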

I will open a new issue.

From: Ranganathan, Sharmila 
Sent: Wednesday, March 24, 2010 11:09 AM
To: Cook, Randall
Cc: Kiraly, Peter; Bowen, Jennifer; Lindahl, David
Subject: RE: diacritics problem

I did not install the OAI Toolkit for the demo site. The following are the
parameters used in the convert script for the demo site. It uses utf8 for
-marc_encoding and ISO5426 for -char_conversion.

java -Xmx1024m -jar lib/OAIToolkit-0.6.4alpha.jar -convert -source marc
-destination_xml dest_xml -destination xml -error error -error_xml
error_xml -log log -log_detail -marc_schema schema/MARC21slim_rochester.xsd
-marc_encoding utf8 -char_conversion ISO5426 -split_size 10000
-translate_leader_bad_chars_to_zero -translate_nonleader_bad_chars_to_spaces

For the small set, I used the following parameters (same as above):
java -Xmx1024m -jar lib/OAIToolkit-0.6.3alpha.jar -convert -source marc
-destination_xml dest_xml -destination xml -error error -error_xml
error_xml -log log -log_detail -marc_schema schema/MARC21slim_rochester.xsd
-marc_encoding utf8 -char_conversion ISO5426 -split_size 10000
-translate_leader_bad_chars_to_zero -translate_nonleader_bad_chars_to_spaces
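
Since the open question is whether the two runs really used the same settings, a token-by-token comparison of the two command lines can help. The sketch below is a minimal Python example. `token_differences` is a hypothetical helper name, and the two strings are copied from the commands quoted above.

```python
import shlex
from itertools import zip_longest

# Command lines copied from the email above (demo site, then small set).
demo_cmd = """java -Xmx1024m -jar lib/OAIToolkit-0.6.4alpha.jar -convert -source marc
-destination_xml dest_xml -destination xml -error error -error_xml
error_xml -log log -log_detail -marc_schema schema/MARC21slim_rochester.xsd
-marc_encoding utf8 -char_conversion ISO5426 -split_size 10000
-translate_leader_bad_chars_to_zero -translate_nonleader_bad_chars_to_spaces"""

small_set_cmd = """java -Xmx1024m -jar lib/OAIToolkit-0.6.3alpha.jar -convert -source marc
-destination_xml dest_xml -destination xml -error error -error_xml
error_xml -log log -log_detail -marc_schema schema/MARC21slim_rochester.xsd
-marc_encoding utf8 -char_conversion ISO5426 -split_size 10000
-translate_leader_bad_chars_to_zero -translate_nonleader_bad_chars_to_spaces"""


def token_differences(first, second):
    """Return (first_token, second_token) pairs that differ by position."""
    first_tokens = shlex.split(first)
    second_tokens = shlex.split(second)
    return [
        (a, b)
        for a, b in zip_longest(first_tokens, second_tokens)
        if a != b
    ]


differences = token_differences(demo_cmd, small_set_cmd)
for demo_token, small_token in differences:
    print(f"demo: {demo_token}    small set: {small_token}")
```

For the two commands as quoted, the flags are identical and the only differing token is the jar file name (0.6.4alpha for the demo site, 0.6.3alpha for the small set).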

I am not able to look at this particular record’s predecessor in the OAI
Toolkit, since the GetRecord verb (which retrieves one particular record by
its identifier) is not working. So I looked through a few records in the OAI
Toolkit and found that it has the diacritics issue. For example, the following
record has a few "?" characters, which I have highlighted in yellow.

  <record>
  <header>
  <identifier>oai:library.rochester.edu:URVoyager1/10043</identifier>
  <datestamp>2003-10-24T16:49:06Z</datestamp>
  <setSpec>bib</setSpec>
  </header>
  <metadata>
  <oai:oai_marc xmlns:marc="http://www.loc.gov/MARC21/slim"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:oai="http://www.openarchives.org/OAI/1.1/oai_marc"
xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/oai_marc
http://www.openarchives.org/OAI/1.1/oai_marc.xsd" status="c" type="j"
level="m" ctlType="" charEnc="a" encLvl="0" catForm="" lrRqrd="">
  <oai:fixfield id="001">"25266"</oai:fixfield>
  <oai:fixfield id="003">"NRU"</oai:fixfield>
  <oai:fixfield id="005">"20031024164906.0"</oai:fixfield>
  <oai:fixfield id="007">"sd fsngnn "</oai:fixfield>
  <oai:fixfield id="008">"890901s1989 enkmsn dhi lat d"</oai:fixfield>
  <oai:varfield id="028" i1="0" i2="2">
  <oai:subfield label="a">CDGIM 019</oai:subfield>
  <oai:subfield label="b">Gimell</oai:subfield>
  </oai:varfield>
  <oai:varfield id="035" i1="" i2="">
  <oai:subfield label="b">ocm</oai:subfield>
  <oai:subfield label="a">20290897</oai:subfield>
  </oai:varfield>
  <oai:varfield id="035" i1="" i2="">
  <oai:subfield label="9">00874965</oai:subfield>
  </oai:varfield>
  <oai:varfield id="040" i1="" i2="">
  <oai:subfield label="a">RRR</oai:subfield>
  <oai:subfield label="c">RRR</oai:subfield>
  </oai:varfield>
  <oai:varfield id="041" i1="0" i2="">
  <oai:subfield label="d">lat</oai:subfield>
  <oai:subfield label="e">latengfregerita</oai:subfield>
  <oai:subfield label="h">lat</oai:subfield>
  <oai:subfield label="g">engfregerita</oai:subfield>
  <oai:subfield label="h">eng</oai:subfield>
  </oai:varfield>
  <oai:varfield id="049" i1="" i2="">
  <oai:subfield label="a">[LISTEN.] [ROOM] RRRR</oai:subfield>
  </oai:varfield>
  <oai:varfield id="099" i1="" i2="">
  <oai:subfield label="a">RCD 36</oai:subfield>
  <oai:subfield label="a">CD 957</oai:subfield>
  </oai:varfield>
  <oai:varfield id="100" i1="0" i2="">
  <oai:subfield label="a">Josquin,</oai:subfield>
  <oai:subfield label="c">des Prez,</oai:subfield>
  <oai:subfield label="d">d. 1521.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="240" i1="0" i2="0">
  <oai:subfield label="a">Missa L'Homme arm? super voces
musicales</oai:subfield>
  </oai:varfield>
  <oai:varfield id="245" i1="1" i2="2">
  <oai:subfield label="a">L'homme arm? masses</oai:subfield>
  <oai:subfield label="h">[sound recording] /</oai:subfield>
  <oai:subfield label="c">Josquin.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="260" i1="" i2="">
  <oai:subfield label="a">Oxford, England :</oai:subfield>
  <oai:subfield label="b">Gimell,</oai:subfield>
  <oai:subfield label="c">p1989.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="300" i1="" i2="">
  <oai:subfield label="a">1 sound disc :</oai:subfield>
  <oai:subfield label="b">digital, stereo ;</oai:subfield>
  <oai:subfield label="c">4 3/4 in.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="500" i1="" i2="">
  <oai:subfield label="a">Sung in Latin.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="500" i1="" i2="">
  <oai:subfield label="a">Compact disc.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="500" i1="" i2="">
  <oai:subfield label="a">Program notes in English by Peter Phillips, with
Italian, French, and German translations, and words in Latin, English,
Italian, French and German (23 p.) inserted in container.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="505" i1="0" i2="">
  <oai:subfield label="a">Anon chanson: L'homme arm? (0:37) -- Josquin des
Pr?s. Missa L'homme arm? super voces musicales (40:17) ; Missa L'homme arm?
sexti toni (33:00)</oai:subfield>
  </oai:varfield>
  <oai:varfield id="511" i1="0" i2="">
  <oai:subfield label="a">Tallis Scholars ; Peter Phillips,
conductor.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="518" i1="" i2="">
  <oai:subfield label="a">Recorded in the Church of Saint Peter and Saint
Paul, Salle, Norfolk, England.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="650" i1="" i2="0">
  <oai:subfield label="a">Masses, Unaccompanied.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="655" i1="" i2="7">
  <oai:subfield label="a">Classical Music.</oai:subfield>
  <oai:subfield label="2">local</oai:subfield>
  <oai:subfield label="5">NRU</oai:subfield>
  </oai:varfield>
  <oai:varfield id="700" i1="0" i2="2">
  <oai:subfield label="a">Josquin,</oai:subfield>
  <oai:subfield label="c">des Prez,</oai:subfield>
  <oai:subfield label="d">d. 1521.</oai:subfield>
  <oai:subfield label="t">Missa L'Homme arm? sexti toni.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="700" i1="1" i2="">
  <oai:subfield label="a">Phillips, Peter,</oai:subfield>
  <oai:subfield label="d">1953-</oai:subfield>
  <oai:subfield label="4">cnd.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="710" i1="2" i2="">
  <oai:subfield label="a">Tallis Scholars.</oai:subfield>
  <oai:subfield label="4">prf.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="730" i1="0" i2="2">
  <oai:subfield label="a">Homme arm?.</oai:subfield>
  </oai:varfield>
  <oai:varfield id="911" i1="" i2="">
  <oai:subfield label="a">p</oai:subfield>
  </oai:varfield>
  <oai:varfield id="966" i1="" i2="">
  <oai:subfield label="l">MMStck</oai:subfield>
  <oai:subfield label="m">MMAUD</oai:subfield>
  <oai:subfield label="s">RCD 36</oai:subfield>
  <oai:subfield label="b">39087010701755</oai:subfield>
  </oai:varfield>
  <oai:varfield id="966" i1="" i2="">
  <oai:subfield label="l">/Recrd</oai:subfield>
  <oai:subfield label="m">SRECRD</oai:subfield>
  <oai:subfield label="s">CD 957</oai:subfield>
  <oai:subfield label="c">1</oai:subfield>
  <oai:subfield label="b">39087011166743</oai:subfield>
  </oai:varfield>
  </oai:oai_marc>
  </metadata>
  </record>
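
Spotting the affected subfields by eye is error-prone in a record this long. The Python sketch below shows one way to flag them automatically, assuming that a "?" directly after a letter is a lost diacritic. That heuristic will also flag genuine question marks, so treat the hits as candidates only. The excerpt is cut down from the record above, and `suspect_subfields` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

OAI_MARC_NS = "http://www.openarchives.org/OAI/1.1/oai_marc"

# Short excerpt of the record above, reduced to two fields for illustration.
excerpt = """<oai:oai_marc xmlns:oai="http://www.openarchives.org/OAI/1.1/oai_marc">
  <oai:varfield id="245" i1="1" i2="2">
    <oai:subfield label="a">L'homme arm? masses</oai:subfield>
    <oai:subfield label="h">[sound recording] /</oai:subfield>
  </oai:varfield>
  <oai:varfield id="260" i1="" i2="">
    <oai:subfield label="a">Oxford, England :</oai:subfield>
  </oai:varfield>
</oai:oai_marc>"""

# Heuristic (an assumption): "?" right after a letter may be a lost diacritic.
SUSPECT = re.compile(r"[A-Za-z]\?")


def suspect_subfields(xml_text):
    """Return (field id, subfield label, text) for subfields that look damaged."""
    root = ET.fromstring(xml_text)
    hits = []
    for varfield in root.iter(f"{{{OAI_MARC_NS}}}varfield"):
        for subfield in varfield.iter(f"{{{OAI_MARC_NS}}}subfield"):
            text = subfield.text or ""
            if SUSPECT.search(text):
                hits.append((varfield.get("id"), subfield.get("label"), text))
    return hits


hits = suspect_subfields(excerpt)
for hit in hits:
    print(hit)
```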

Sharmila

From: Cook, Randall 
Sent: Tuesday, March 23, 2010 6:19 PM
To: Kiraly, Peter; Bowen, Jennifer; Ranganathan, Sharmila; Lindahl, David
Subject: RE: diacritics problem

Sharmila, do you know what parameters you used for the OAI Toolkit? Did
you use the same ones as you did with my small data set?

From: Király Péter [mailto:pkiraly@tesuji.eu] 
Sent: Tuesday, March 23, 2010 6:01 PM
To: Bowen, Jennifer; Ranganathan, Sharmila; Cook, Randall; Lindahl, David
Subject: Re: diacritics problem

Hi Jennifer,

It seems that the bad diacritics come from the MST or the OAI Toolkit.

The Vierzehn Kanons:
http://128.151.244.146:8080/MSTDemo/MARCToXCTransformation-Service/oaiRepository?verb=GetRecord&identifier=oai:mstdemo.rochester.edu:MSTDemo/MARCToXCTransformation/72047&metadataPrefix=xc

A Hungarian movie example:
http://128.151.244.146:8080/MSTDemo/MARCToXCTransformation-Service/oaiRepository?verb=GetRecord&identifier=oai:mstdemo.rochester.edu:MSTDemo/MARCToXCTransformation/366078&metadataPrefix=xc
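
Both links are OAI-PMH GetRecord requests with the same three query arguments: `verb`, `identifier`, and `metadataPrefix`. The Python sketch below shows how such a request URL can be assembled for any record identifier. `get_record_url` is a hypothetical helper, and the base URL is simply the one quoted in this email.

```python
from urllib.parse import urlencode

# Base URL taken from the links above.
BASE_URL = (
    "http://128.151.244.146:8080/MSTDemo/"
    "MARCToXCTransformation-Service/oaiRepository"
)


def get_record_url(identifier, metadata_prefix="xc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


url = get_record_url(
    "oai:mstdemo.rochester.edu:MSTDemo/MARCToXCTransformation/72047"
)
print(url)
```

`urlencode` percent-encodes the ":" and "/" characters inside the identifier, so the printed URL looks different from the links above but addresses the same record.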

Péter
----- Original Message ----- 
From: Bowen, Jennifer 
To: Kiraly, Peter 
Sent: Tuesday, March 23, 2010 5:51 PM
Subject: diacritics problem

http://128.151.244.69/drupal-6.16/node/527252 

Instead of displaying umlauts it is displaying “?”.

Jennifer

Original issue reported on code.google.com by rc...@library.rochester.edu on 24 Mar 2010 at 4:51

GoogleCodeExporter commented 9 years ago
Sometimes I cannot let things go, and this was one of them. Thank you,
Sharmila, for bearing with me today and being my software operator (perhaps
I should learn some of the skills required to process OAI records at some
point).

Short version: There appears to be no problem, and I think Shrey did something
wrong when he processed the demo site records. This data set will need to be
reprocessed, with the changes filtering through the system.

Details:
1.  Using slightly different processing parameters (the ones I copied in below),
Sharmila converted and loaded the Dvorak record. Both sets of parameters
yielded the same results, but we should determine the best set to use and put
it in the scripts.
2.  I then extracted the bib record for the “Missa L'Homme armé super voces
musicales”, which we know had the ? issue. Sharmila processed it and, happily,
it converted correctly (see below).

<oai:subfield label="a">Anon chanson: L'homme armé (0:37) -- Josquin des Prés.
Missa L'homme armé super voces musicales (40:17) ; Missa L'homme armé sexti toni
(33:00)</oai:subfield>

As opposed to

  <oai:subfield label="a">Anon chanson: L'homme arm? (0:37) -- Josquin des Pr?s.
Missa L'homme arm? super voces musicales (40:17) ; Missa L'homme arm? sexti toni
(33:00)</oai:subfield> 
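
The two versions of the subfield can also be compared mechanically to confirm that the only damage is accented characters replaced by "?". The Python sketch below does a character-by-character comparison on the first sentence of each version. `lost_characters` is a hypothetical helper, and it assumes both strings have the same length, that is, one "?" per lost character.

```python
# First sentence of the good and bad versions quoted above.
good = "Anon chanson: L'homme armé (0:37) -- Josquin des Prés."
bad = "Anon chanson: L'homme arm? (0:37) -- Josquin des Pr?s."


def lost_characters(good_text, bad_text):
    """Return (index, original character) wherever a non-ASCII char became '?'."""
    if len(good_text) != len(bad_text):
        raise ValueError("texts differ in length; compare them manually")
    return [
        (index, good_char)
        for index, (good_char, bad_char) in enumerate(zip(good_text, bad_text))
        if good_char != bad_char and bad_char == "?" and ord(good_char) > 127
    ]


lost = lost_characters(good, bad)
for index, char in lost:
    print(f"position {index}: {char!r} was replaced by '?'")

# True when replacing every non-ASCII character with "?" reproduces the bad text.
print("".join(c if ord(c) < 128 else "?" for c in good) == bad)
```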

Original comment by rc...@library.rochester.edu on 25 Mar 2010 at 8:18