buda-base / xmltoldmigration

App to migrate from TBRC XML files to BDRC RDF LD
Apache License 2.0

Outlines may have malformed descriptions #7

Closed by xristy 7 years ago

xristy commented 7 years ago

In several outlines the description type is represented in the content of the element instead of in the @type attribute, for example in O4JW333:

<outline:description type="location">vol:1,ff.1r-311r(pp.1-621)</outline:description>
<outline:description type="authorship">
t. Sarvajñādeva (k); Vidyākaraprabha; Dharmākara (ka0; Dpal gyi lhun po; r. Vidyākaraprabha; Dpal brtsegs
</outline:description>
<outline:description>sde dge number 1</outline:description>
<outline:description>stog number 1</outline:description>
<outline:description>snar thang number 1</outline:description>
<outline:description>lha sa number 1</outline:description>
<outline:description>urga number 1</outline:description>
<outline:description type="toh">1</outline:description>
<outline:description type="sde dge number">1</outline:description>

This also illustrates duplication of the sde dge number.
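
As a rough illustration of the problem, the following Python sketch lists untyped descriptions whose content carries what should be the @type value. The namespace URI, the wrapping `outline:node` element, and the function name are assumptions made for the example; none of this is code from xmltoldmigration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real one is not shown in this thread.
OUTLINE_NS = "http://example.org/outline#"

# Matches content such as "sde dge number 1" or a bare "snar thang number".
LABEL_IN_CONTENT = re.compile(
    r"^\s*(?P<label>.+?\bnumber\b)\s*(?P<value>\S.*?)?\s*$"
)


def find_malformed_descriptions(xml_text):
    """Return (label, value) pairs for untyped descriptions whose
    content holds the label that belongs in @type. value is None when
    the content has a label but no number."""
    root = ET.fromstring(xml_text)
    found = []
    for desc in root.iter("{%s}description" % OUTLINE_NS):
        if desc.get("type") is not None:
            continue  # typed descriptions are not checked in this sketch
        match = LABEL_IN_CONTENT.match(desc.text or "")
        if match:
            found.append((match.group("label"), match.group("value")))
    return found


SAMPLE = """<outline:node xmlns:outline="http://example.org/outline#">
  <outline:description type="toh">1</outline:description>
  <outline:description>sde dge number 1</outline:description>
  <outline:description>snar thang number</outline:description>
  <outline:description type="sde dge number">1</outline:description>
</outline:node>"""

print(find_malformed_descriptions(SAMPLE))
```

The sketch deliberately skips typed elements, so a case such as a description typed "location" whose content is really a catalogue number would need a separate check.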

eroux commented 7 years ago

Only 3 outlines seem to suffer from this problem. Here are the fixed versions: fixed-outlines.zip. Can you quickly check them before I (or you) upload them to eXist?

I figured it was easier to fix them directly than to write tedious code handling all the various cases, such as the spelling mistakes and the cases where the description has the wrong type, like

<outline:description type="location">lha sa number 1</outline:description>

Note that I did not fix puzzling things like

<outline:description>snar thang number</outline:description>

(with no number), that's garbage data, but well, at least the interesting data will get transferred correctly...

xristy commented 7 years ago

Excellent! I'll upload these three. Thanks so much.

xristy commented 7 years ago

Corrections uploaded

xristy commented 7 years ago

O5JW1143 also has malformed descriptions - it was a clone of O4JW333

eroux commented 7 years ago

Here's the fixed version: O5JW1143.xml.zip

eroux commented 7 years ago

A few others I just spotted:

eroux commented 7 years ago

The fixed outlines, can you upload them?

fixed-outlines.zip

xristy commented 7 years ago

I've uploaded these.

I noticed that in a number of cases there were sde dge elements but no numbers, which I assume will simply be filtered out via xml2ld.

I also noted some section names in O1PD181215 that have some sort of invalid character next to a Chinese character.

eroux commented 7 years ago

Thanks for the upload! As for what gets ignored:

<description type="sde dge number"></description>

is ignored, but

<description>sde dge number</description>

is not (hence part of the changes I made to these outlines). I didn't really look for invalid characters; do you have a node ID in which they appear?
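
A minimal model of the distinction above, assuming the converter skips a description only when its content is empty. The function name is hypothetical and this is not the actual xmltoldmigration logic, just a sketch of why the second form slips through.

```python
import xml.etree.ElementTree as ET


def is_skipped(description):
    """Assumed rule: a description is skipped only when its content is
    empty or whitespace, regardless of its @type."""
    return not (description.text or "").strip()


typed_empty = ET.fromstring('<description type="sde dge number"></description>')
untyped_label = ET.fromstring('<description>sde dge number</description>')

print(is_skipped(typed_empty))    # True: no content, so nothing to migrate
print(is_skipped(untyped_label))  # False: the label itself counts as content
```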

xristy commented 7 years ago

O1PD1812154CZ135987, O1PD1812154CZ136047, O1PD1812154CZ136352, etc. These are the section names directly under the skabs gsum pa/ rgyu mtshan nyid theg pa'i skor/ (ka-go) section.

eroux commented 7 years ago

Indeed, it's also a U+FFFD, as in the other encoding problems...
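
For locating such characters, a short Python sketch along these lines could scan the text of an outline file for U+FFFD (the Unicode replacement character) and report where each one sits. The function name and the sample input are made up for the illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode REPLACEMENT CHARACTER


def find_replacement_chars(text):
    """Return (line_number, column, line) for every U+FFFD in text.
    Line numbers and columns are 1-based."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        column = line.find(REPLACEMENT_CHAR)
        while column != -1:
            hits.append((line_number, column + 1, line))
            column = line.find(REPLACEMENT_CHAR, column + 1)
    return hits


# Made-up sample: one clean line, one with U+FFFD next to a Chinese character.
sample = "<name>ok</name>\n<name>broken \ufffd\u4e2d</name>"

for line_number, column, line in find_replacement_chars(sample):
    print(line_number, column, line)
```

To apply this to a real file, the text would need to be read as UTF-8 first. A U+FFFD found this way only shows where the damage is; the original character cannot be recovered from it.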