biosciencedbc / rdf-pubmed


Eliminate the RDF syntax errors reported by Apache Jena's riot #2

Open mitsuhashi opened 1 year ago

mitsuhashi commented 1 year ago

Eliminate the RDF syntax errors reported by Apache Jena's riot. riot is run as follows.

[yayamamo@rdfp03 bin]$ find /data/rdf_portal/data/rdf/ep/ -type f -print -exec ./riot --sink "{}" \;

TODO

Script for running riot

mitsuhashi@db01:~/yayamamo/riot$ cat riot.sh
RDFDIR=/mnt/nas05/togovar/public/virtuoso/pubmed/20230723
RIOT=./apache-jena-4.9.0-SNAPSHOT/bin/riot
export CLASSPATH=./apache-jena-4.9.0-SNAPSHOT/lib/

find $RDFDIR -type f -print -exec $RIOT --sink "{}" \;
#find $RDFDIR -type f -print -exec $RIOT

#$RIOT --validate --time --check $RDFDIR/pubmed23n0001.ttl |& grep -v WARN
#$RIOT --validate --time --check $RDFDIR/pubmed23n0001.ttl >& riot_20230725.log

mitsuhashi@db01:~/yayamamo/riot$
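
The script above prints each path with find and lets riot write its warnings after it. A minimal sketch of the same loop that quotes its variables and captures both streams in one log file could look like the following. The function name validate_dir, the log argument, and the fallback to a riot found on PATH are assumptions for illustration; the --sink option is taken from the script above, and environment setup such as CLASSPATH is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: run riot over every file under a directory and
# collect the file names (stdout) and riot's messages (stderr) in one log.
# Usage: RIOT=./apache-jena-4.9.0-SNAPSHOT/bin/riot validate_dir "$RDFDIR" riot.log
validate_dir() {
  rdfdir=$1                 # directory holding the RDF files
  log=$2                    # log file to (over)write
  riot_cmd=${RIOT:-riot}    # assumption: fall back to riot on PATH

  # -print writes the path, then riot parses the file and discards the triples.
  find "$rdfdir" -type f -print -exec "$riot_cmd" --sink {} \; >"$log" 2>&1
}
```

Writing the paths and the warnings to the same file keeps each warning next to the file it most likely belongs to, which is the layout the later log excerpts in this issue have.
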
mitsuhashi commented 1 year ago

Unicode replacement character U+FFFD in string

Part of the riot output

/mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n0798.ttl
13:42:16 WARN  riot            :: [line: 2213982, col: 20] Unicode replacement character U+FFFD in string
13:42:16 WARN  riot            :: [line: 2214060, col: 20] Unicode replacement character U+FFFD in string

Corresponding lines in the ttl file

PMID:[24973148](https://pubmed.ncbi.nlm.nih.gov/24973148/)
mitsuhashi@vs66:~$ cat /mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n0798.ttl | awk "2213982==NR && 2213982==NR { print }"
  dcterms:rights "� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.";
mitsuhashi@vs66:~$ cat /mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n0798.ttl | awk "2214060==NR && 2214060==NR { print }"
  dcterms:rights "� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.";
mitsuhashi@vs66:~$
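
Instead of looking up each reported line number one at a time, the affected lines can also be listed directly from the Turtle file. The sketch below is one way to do that, assuming the file is UTF-8 so that U+FFFD appears as the byte sequence EF BF BD; the function name fffd_line_numbers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the line number of every line in a file that
# contains U+FFFD (UTF-8 bytes EF BF BD, written here as octal escapes).
fffd_line_numbers() {
  fffd=$(printf '\357\277\275')
  # LC_ALL=C makes grep compare raw bytes; -n prefixes "<line number>:".
  { LC_ALL=C grep -n -F -- "$fffd" "$1" || true; } | cut -d: -f1
}
```

For example, fffd_line_numbers pubmed23n0798.ttl should list the same line numbers that riot reports for that file, if the assumption about the encoding holds.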

Corresponding lines in the XML file

The text is already garbled (U+FFFD) in the source XML file, so this cannot be fixed at the RDF conversion stage.

rdf_portal@vs66:~/rdf_portal-rdf/work/rdf-pubmed_download/baseline$ zgrep "2009 Asian Oceanian Association for the Study of Obesity" pubmed23n0798.xml.gz
          <CopyrightInformation>� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.</CopyrightInformation>
          <CopyrightInformation>� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.</CopyrightInformation>
          <CopyrightInformation>� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.</CopyrightInformation>
          <CopyrightInformation>� 2009 Asian Oceanian Association for the Study of Obesity . Published by Elsevier Ltd. All rights reserved.</CopyrightInformation>
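
One way to support the claim that the damage comes from the source data is to count U+FFFD occurrences in the compressed XML and in the generated Turtle and compare the two numbers. The following is a rough sketch under the assumption that both files are UTF-8; the function name count_fffd is hypothetical, and matching counts would only suggest, not prove, that the conversion introduced no new replacement characters.

```shell
#!/bin/sh
# Hypothetical helper: count occurrences of U+FFFD (UTF-8 bytes EF BF BD)
# in the data read from standard input.
count_fffd() {
  fffd=$(printf '\357\277\275')
  # grep -o prints each match on its own line, so wc -l counts occurrences;
  # "|| true" keeps a zero-match input from being treated as an error.
  { LC_ALL=C grep -o -F -- "$fffd" || true; } | wc -l | tr -d '[:space:]'
}

# Usage (paths are placeholders):
#   gzip -dc pubmed23n0798.xml.gz | count_fffd
#   count_fffd < pubmed23n0798.ttl
```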

Note that no WARN or ERROR messages other than "Unicode replacement character U+FFFD in string" were output.

mitsuhashi@db01:~/yayamamo/riot$ head -10 riot_20230726.log
/mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n1420.ttl
/mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n1419.ttl
/mnt/nas05/togovar/public/virtuoso/pubmed/20230723/pubmed23n1418.ttl
09:40:20 WARN  riot            :: [line: 1426004, col: 65] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 937] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 938] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 954] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 955] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 971] Unicode replacement character U+FFFD in string
09:40:20 WARN  riot            :: [line: 1435297, col: 972] Unicode replacement character U+FFFD in string
mitsuhashi@db01:~/yayamamo/riot$ grep -v mnt riot_20230726.log | grep -v "U+FFFD"
mitsuhashi@db01:~/yayamamo/riot$
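
To see which files are affected and how often, the log can be summarized per file. The sketch below assumes the layout shown above: a line starting with / is a path printed by find, and every WARN or ERROR line belongs to the most recently printed path. The function name summarize_riot_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<file><TAB><number of WARN/ERROR lines>" for
# every file in a riot log that has at least one such message.
summarize_riot_log() {
  awk '
    /^\// { file = $0; next }                        # path printed by find
    / WARN / || / ERROR / { if (file != "") count[file]++ }
    END { for (f in count) printf "%s\t%d\n", f, count[f] }
  ' "$1" | sort
}
```

Files that parse cleanly do not appear in the output, so an empty result would mean the log contains no warnings or errors at all.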