apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

Support more locales in Apache Jena Docker image #1998

Open kinow opened 1 year ago

kinow commented 1 year ago

Version

main

Feature

The Docker image provided with the Apache Jena code uses a multi-stage build, where the final stage (the one used to run it) is based on Alpine Linux.

In Skosmos we are upgrading the Docker setup to switch from stian's Jena image to the one included in the Apache Jena code, and also to run CI/CD tests with the Apache Jena image.

We have now run into an issue: we cannot use multiple locales in PHP because the locale command is missing. Looking at the locales available to Java and Jena [^1], I get:

/ # java Available

en_US_POSIX
en
en_US_#Latn
en_US

I believe we will be able to install extra locales in the Skosmos container, but perhaps that should be included in the Apache Jena Docker image as well, allowing users that rely on other locales (for ARQ query collation, for example) to use the container with the desired locale.

If that sounds like a good idea, we can send a pull request upstream with our solution for the locales.

As for implementation details, it could be i) a fixed list of common locales, ii) just English but with the necessary tools and configuration to easily add more locales, or iii) something more elaborate, like downloading all locales but enabling them selectively via some env var (no idea if that's actually doable). I think the simplest would be ii), and then in Skosmos we would add an extra step to install Finnish, Swedish, German, etc.

[^1] Compiled the program from this Oracle docs page with -source 8 -target 8 inside the Alpine container.
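
The program from the footnote is not reproduced in this thread. A minimal sketch of what such a locale listing could look like, assuming it simply prints Locale.getAvailableLocales(): the class name Available is taken from the command shown above, while the helper localeNames is a hypothetical name. It sticks to Java 8 APIs so that it still compiles with -source 8 -target 8.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class Available {
    /**
     * Returns the names of all locales the running JVM reports, sorted.
     * The root locale, whose name is the empty string, is skipped.
     */
    static List<String> localeNames() {
        return Arrays.stream(Locale.getAvailableLocales())
                .map(Locale::toString)
                .filter(name -> !name.isEmpty())
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One locale name per line, e.g. en_US
        localeNames().forEach(System.out::println);
    }
}
```

Unlike the listing above, this sketch sorts the names and drops the empty root entry, which makes two runs easier to compare.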

Are you interested in contributing a solution yourself?

Yes

osma commented 1 year ago

Thanks for opening the issue @kinow ! More specifically, the problem is that ARQ collation doesn't work with the Fuseki Docker image, apparently due to lack of locale support (though the underlying issue might be different or more complicated than just lack of locales). Copying my analysis from the relevant Skosmos issue:


I tested this with the example query from the Jena ARQ Collation documentation. I put this in collation.rq:

PREFIX arq: <http://jena.apache.org/ARQ/function#>
SELECT ?label WHERE {
    VALUES ?label { "tsahurin kieli"@fi "tšekin kieli"@fi "tulun kieli"@fi "töyhtöhyyppä"@fi }
}
ORDER BY arq:collation("fi", ?label)

then ran it against the Fuseki Docker container using rsparql (from Jena) and the result was this:

$ rsparql --query collation.rq --service http://localhost:9030/skosmos/sparql
-----------------------
| label               |
=======================
| "töyhtöhyyppä"@fi   |
| "tsahurin kieli"@fi |
| "tšekin kieli"@fi   |
| "tulun kieli"@fi    |
-----------------------

This is not the correct, locale-aware order; tsahurin kieli should be first and töyhtöhyyppä last. So I think that the Fuseki container configuration should be modified to enable support for locale-aware collation.
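
To check the JVM side without going through Fuseki, the same four labels can be sorted directly with java.text.Collator. This is only a sketch: it assumes that ARQ collation ends up using a Collator for the requested language tag, which this thread does not confirm, and the names FinnishCollation and sortWith are made up for illustration. The non-ASCII letters are written as Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FinnishCollation {
    /** Sorts a copy of the labels with the collator the JVM returns for the language tag. */
    static List<String> sortWith(String languageTag, String... labels) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        String[] copy = labels.clone();
        Arrays.sort(copy, collator);
        return Arrays.asList(copy);
    }

    public static void main(String[] args) {
        // Same labels as in collation.rq:
        // \u0161 = s with caron, \u00f6 = o with diaeresis, \u00e4 = a with diaeresis
        System.out.println(sortWith("fi",
                "tsahurin kieli",
                "t\u0161ekin kieli",
                "tulun kieli",
                "t\u00f6yht\u00f6hyypp\u00e4"));
    }
}
```

On a JVM that has Finnish collation data, the printed order should match the one described above as correct. On a JVM without it, the collator falls back to more generic rules and the order may differ.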

osma commented 1 year ago

I was able to install extra locales into the Fuseki image by modifying the Dockerfile like this:

--- a/dockerfiles/jena-fuseki2-docker/Dockerfile
+++ b/dockerfiles/jena-fuseki2-docker/Dockerfile
@@ -94,6 +94,9 @@ ARG JENA_GROUP=$JENA_USER
 ARG JENA_GID=1000
 ARG JENA_UID=1000

+# Install locales needed by ARQ collation
+RUN apk add --no-cache musl-locales
+
 # Run as this user
 # -H : no home directory
 # -D : no password
@@ -116,7 +119,8 @@ ENV \
     JAVA_OPTIONS="-Xmx2048m -Xms2048m"  \
     JENA_VERSION=${JENA_VERSION}        \
     FUSEKI_JAR="${FUSEKI_JAR}"          \
-    FUSEKI_DIR="${FUSEKI_DIR}"
+    FUSEKI_DIR="${FUSEKI_DIR}"         \
+    MUSL_LOCPATH=/usr/share/i18n/locales/musl

 EXPOSE 3030

locale -a within the container now works and shows lots of installed locales, including fi_FI. However, the above query that tests ARQ collation support still doesn't return the correct result. So the problem may be elsewhere, or there is some other piece missing from the container.

afs commented 1 year ago

It appears Jena uses Locale.forLanguageTag (from NodeValueSort.compareTo), which returns the "best" Locale. It seems to default to the host/installation locale.
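
For what it is worth, Locale.forLanguageTag on its own only parses the tag and does not look at which locale data is installed, so any fallback would happen later, when a service such as Collator is looked up for that Locale. A small sketch to tell the two steps apart. LocaleLookup and hasCollatorFor are hypothetical names, and the sketch does not touch Jena code at all.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class LocaleLookup {
    /** True if the JVM lists collation support for exactly this locale. */
    static boolean hasCollatorFor(Locale locale) {
        return Arrays.asList(Collator.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        // Parsing works for any well-formed tag, installed or not.
        Locale fi = Locale.forLanguageTag("fi");
        System.out.println("parsed language: " + fi.getLanguage());

        // Whether the JVM can actually back the locale with data is a separate question.
        System.out.println("collation data listed: " + hasCollatorFor(fi));
    }
}
```

If the second line prints false inside the container, the JVM has no Finnish collation data to offer, regardless of what locale -a reports.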

kinow commented 1 year ago

Hi @osma

> locale -a within the container now works and shows lots of installed locales, including fi_FI. However, the above query that tests ARQ collation support still doesn't return the correct result. So the problem may be elsewhere, or there is some other piece missing from the container.

I reproduced what you did, and got locale -a to return a long list of locales.

Then I ran the Java program from the Oracle docs (the one from my initial comment) in the same container where locale -a now works, and I get:

/fuseki $ /opt/java-minimal/bin/java Available

en_US_POSIX
en
en_US_#Latn
en_US

I think something is missing in the JVM configuration to be able to locate the locales, so that your ARQ query works as well.

kinow commented 1 year ago

Found an easier way to visualize the available locales in a JVM:

/opt/java-minimal/bin/java -XshowSettings -version

kinow commented 1 year ago

Had a look at other Oracle docs, and also at the OpenJDK code, but couldn't figure out how to tell the JVM to load the other system locales. I also tested this command:

/opt/java-minimal/bin/java -Djava.locale.providers="HOST,SPI,CLDR,JRE" Available

But it still returned only the en* locales. I had a look at the Docker image docs and issues to see whether this was perhaps intentional in the shipped JVM, to save space, but found nothing related to enabling more locales.

kinow commented 1 year ago

Ha! It's the JVM shipped with the Docker base image for Jena.

Using the same container, I tried the following:

$ apk add openjdk11
$ java Available

nn
ar_JO
bg
kea
nds
zu
am_ET
fr_DZ
ti_ET
bo_CN
hsb
qu_EC
ta_SG
lv
en_NU
en_MS
zh_SG_#Hans
en_GG
en_JM
vo
kkj
sv_SE
sr_ME
es_BO
dz_BT
mer
sah
en_ZM
fr_ML
br
ha_NG
ar_SA
fa_AF
dsb_DE
sk
os_GE
ml
en_MT
en_LR
ar_TD
en_GH
en_IL
cs
sv
el
tzm_MA
af
sw_UG
ses_ML
smn
tk_TM
sr_ME_#Cyrl
ar_EG
dsb
lkt_US
vai_LR_#Latn
ji_001
yo_NG
se_NO
khq
sw_CD
vo_001
en_PW
pl_PL
fil_PH
it_VA
sr_CS
ne_IN
es_PH
es_ES
es_CO
bg_BG
ji
ar_EH
bs_BA_#Latn
en_VC
nds_DE
nb_SJ
es_US
agq
hsb_DE
en_US_POSIX
en_150
ar_SD
en_KN
ha_NE
pt_MO
ebu
ro_RO
zh__#Hans
lb_LU
sr_ME_#Latn
es_GT
so_KE
dje_NE
bas_CM
fr_PM
ar_KM
fr_MG
no_NO_NY
es_CL
mn
agq_CM
kam_KE
teo
tr_TR
eu
fa_IR
en_MO
wo
shi__#Tfng
en_BZ
sq_AL
ar_MR
es_DO
ru
twq_NE
az
nmg_CM
fa
kl_GL
en_NR
nd
kk
az__#Cyrl
en_MP
en_GD
tk
hy
shi__#Latn
en_BW
en_AU
en_CY
kab_DZ
kde_TZ
ta_MY
ti_ER
nus_SS
en_RW
nd_ZW
sv_FI
ksb
luo
lb
ne
en_IE
ln_CD
zh_SG
en_KI
nnh_CM
om_ET
no
ja_JP
my
ka
ar_IL
mgh
or_IN
fr_MF
shi
kl
en_SZ
rwk_TZ
zh
mgh_MZ
es_PE
saq
az__#Latn
ta
en_GB
lag
zh_HK_#Hant
ar_SY
ksf_CM
bo
kk_KZ
tt_RU
es_PA
om_KE
ar_PS
en_AS
fr_VU
bez
zh_TW
kln
fr_MC
kw
pt_MZ
fr_NE
vai__#Latn
ksb_TZ
ksh
ur_IN
ln
en_JE
gsw_CH
ln_CF
en_CX
luy_KE
pt
en_AT
gl
kkj_CM
sr__#Cyrl
yue_CN_#Hans
es_GQ
kn_IN
ar_YE
to
en_SX
ga
qu
ru_KZ
en_TZ
et
en_PR
mua
ko_KP
in
ps
sn
nl_SR
rof
en_BS
km
zgh
fr_NC
be
gv
es
dua
gd_GB
jgo
nl_BQ
fr_CM
gsw
uz_UZ_#Cyrl
pa_IN_#Guru
en_KE
guz
mfe
asa_TZ
teo_UG
ja
fr_SN
or
brx
fr_MA
pt_LU
fr_BL
en_NL
mgo_CM
ln_CG
te
ko_KR
mr_IN
ha
sl
el_CY
es_MX
lrc_IR
gsw_FR
es_HN
hu_HU
ff_SN
sbp
sq_MK
sr_BA_#Cyrl
fi
uz
bs__#Cyrl
et_EE
sr__#Latn
en_SS
sw
bo_IN
fy_NL
ar_OM
tr_CY
nmg
rm
en_MG
fr_BI
uz_UZ_#Latn
bn
dua_CM
de_IT
lrc_IQ
vai__#Vaii
kn
fr_TN
sr_RS
de_CH
bn_BD
nnh
fr_PF
gu
en_ZA
pt_GQ
vun_TZ
jmc_TZ
en_TV
lo
fr_FR
en_PN
en_MH
fr_BJ
zh__#Hant
cu_RU
zh_HK_#Hans
nl_NL
sah_RU
en_GY
ps_AF
bs__#Latn
ky
mas
dyo_SN
os
bs_BA_#Cyrl
nl_CW
ar_DZ
sk_SK
pt_CH
fr_GQ
ff_CM
am
en_NG
fr_CI
ki_KE
en_PK
zh_CN
en_LC
rw
brx_IN
wo_SN
iw
gv_IM
mk_MK
en_TT
dav
sl_SI
fr_HT
te_IN
nl_SX
lrc
ses
ce
fr_CG
fr_BE
jgo_CM
mt_MT
es_VE
mg
mr
mer_KE
ko
nds_NL
en_BM
nb_NO
ak
seh
kde
dz
kea_CV
mgo
vi_VN
en_VU
en_US
to_TO
mfe_MU
seh_MZ
fr_BF
pa__#Guru
it_SM
fr_YT
gu_IN
ii_CN
pa_PK_#Arab
ast
fr_RE
fi_FI
yue__#Hans
ca_FR
sr_BA_#Latn
bn_IN
fr_GP
pa
zgh_MA
fr_DJ
rn
tg
rwk
uk_UA
en_NF
fr_CH
hu
twq
ha_GH
sr_XK_#Cyrl
bm
ar_SS
en_GU
nl_AW
de_BE
en_AI
en_CM
xog_UG
cs_CZ
ca_ES
tr
cgg
rm_CH
nyn_UG
ru_MD
ms_MY
ta_LK
ksf
en_TO
cy
en_PG
fr_CF
pt_TL
fr
sq
tg_TJ
en_ER
qu_PE
sr_BA
es_PY
de
kok_IN
es_EC
lg_UG
zu_ZA
fr_TG
sr_XK_#Latn
en_PH
ig_NG
fr_GN
prg_001
cgg_UG
zh_MO_#Hans
ksh_DE
lg
ru_RU
se_FI
ff
en_DM
en_CK
sd
ar_MA
en_BI
ga_IE
en_AG
fr_TD
en_WS
fr_LU
ebu_KE
bem_ZM
xog
ewo_CM
fr_CD
so
rn_BI
en_NA
ar_ER
kab
ms
nus
sn_ZW
prg
iw_IL
ug
es_EA
th_TH_TH_#u-nu-thai
hi
fr_SC
ca_IT
lag_TZ
en_SL
teo_KE
ca_AD
no_NO
zh_MO_#Hant
en_SH
vai
qu_BO
haw_US
vi
fr_CA
sq_XK
dyo
de_LU
en_KY
mt
it_CH
de_DE
si_LK
luo_KE
en_DK
yav
so_DJ
eo
it_IT
lt_LT
kam
ar_SO
en_ZW
ro
eo_001
ee
en_UM
nn_NO
fr_MU
se_SE
pl
en_TK
en_SI
mua_CM
ur
uz__#Arab
vai_LR_#Vaii
saq_KE
se
pt_GW
lo_LA
chr
af_ZA
ar_LB
ms_SG
ee_TG
ln_AO
ff_GN
be_BY
yue__#Hant
in_ID
es_BZ
ar_AE
hr_HR
luy
as
rof_TZ
it
pt_CV
ks_IN
uk
my_MM
ur_PK
mn_MN
en_FM
da_DK
es_PR
wae_CH
mzn
en_BE
ii
tt
fr_WF
ru_BY
mzn_IR
naq
fo_DK
en_SG
ee_GH
ar_BH
kln_KE
tzm
fur
om
hi_IN
en_CH
asa
yo_BJ
fo_FO
ast_ES
fr_KM
bez_TZ
fr_MQ
en_SD
es_AR
en_MY
ja_JP_JP_#u-ca-japanese
es_SV
pt_BR
ml_IN
sbp_TZ
fil
en_FK
uz__#Cyrl
is_IS
yue_HK_#Hant
hy_AM
en_GM
en_DG
fo
ne_NP
pt_ST
hr
ak_GH
lt
uz_AF_#Arab
fur_IT
ta_IN
ccp
en_SE
fr_GF
lkt
zh_CN_#Hans
is
es_419
si
pt_AO
en_001
en
guz_KE
gsw_LI
ccp_BD
es_IC
ca
ru_KG
fr_MR
ar_TN
ks
zh_TW_#Hant
bm_ML
kw_GB
ug_CN
as_IN
es_BR
zh_HK
khq_ML
sw_KE
en_SB
rw_RW
chr_US
th_TH
shi_MA_#Tfng
ar_IQ
nyn
yue
jmc
en_MW
naq_NA
mk
en_IO
ar_QA
en_DE
pa__#Arab
en_CC
bs
ro_MD
en_FI
pt_PT
fy
az_AZ_#Cyrl
th
dav_KE
ckb_IQ
shi_MA_#Latn
es_CU
ar
en_SC
en_VI
haw
eu_ES
en_UG
dje
en_NZ
bas
es_UY
mas_KE
ru_UA
sg_CF
yav_CM
uz__#Latn
el_GR
sg
da_GL
en_FJ
de_LI
en_BB
km_KH
smn_FI
hr_BA
de_AT
ckb_IR
nl
lu_CD
ca_ES_VALENCIA
ar_001
so_SO
lv_LV
ckb
es_CR
fr_GA
ar_KW
ar_LY
sr
sr_RS_#Cyrl
bem
en_MU
da
wae
gl_ES
en_IM
az_AZ_#Latn
en_LS
ig
en_HK
en_GI
ce_RU
gd
en_CA
ka_GE
fr_SY
sw_TZ
fr_RW
so_ET
nl_BE
ar_DJ
mg_MG
cy_GB
en_VG
cu
os_RU
sr_RS_#Latn
en_TC
ky_KG
sv_AX
af_NA
vun
en_IN
lu
ki
yo
es_NI
nb
ff_MR
sd_PK
mas_TZ
ti
kok
ewo
ms_BN
ccp_IN
br_FR

The command java -XshowSettings -version also returned the complete list of locales.

kinow commented 1 year ago

I also tried a fresh Alpine image and downloaded a release of Eclipse Temurin from the same URL used by the Jena build, https://github.com/adoptium/temurin17-binaries/releases/.

It worked and displayed all the locales in the system. Looking at the Jena build, I can see this part that could be responsible for the missing locales:

[screenshot: the jlink step in the Jena Docker build]

kinow commented 1 year ago

So the reason for having only the English locale might be the jlink command above, and this Oracle docs page, I think, solves the rest of the mystery:

[screenshot: excerpt from the Oracle docs page]

I believe we will have to think about how to customize the jlink command to include either all locales or more locales. I ran out of time this morning to continue troubleshooting it but might have more time later this week :wave:

osma commented 1 year ago

Great detective work @kinow!!

osma commented 1 year ago

Reading the JDK documentation, I suppose just adding jdk.localedata to JDEPS_EXTRA above could do the trick? I hope it won't increase the size of the image too much...
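
One way to check from inside the container whether the jlink-built runtime carries the extra locale data is to ask the JVM which modules it has. The sketch below assumes that the non-English locale data lives in the jdk.localedata module, which has not been verified against the Jena image here, and it needs Java 9 or later. LocaleDataCheck and hasModule are illustrative names.

```java
public class LocaleDataCheck {
    /** True if the named module is part of the boot layer of the running JVM. */
    static boolean hasModule(String name) {
        return ModuleLayer.boot().findModule(name).isPresent();
    }

    public static void main(String[] args) {
        // java.base is always there; jdk.localedata is the module of interest.
        System.out.println("java.base present: " + hasModule("java.base"));
        System.out.println("jdk.localedata present: " + hasModule("jdk.localedata"));
    }
}
```

Running this with /opt/java-minimal/bin/java before and after changing the jlink module list would show whether the module made it into the image, and the image sizes can then be compared.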