blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

n-quads is UTF-8, but Blazegraph only supports US-ASCII #206

Open jpmccu opened 2 years ago

jpmccu commented 2 years ago

According to the IANA record [1], N-Quads is always supposed to be interpreted as UTF-8, but posting UTF-8 data as N-Quads currently results in it being interpreted as US-ASCII. You claim to support the appropriate charset for each format, so N-Quads needs to honor UTF-8.

Encoding considerations: 8bit The syntax of N-Quads is expressed over code points in Unicode. The encoding is always UTF-8. Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]

[1] https://www.iana.org/assignments/media-types/application/n-quads
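To make the mismatch concrete, here is a minimal Python sketch (not Blazegraph code). It shows why a UTF-8 N-Quads payload cannot be read as US-ASCII, and how the \uXXXX / \UXXXXXXXX escapes from the IANA text can carry the same statement using only ASCII bytes. The IRIs, the literal, and the helper names are made up for illustration, and the escaping helper assumes non-ASCII characters appear only inside IRIs and literals. Whether escaping is an acceptable client-side workaround for a given deployment is an assumption, not something this thread confirms.

```python
# One N-Quads statement with a non-ASCII literal (illustrative values only).
line = '<http://example.org/s> <http://example.org/p> "café ☕" <http://example.org/g> .\n'

# What a conforming client sends: the statement encoded as UTF-8.
payload = line.encode("utf-8")


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if a strict US-ASCII reader could decode the payload."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def escape_non_ascii(text: str) -> str:
    """Rewrite every non-ASCII code point as \\uXXXX or \\UXXXXXXXX."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # U+0080 to U+FFFF
        else:
            out.append("\\U%08X" % cp)     # U+10000 onwards
    return "".join(out)


# A reader that assumes one byte per character mangles the literal instead.
misread = payload.decode("latin-1")

# The escaped form is pure ASCII, so the reader's charset no longer matters.
escaped = escape_non_ascii(line)
```

Here `decodes_as_ascii(payload)` is false, while the escaped line survives an ASCII-only reader and still denotes the same literal once the parser resolves the escapes.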

thompsonbry commented 2 years ago

Jamie, that certainly looks like a bug. Can you work up a PR with a test and a fix? I can point you to the relevant parts of the code if you are unfamiliar with it.

Thanks, Bryan

On Wed, Aug 11, 2021 at 16:03 Jamie McCusker wrote:

[Quoted text omitted; it repeats the original report above.]

jpmccu commented 2 years ago

We've worked around it, and will be retiring support for Blazegraph with Whyis 2.0. We will be moving over to Fuseki, which is easier for us to extend and control. We've had ongoing production stability issues with Blazegraph, especially when we push multiple mutations per second. I haven't reported this because it's hard to reproduce and it seemed like Blazegraph was EOL.

We do have other projects that are using Blazegraph, so I'll ask around the lab and see if anyone wants to take this on.

On Thu, Aug 12, 2021 at 9:57 AM Bryan Thompson wrote:

[Quoted text omitted; it repeats the reply and the original report above.]

-- Jamie McCusker (she/they)

Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
http://tw.rpi.edu

nvbach91 commented 12 months ago

Adding the JVM options -Dfile.encoding=UTF-8 -Dfile.client.encoding=UTF-8 -Dclient.encoding.override=UTF-8 did the trick.
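For anyone scripting the launch, here is a small Python sketch of how those flags might be assembled into a command line. The flags are copied from the comment above as reported. The jar path `blazegraph.jar` and the helper name `launch_command` are assumptions for illustration. The one detail worth noting is ordering: `-D` options must come before `-jar`, otherwise the JVM passes them to the application as ordinary arguments instead of setting system properties.

```python
import shlex

# Flags exactly as reported in the comment above.
ENCODING_FLAGS = [
    "-Dfile.encoding=UTF-8",
    "-Dfile.client.encoding=UTF-8",
    "-Dclient.encoding.override=UTF-8",
]


def launch_command(jar_path):
    """Build a JVM command line with the encoding flags placed before -jar."""
    return ["java", *ENCODING_FLAGS, "-jar", jar_path]


# "blazegraph.jar" is a hypothetical path; substitute the real one.
command = launch_command("blazegraph.jar")

# Shell-safe rendering, suitable for logging or pasting into a terminal.
command_line = shlex.join(command)
print(command_line)
```

The list form can be passed directly to `subprocess.run(command)` if the script is also responsible for starting the server.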