HOST-Oman / scribus

Project for adding complex text layout to Scribus DTP program

Default_Ignorable_Code_Point in spell checking and hyphenation? #215

Closed sommerluk closed 7 years ago

sommerluk commented 8 years ago

Some characters in Unicode have the property Default_Ignorable_Code_Point. This includes characters like the soft hyphen U+00AD and the ZWNJ zero width non-joiner U+200C. Most format characters have this property. With CTL and OpenType support available in Scribus, the usage of these characters will increase.

The question is how to treat these characters in spell checking and hyphenation.

Spell checking

Currently, the spell checking code seems to read words that contain a Default_Ignorable_Code_Point, but it applies the correction only to the characters before the first Default_Ignorable_Code_Point. Result: the second part of the word is duplicated. This is always wrong.
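The duplication described above can be avoided if the dictionary lookup sees the word without ignorable characters while the correction replaces the whole original span. The following is only a minimal sketch of that idea, not the code from the Scribus patch: `isIgnorable`, `stripIgnorables` and `applyCorrection` are hypothetical names, and the predicate covers only soft hyphen and ZWNJ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical predicate: only the two characters discussed in this thread.
static bool isIgnorable(char32_t c)
{
    return c == 0x00AD   // SOFT HYPHEN
        || c == 0x200C;  // ZERO WIDTH NON-JOINER
}

// Returns the word as the dictionary should see it, without ignorable characters.
std::u32string stripIgnorables(const std::u32string& word)
{
    std::u32string out;
    for (char32_t c : word)
        if (!isIgnorable(c))
            out.push_back(c);
    return out;
}

// Replaces the whole original word span [start, start + length), including any
// ignorable characters inside it, so that no tail of the old word is left behind.
std::u32string applyCorrection(const std::u32string& text, std::size_t start,
                               std::size_t length, const std::u32string& replacement)
{
    std::u32string result = text;
    result.replace(start, length, replacement);
    return result;
}
```

Note that this sketch drops the ZWNJ from the corrected word. Whether the ignorable characters should be reinserted into the replacement is a separate design decision that is not settled in this thread.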

Possible solution:

Hyphenation

Currently, the automatic hyphenation does not work as expected when characters like ZWNJ are present. Example: The German word “Auflage” should be hyphenated “Auf-la-ge”, and this works correctly in Scribus when no ZWNJ is there. But when there is a ZWNJ between “Auf” and “lage” then the first hyphenation point is not found. It’s only “Aufla-ge”.
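One way to keep ignorable characters from hiding hyphenation points is to hyphenate a stripped copy of the word and map the break positions back to the original. This is a sketch under assumptions, not the Scribus implementation: `StrippedWord`, `stripWithMap` and `mapBreaksToOriginal` are hypothetical names, and the break positions are assumed to come from whatever hyphenator is in use.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct StrippedWord {
    std::u32string text;               // word without ignorable characters
    std::vector<std::size_t> origPos;  // origPos[i] = index of text[i] in the original word
};

// Removes soft hyphen and ZWNJ and remembers where each kept character came from.
StrippedWord stripWithMap(const std::u32string& word)
{
    StrippedWord s;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (word[i] == 0x00AD || word[i] == 0x200C)
            continue;
        s.text.push_back(word[i]);
        s.origPos.push_back(i);
    }
    return s;
}

// A break at k means "a hyphen may go before s.text[k]".
// The result uses the same convention, but with indices into the original word.
std::vector<std::size_t> mapBreaksToOriginal(const StrippedWord& s,
                                             const std::vector<std::size_t>& breaks)
{
    std::vector<std::size_t> out;
    for (std::size_t k : breaks)
        if (k < s.origPos.size())
            out.push_back(s.origPos[k]);
    return out;
}
```

For “Auf”, ZWNJ, “lage”, the stripped word is “Auflage”; the breaks 3 and 5 (Auf-la-ge) map to the original indices 4 and 6. With this convention the first break falls after the ZWNJ; placing it before the ZWNJ would be an equally possible choice.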

Possible solution:

Fahad-Alsaidi commented 8 years ago

For spell checking we use the ICU word iterator. Thus, I think this should be handled in the ICU library.

Fahad-Alsaidi commented 7 years ago

Hi @sommerluk could you please provide a sample file for the problem?

sommerluk commented 7 years ago

sample files.zip

Fahad-Alsaidi commented 7 years ago

The spell checking part should be fixed by 56b25215480ecbb3bdf90fa1159a60321033a5b4 and 7db0ad36356a458aca064e44bd35c9ee38cffb2e. @sommerluk please test.

Fahad-Alsaidi commented 7 years ago

@sommerluk if things are working fine for you, then we are done here, because I prefer the first solution until somebody is brave enough to implement https://github.com/HOST-Oman/scribus/issues/145.

Fahad-Alsaidi commented 7 years ago

I am closing this now. If you have a problem, please file a new bug report.

sommerluk commented 7 years ago

Sorry for answering late. I did not forget it, but building scribus-ctl from source did not work for me and I could not figure out how to run the AppImage (openSUSE Leap within VirtualBox) either:

realPath called with a relative path './share/pixmaps/', please fix
realPath called with a relative path './share/icons/', please fix
pci id for fd 11: 80ee:beef, driver (null)
libGL error: core dri or dri2 extension not found
libGL error: failed to load driver: vboxvideo
pathForIcon: Unable to load icon ././/share/scribus/icons/1_5_1/AppIcon.png: File not found
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
ImportError: No module named site
Scribus Crash
-------------
Scribus crashes due to Signal #11
Speicherzugriffsfehler

So, I could not actually test it. If there is an easy way to get this working, I can do some testing.

However, two observations about the patch:

  1. The list of the two characters (ZWNJ and SOFT HYPHEN) is duplicated in two different places. This might be dangerous: in the future, somebody who does not know that the two lists must be kept synchronized could change only one of them while leaving the other unchanged, and this would lead to unexpected results.
  2. The list contains only two characters. I’m confident that this is enough for German typesetting (at least 99.9% of cases) and probably for all Latin scripts. I’m not so sure about other scripts.

The complete list of Default_Ignorable_Code_Point in Unicode 9 from http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt is:

# ================================================

# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0605, 06DD, 070F, 08E2, 110BD (exceptional Cf characters that should be visible)

00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
061C          ; Default_Ignorable_Code_Point # Cf       ARABIC LETTER MARK
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D    ; Default_Ignorable_Code_Point # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E          ; Default_Ignorable_Code_Point # Cf       MONGOLIAN VOWEL SEPARATOR
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E    ; Default_Ignorable_Code_Point # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
2066..206F    ; Default_Ignorable_Code_Point # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF          ; Default_Ignorable_Code_Point # Cf       ZERO WIDTH NO-BREAK SPACE
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3  ; Default_Ignorable_Code_Point # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Default_Ignorable_Code_Point # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Default_Ignorable_Code_Point # Cf       LANGUAGE TAG
E0002..E001F  ; Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Default_Ignorable_Code_Point # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF  ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

# Total code points: 4173

# ================================================

I do not have the knowledge to tell, but maybe ZWJ and the script-specific characters (Arabic, Hangul, Khmer, Mongolian) are relevant as well. Would it be overkill to simply use the whole list?
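Using the whole list need not be expensive, and keeping it in one place would also address the duplication concern from point 1. The following is a minimal sketch, assuming a single shared predicate that both the spell checker and the hyphenator would call; `CodePointRange`, `kDefaultIgnorable`, `isDefaultIgnorable` and `defaultIgnorableCount` are hypothetical names, and adjacent ranges from the Unicode 9 list above have been merged.

```cpp
#include <cstddef>
#include <cstdint>

// Inclusive code point range.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// Ranges copied from the Unicode 9 list quoted above, with adjacent ranges
// merged (for example 2060..2064, 2065 and 2066..206F become 2060..206F).
static const CodePointRange kDefaultIgnorable[] = {
    {0x00AD, 0x00AD},   {0x034F, 0x034F},   {0x061C, 0x061C},
    {0x115F, 0x1160},   {0x17B4, 0x17B5},   {0x180B, 0x180E},
    {0x200B, 0x200F},   {0x202A, 0x202E},   {0x2060, 0x206F},
    {0x3164, 0x3164},   {0xFE00, 0xFE0F},   {0xFEFF, 0xFEFF},
    {0xFFA0, 0xFFA0},   {0xFFF0, 0xFFF8},   {0x1BCA0, 0x1BCA3},
    {0x1D173, 0x1D17A}, {0xE0000, 0xE0FFF},
};

// Single definition of "ignorable" that every caller shares.
bool isDefaultIgnorable(std::uint32_t cp)
{
    for (const CodePointRange& r : kDefaultIgnorable)
        if (cp >= r.first && cp <= r.last)
            return true;
    return false;
}

// Sanity check against the "Total code points" line of the Unicode data file.
std::size_t defaultIgnorableCount()
{
    std::size_t n = 0;
    for (const CodePointRange& r : kDefaultIgnorable)
        n += r.last - r.first + 1;
    return n;
}
```

Since the spell checker already uses ICU, another option might be to call ICU's `u_hasBinaryProperty(c, UCHAR_DEFAULT_IGNORABLE_CODE_POINT)` instead of keeping a hand-written table; that would follow whichever Unicode version the installed ICU implements.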

Fahad-Alsaidi commented 7 years ago

@sommerluk could you try this solution for appimage?

I am reopening this for now.

sommerluk commented 7 years ago

It does not work. But the error message has changed:

:~> ./scribus-git217b3eb-glibc2.14-x86-64.appimage
realPath called with a relative path './share/pixmaps/', please fix
realPath called with a relative path './share/icons/', please fix
pci id for fd 11: 80ee:beef, driver (null)
libGL error: core dri or dri2 extension not found
libGL error: failed to load driver: vboxvideo
pathForIcon: Unable to load icon ././/share/scribus/icons/1_5_1/AppIcon.png: File not found
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site.py", line 564, in <module>
    main()
  File "/usr/lib64/python2.7/site.py", line 546, in main
    known_paths = addusersitepackages(known_paths)
  File "/usr/lib64/python2.7/site.py", line 276, in addusersitepackages
    user_site = getusersitepackages(kind)
  File "/usr/lib64/python2.7/site.py", line 244, in getusersitepackages
    user_base = getuserbase() # this will also set USER_BASE
  File "/usr/lib64/python2.7/site.py", line 230, in getuserbase
    from sysconfig import get_config_var
  File "/usr/lib64/python2.7/sysconfig.py", line 10, in <module>
    'stdlib': '{base}/'+sys.lib+'/python{py_version_short}',
AttributeError: 'module' object has no attribute 'lib'
Scribus Crash
-------------
Scribus crashes due to Signal #11
Speicherzugriffsfehler

Fahad-Alsaidi commented 7 years ago

@probonopd can you help with the problem above?

Fahad-Alsaidi commented 7 years ago

@sommerluk what is the problem with building the CTL branch?

probonopd commented 7 years ago

Looks like it is having trouble finding a path.

A real solution might be to change the upstream code to be fully relocatable, i.e., never use absolute paths that are fixed at compile time. See https://github.com/limbahq/binreloc for more information.

sommerluk commented 7 years ago

@sommerluk what is the problem with building CTL branch?

cmake .
-- Shared Library Flags: 
-- Scribus 1.5.3.svn will be built and installed into /usr/local
-- Machine: x86_64-suse-linux, void pointer size: 8
-- Found target X86_64
-- Building for target x86_64-suse-linux
-- Using standard ApplicationDataDir. You can change it with -DAPPLICATION_DATA_DIR
-- ----- USE QT 5-----
-- ----- USE QT Widgets-----
-- ----- USE Qt5Gui -----
-- ----- USE QT 5 XML -----
-- ----- USE Qt5Network -----
-- ----- USE Qt5OpenGL -----
-- ----- USE Qt5LinguistTools -----
-- ----- USE Qt5Quick -----
-- ----- USE Qt5PrintSupport -----
-- Qt VERSION: 5.5.1
ZLIB Library Found OK
No OSG found, building without 3D Extension
JPEG Library Found OK
TIFF Library Found OK
Python Library Found OK
-- FreeType2 Library Found OK
CAIRO Library Found OK
CUPS Library Found OK
LIBXML2 Library Found OK
LCMS 2 ReleaseLibrary: /usr/lib64/liblcms2.so
LCMS 2 Debug Library: LCMS2_LIBRARY_DEBUG-NOTFOUND
LCMS 2 Library: /usr/lib64/liblcms2.so
LittleCMS-2 Library Found OK
FontConfig Found OK
-- Could NOT find HUNSPELL (missing:  HUNSPELL_LIBRARIES HUNSPELL_INCLUDE_DIR) 
Hunspell or its developer libraries NOT found - Disabling support for spell checking
PoDoFo NOT found - Disabling support for PDF embedded in AI
-- Boost version: 1.54.0
Boost Library Found OK
Building without GraphicksMagick (use -DWANT_GRAPHICSMAGICK=1 to enable)
-- Found poppler
-- Found poppler libs: /usr/lib64/libpoppler.so
-- Found poppler includes: /usr/include/poppler
-- checking for module 'librevenge-0.0'
--   package 'librevenge-0.0' not found
RPATH: lib/scribus/plugins/;
-- Qt5::CoreQt5::WidgetsQt5::GuiQt5::XmlQt5::NetworkQt5::OpenGL/usr/lib64/libxml2.so/usr/lib64/libz.so
-- checking for module 'libwpg-0.2'
--   package 'libwpg-0.2' not found
-- checking for module 'libmspub-0.0<=0.1'
--   package 'libmspub-0.0<=0.1' not found
-- checking for module 'libwpg-0.2'
--   package 'libwpg-0.2' not found
-- Building with Scripter 1
-- No source header files will be installed
-- /home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/resources/translations
-- The following GUI languages will be installed: 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy

works fine.

But then:

> make
[  0%] Built target scribus_zip_lib
[  1%] Built target scribus_colormgmt_lib
[  2%] Built target scribus_desaxe_lib
[  2%] Built target scribus_fonts_lib
[  3%] Built target scribus_styles_lib
[  3%] Building CXX object scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp: In member function ‘uint RunIndex::search(int) const’:
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp:17:66: error: ‘const class std::vector<unsigned int>’ has no member named ‘cbegin’
  std::vector<uint>::const_iterator it = std::upper_bound(runEnds.cbegin(), runEnds.cend(), pos);
                                                                  ^
/home/sommerluk/Dokumente/Ligatursatz/scribus/scribus-working-copy/scribus/text/index.cpp:17:84: error: ‘const class std::vector<unsigned int>’ has no member named ‘cend’
  std::vector<uint>::const_iterator it = std::upper_bound(runEnds.cbegin(), runEnds.cend(), pos);
                                                                                    ^
scribus/text/CMakeFiles/scribus_text_lib.dir/build.make:69: recipe for target 'scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o' failed
make[2]: *** [scribus/text/CMakeFiles/scribus_text_lib.dir/index.cpp.o] Error 1
CMakeFiles/Makefile2:448: recipe for target 'scribus/text/CMakeFiles/scribus_text_lib.dir/all' failed
make[1]: *** [scribus/text/CMakeFiles/scribus_text_lib.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Fahad-Alsaidi commented 7 years ago

@sommerluk it seems that your compiler does not enable C++11 by default. Put CXX='g++ -std=c++11' before the cmake command, as follows:

CXX='g++ -std=c++11' cmake . -DCMAKE_INSTALL_PREFIX=/usr -DWANT_DEBUG=1

Alternatively, use Qt 5.7, which requires C++11.
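For context, `cbegin()` and `cend()` were added to `std::vector` in C++11, which is why the compiler rejects them without that flag. The sketch below is a stand-in for the failing lookup, not the actual `RunIndex` class; `searchRun` is a hypothetical free function. It shows that the same search can be written with `begin()`/`end()`, which already return `const_iterator` on a const vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first run whose end is greater than pos,
// or runEnds.size() if there is none. runEnds must be sorted ascending.
std::size_t searchRun(const std::vector<unsigned int>& runEnds, unsigned int pos)
{
    // On a const vector, begin()/end() yield const_iterator, so this line
    // does not depend on the C++11-only cbegin()/cend() members.
    std::vector<unsigned int>::const_iterator it =
        std::upper_bound(runEnds.begin(), runEnds.end(), pos);
    return static_cast<std::size_t>(it - runEnds.begin());
}
```

This only illustrates the cause of the error message. Other parts of the branch may still need C++11, so passing the flag as described above remains the practical fix.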

sommerluk commented 7 years ago

@Fahad-Alsaidi Thanks. Compiling works now.

sommerluk commented 7 years ago

Tested the commit. Spell checking works fine for the German test cases.

sommerluk commented 7 years ago

Nevertheless, I would like to hear what you think about the two points that I mentioned in a previous comment:

  1. The list of the two characters (ZWNJ and SOFT HYPHEN) is duplicated in two different places. This might be dangerous: in the future, somebody who does not know that the two lists must be kept synchronized could change only one of them while leaving the other unchanged, and this would lead to unexpected results.

  2. The list contains only two characters. I’m confident that this is enough for German typesetting (at least 99.9% of cases) and probably for all Latin scripts. I’m not so sure about other scripts.

Fahad-Alsaidi commented 7 years ago

@sommerluk I agree with you. please look at 930000e20bcf7db0c09d3d50ad0c9969057b7dde , if it looks fine please close this bug.

sommerluk commented 7 years ago

https://github.com/HOST-Oman/scribus/commit/930000e20bcf7db0c09d3d50ad0c9969057b7dde looks good to me. I’ve compiled it and tested it, and it works fine.

Thanks a lot for the effort!

Closing this issue. Later I’ll open a new one for the hyphenation part only…