Add Japanese support to DepCCGParser

KentaroAOKI commented 2 years ago

Updated DepCCGParser to support Japanese. The sample code is as follows.

1. Prepare depccg.

pip install cython numpy depccg
depccg_en download
depccg_ja download

2. Install Japanese fonts on Ubuntu.

apt install -y fonts-migmix
rm ~/.cache/matplotlib/fontlist-v330.json
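
The version suffix in the cache file name (v330 above) appears to differ between matplotlib releases, so the exact file may not exist on every machine. A minimal sketch of a version-independent cleanup follows; the helper name clear_font_cache is hypothetical and uses only the standard library.

```python
from pathlib import Path


def clear_font_cache(cache_dir):
    """Delete every fontlist-*.json file in cache_dir and return the removed names."""
    removed = []
    for cache_file in sorted(Path(cache_dir).expanduser().glob("fontlist-*.json")):
        cache_file.unlink()
        removed.append(cache_file.name)
    return removed
```

In a notebook this could be called as clear_font_cache(matplotlib.get_cachedir()), after which matplotlib should rebuild its font list on the next import.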

3. In a Jupyter notebook, set a Japanese font for matplotlib.

import matplotlib
from matplotlib.font_manager import FontProperties

font_path = "/usr/share/fonts/truetype/migmix/migmix-1p-regular.ttf"
font_prop = FontProperties(fname=font_path)
matplotlib.rcParams["font.family"] = font_prop.get_name()

4. Parse a sentence with sentence2diagram in the notebook.

from lambeq import DepCCGParser
from discopy import grammar

parser = DepCCGParser(lang='ja')
diagram = parser.sentence2diagram('これはテストの文です。')
grammar.draw(diagram, figsize=(14,3), fontsize=12)

5. Apply an ansatz to the diagram in the notebook.

from lambeq import AtomicType, IQPAnsatz

# Define atomic types
N = AtomicType.NOUN
S = AtomicType.SENTENCE

# Convert string diagram to quantum circuit
ansatz = IQPAnsatz({N: 1, S: 1}, n_layers=2)
discopy_circuit = ansatz(diagram)
discopy_circuit.draw(figsize=(15,10))

6. Render the circuit with pytket in the notebook.

from pytket.circuit.display import render_circuit_jupyter

tket_circuit = discopy_circuit.to_tk()
render_circuit_jupyter(tket_circuit)

ianyfan commented 2 years ago

Hello, thank you for opening this PR. This is a feature we have wanted to add, so it will be very useful if we can get it merged. It looks like the right code paths are present; however, it would be great if we could work out whether any heavy workloads can be cached, such as saving the supertagger as an instance variable instead of loading it each time. Very happy to discuss things if you want any help.
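
The caching idea can be sketched roughly as below. This is not the code in the PR: LazyModelHolder and fake_load_supertagger are hypothetical names, and the loader is a stand-in for whatever call actually builds the supertagger.

```python
class LazyModelHolder:
    """Hypothetical helper: build an expensive model once, then reuse it."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds the model
        self._model = None     # filled in on first access

    @property
    def model(self):
        # Load only on the first request; later requests reuse the instance.
        if self._model is None:
            self._model = self._loader()
        return self._model


# Stand-in for an expensive load; records how many times it runs.
load_calls = []


def fake_load_supertagger():
    load_calls.append(1)
    return object()


holder = LazyModelHolder(fake_load_supertagger)
first = holder.model
second = holder.model  # served from the instance variable, no second load
```

Keeping the loaded object on the instance means repeated parse calls on the same parser pay the loading cost only once.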

KentaroAOKI commented 2 years ago

Hi, I agree. I'll make the change on Saturday or Sunday.

KentaroAOKI commented 2 years ago

Hi, DepCCGParser has been fixed. Please review the code.

https://github.com/CQCL/lambeq/pull/24/commits/d3a5b4fac3999cd5c877cd31689c3796b7e61be3

ianyfan commented 2 years ago

@KentaroAOKI I have added some changes to the code, mainly formatting and documentation. Otherwise, the implementation seemed great. I just made a small change in how tokenising is handled; could you have a look and let me know if it works for you? Thanks.

dimkart commented 2 years ago

@ianyfan Some of the previous edits seem to have been overwritten by your last commit (e.g. lower-casing the language strings). Please fix.
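
For illustration only, lower-casing a language string before it is checked might look like the sketch below. The names normalise_lang and SUPPORTED_LANGS are hypothetical, and the set of supported codes is an assumption taken from the depccg_en and depccg_ja commands earlier in this thread, not from the PR code.

```python
# Assumed from the depccg_en / depccg_ja download commands above.
SUPPORTED_LANGS = ("en", "ja")


def normalise_lang(lang):
    """Return the lower-cased language code, or raise if it is not supported."""
    code = lang.strip().lower()
    if code not in SUPPORTED_LANGS:
        raise ValueError(f"Unsupported language: {lang!r}")
    return code
```

With this, 'JA' and 'ja' would be treated the same, while an unknown code fails early with a clear message.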

ianyfan commented 2 years ago

Thanks!

masakiowari commented 1 year ago

Hello! We are now working on Japanese QNLP with lambeq, following the installation steps on this page. We found that the current versions of depccg_ja and lambeq cannot handle sentences in which an adjectival verb (keiyō-dōshi) modifies a noun. Below is a list of sentences for which lambeq + depccg_ja cannot create a string diagram:

感動的な映画を見る
曖昧な表現をする
静かな海を見る
健康な男性が歩く
親切な男性がいる
元気な男性が歩く
上品な表現をする
きれいな海を見る
健やかな男性が歩く
和やかな雰囲気を感じる
穏やかな笑顔を浮かべる
正直な男性がいる
有名な男性がいる
にぎやかな雰囲気を感じる
特別な表現をする
複雑な表現をする
まじめな男性がいる
下手な表現をする
便利な本を買う
朗らかな笑顔を浮かべる
幸せな笑顔を浮かべる
好きなスープを食べる
無理な計画を立てる
暇な男性がいる
必要な計画を立てる
邪魔なものをどかす
変な表現をする
自由な表現をする

We would like to hear from anyone who knows how to solve this problem.

By the way, this problem occurs with lambeq versions 0.2.6 and 0.3.1. We installed depccg_ja following the instructions above. Apart from sentences containing adjectival verbs, depccg_ja + lambeq works very well.
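
A list like the one reported here could be gathered with a small helper such as the sketch below. The name find_unparsable is hypothetical; it accepts any parsing callable and treats both a raised exception and a None result as a failure, since which of the two happens may depend on the lambeq version and options.

```python
def find_unparsable(parse, sentences):
    """Return (sentence, reason) pairs for sentences that parse() cannot handle."""
    failures = []
    for sentence in sentences:
        try:
            result = parse(sentence)
        except Exception as exc:  # any parser error counts as a failure
            failures.append((sentence, repr(exc)))
            continue
        if result is None:
            failures.append((sentence, "no diagram returned"))
    return failures
```

Assuming a parser created as in step 4 above, it could be called as find_unparsable(parser.sentence2diagram, ['静かな海を見る', 'これはテストの文です。']).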

dimkart commented 1 year ago

@masakiowari Hi, I have created a top-level issue (#99) for this, in order to give it more visibility. For any follow-up, please use Issue #99.
