Alibaba-NLP / ACE

[ACL-IJCNLP 2021] Automated Concatenation of Embeddings for Structured Prediction

Special characters and punctuation in ACE chunking #30

Closed Aatlantise closed 2 years ago

Aatlantise commented 2 years ago

Hello,

I've noticed that some special characters, and the words attached to them, get omitted in ACE chunking, as seen below:

"text": "This is an experiment: how do special chars & punctuations--like ~ (tilde) or * (star)--behave in ACE? #science",
"chunk_str": "<This> <is> <an experiment> <how> <special chars & punctuations--like> <~> <*> <in> <ACE> .",

"text": "This is an experiment how do special chars and punctuations like tilde or star behave in ACE? science",
"chunk_str": "<This> <is> <an experiment> <how> <special chars and punctuations> <like> <tilde or star> <behave> <in> <ACE> <science> ."

Here, :, (, ), and # seem to be the culprits. For some reason, <do> also disappears in both examples.
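
To pin down exactly which tokens go missing, the input can be compared against the chunk string token by token. This is a minimal sketch, assuming chunks are always wrapped in angle brackets as in the output above; `dropped_tokens` is a hypothetical helper written for this issue, not part of ACE.

```python
import re
import string
from collections import Counter


def _core(token):
    """Strip leading/trailing punctuation; keep pure-punctuation tokens as-is."""
    return token.strip(string.punctuation) or token


def dropped_tokens(text, chunk_str):
    """Return whitespace tokens of `text` that appear in no <...> chunk."""
    # Count every word found inside an angle-bracket chunk.
    chunk_words = Counter()
    for chunk in re.findall(r"<([^<>]*)>", chunk_str):
        chunk_words.update(_core(word) for word in chunk.split())

    missing = []
    for token in text.split():
        key = _core(token)
        if chunk_words[key] > 0:
            chunk_words[key] -= 1  # consume one occurrence
        else:
            missing.append(token)
    return missing
```

Because edge punctuation is stripped before comparing, a token like "experiment:" still matches the chunk word "experiment", so only tokens that are truly absent from the output are reported.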

Do you have a complete list of such characters? I'm trying to write a preprocessing module that strips them from input sentences.
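
A rough sketch of what I have in mind is below. The character set is an assumption: it contains only the four characters I suspect from the examples above, not a confirmed list, and `strip_suspect_chars` is just an illustrative name.

```python
import re

# Assumption: only the characters suspected in this issue, not a confirmed list.
SUSPECT_CHARS = ":()#"


def strip_suspect_chars(sentence, chars=SUSPECT_CHARS):
    """Replace each suspect character with a space, then collapse whitespace."""
    table = str.maketrans({c: " " for c in chars})
    return re.sub(r"\s+", " ", sentence.translate(table)).strip()
```

Replacing with a space rather than deleting keeps words such as "(star)--behave" from fusing with their neighbors when the brackets are removed.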

Many thanks!

wangxinyu0922 commented 2 years ago

Hi,

Does "chunk_str" mean the chunking output of the ACE model? Can you provide the exact input and the output file, or a screenshot, for the problem you encountered? I don't think the code omits these characters.