apertium / apertium-apy

📦 Apertium HTTP Server in Python
https://wiki.apertium.org/wiki/Apertium-apy
GNU General Public License v3.0

Chained translation accumulates unknown word marks #44

Open sushain97 opened 7 years ago

sushain97 commented 7 years ago

e.g.

meow (en->es) *meow
*meow (es->fr) **meow

so

meow (en->fr) **meow

instead of

meow (en->fr) *meow

sushain97 commented 7 years ago

@unhammer ideas (aside from manually removing the marks)?

unhammer commented 7 years ago

I think we'll just have to regex them away (or, into one), like the "remove error marks" thing already does.

(Ideally, lt-proc could switch on-the-fly between marking and non-marking using some stream-signal. I'd rather not start separate pipelines with and without marks; that sounds like more complex if-then logic and more memory usage.)
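
A minimal sketch of the "regex them into one" idea from the comment above. This is not APy's actual code: the helper name `collapse_unknown_marks` and the pattern are hypothetical, and the sketch assumes an unknown word is marked by asterisks placed directly in front of the word, as in the `**meow` example.

```python
import re

# Hypothetical pattern, not taken from APy: two or more asterisks that start
# a token (start of text or preceded by whitespace) and are directly followed
# by a character that is neither whitespace nor another asterisk.
REPEATED_UNKNOWN_MARK = re.compile(r'(?<!\S)\*{2,}(?=[^\s*])')


def collapse_unknown_marks(text: str) -> str:
    """Collapse a run of leading '*' marks on a word into a single '*'."""
    return REPEATED_UNKNOWN_MARK.sub('*', text)


if __name__ == '__main__':
    # Output of a chained en->es->fr translation, where each step added a mark.
    print(collapse_unknown_marks('**meow'))
```

Words carrying a single mark are left alone, and asterisks that stand by themselves (for example in `2 ** 3`) are not touched, since the pattern requires a word character right after the run. Collapsing to one mark, rather than stripping all marks, keeps the output of a chained translation the same as that of a direct one.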

sushain97 commented 7 years ago

@shardulc, could you take care of this? The regex for error marks is already in the APy code iirc.

shardulc commented 7 years ago

44783f93 fixes this, and is only a three-line change after #43 is merged. Not opening a PR right now because all previous commits for chained translation show up too.

sushain97 commented 7 years ago

@shardulc there are unknown word marks other than *, such as #. There should be a more comprehensive regex floating around somewhere in APy or html-tools.

shardulc commented 7 years ago

@sushain97 I took the one in that commit directly from here, which only has the asterisks. Is a different regex used anywhere?

sushain97 commented 7 years ago

Hm... perhaps not.

https://github.com/goavki/streamparser/blob/master/streamparser.py#L28-L38

unhammer commented 7 years ago

In released pairs, we shouldn't have # (if the language data was completely testvoc'd), so I don't think we should worry about those.

SAP-20 commented 3 years ago

Can I work on this issue?

TinoDidriksen commented 3 years ago

@SAP-20, you don't have to ask for that. If you want to fix it, just fix it and submit a PR.