Closed: czhang03 closed this issue 8 years ago
This is gonna be a huge change if we really need to fix this... I will do this in my fork later.
I think "Scrub Tags" is the best name for the option.
In terms of the main issue, I'm not sure there is anything we can do about it. Whenever users use XML or SGML, they had better be familiar with the structural markup of their texts. We can provide warnings in ITM. Amongst these would be "don't use other scrubbing options". The proper way to use tag handling is to scrub the tags first and then perform regular scrubbing in a second pass. This is not a guarantee that any tags they leave will not be broken, but it's the workflow you have to use when dealing with marked-up texts.
We could have tag handling skip the other scrubbing functions by default, but I'm not sure it's worth it.
Oops! Crossing GitHub posts. Don't embark on anything major without reading my reply above.
I thought of a way to fix this without a major change to the code.
But I don't really understand your post above?
My way is to create a function that iterates through the passage and, at each character, detects whether we are inside a tag (this can easily be done with a variable that keeps track of whether we have encountered an equal number of `<` and `>` characters). If we are not in a tag, we do the replace; otherwise we just pass through.
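A minimal sketch of that idea in Python (illustrative names, not the actual Lexos code; a boolean flag plays the role of the `<`/`>` counter):

```python
import string

def scrub_char_by_char(text, remove=string.punctuation + string.digits):
    """Walk the text character by character, tracking whether we are inside
    a tag, and only scrub the characters that fall outside tags."""
    out = []
    in_tag = False
    for ch in text:
        if ch == '<':
            in_tag = True
        if in_tag or ch not in remove:
            out.append(ch)  # keep everything inside tags, scrub outside
        if ch == '>':
            in_tag = False
    return ''.join(out)
```

As the timings further down show, a pure-Python per-character loop like this turns out to be far too slow in practice.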
The best way to do that would be to use an xml parser. But my point is that it is easiest to advise the user to scrub in two steps. First, remove tags without any other scrubbing options. Then scrub again with options like removing punctuation and converting to lower case. They should really be separate processes. It's possible that the user could leave in markup that would be affected by the subsequent scrubbing, but it's unlikely if they are forewarned.
Separating the processes will not help. For example, say I have the text:

```xml
<tei.2>
<body name='chapter1'>
content
</body>
</tei.2>
```

We first handle the tags, choosing to keep all of them:

```xml
<tei.2>
<body name='chapter1'>
content
</body>
</tei.2>
```

and then when we remove punctuation and digits we get:

```xml
<tei>
<body namechapter>
content
<body>
<tei>
```

This is no longer valid XML; if the user wants to take it elsewhere, that will not be helpful.
And if people want to set the end of each `<chapter>` tag (that is, `</chapter>`) as a milestone in the Cutter or Rolling Windows, they cannot do that.
I tried my method, and it is too slow: it slows the process down about 70 times (the original Gen AB scrub takes 0.01 sec; this method takes 0.7 to 1.0 seconds).
I will try to think of another method.
I improved the function to be 10 times faster than the old version!!! It now handles Gen AB in 0.06 sec!
Cheers!
But there may be bugs = =, I will test this more thoroughly tonight.
The beauty of this method is that for a large file without tags, the speed is about the same as `string.transform`: Moby Dick takes only 0.167999982834 sec (`string.transform` uses 0.144000053406 sec). And it keeps all the punctuation and digits inside the tags for files that do have tags.
https://github.com/chantisnake/Lexos/commit/1aec74473b743ea48812c8395e7829736d63607c (this is on my fork, so if you guys decide that this is good, I can push it to master @scottkleinman @mleblanc321).
This is the commit that fixes this issue.
There is no major change to the code (only two lines in scrub.py and a small function in the general functions file), and it is fast and it works!
Well, that's promising. Did you use libxml to parse the text? I probably won't be able to pull until late tonight my time, but I'm looking forward to seeing what you've done.
Oops! That means go ahead and push it.
Yeah, in fact I did not use any parsing; it is just a simple regex.
I will push it now (it is just two lines, so if there is something wrong, you guys can always change it back).
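Something along these lines, presumably (an approximation in Python 3, not the exact committed code): split the text on a tag pattern and translate only the non-tag segments.

```python
import re
import string

TAG = re.compile(r'(<[^>]*>)')  # the capture group keeps the tags in the split output

def scrub_with_regex(text, remove=string.punctuation + string.digits):
    table = str.maketrans('', '', remove)
    parts = TAG.split(text)
    # even indices are plain text, odd indices are the captured tags
    return ''.join(p if i % 2 else p.translate(table)
                   for i, p in enumerate(parts))
```

With this, the punctuation and digits inside `<a name='test'>` survive, while the same characters in the running text are scrubbed.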
If we can, we should prevent tags from being made lower case. XML is case sensitive, so `<p>` and `<P>` are not the same thing.
Okay, I can add that; it is easy.
A comment on this: for now, removing punctuation and digits is handled, but what about lemmas and consolidations? How should they be handled?
For example, if I put in a lemma chapter -> milestone and we encounter a chapter tag, what should we do?

Also, tolower interacts with lemmas and consolidations: they are turned to lower case along with the passage. So if they put in a lemma Chapter -> Milestone, then if they check tolower, `<chapter>` will become `<milestone>` and `<Chapter>` will not (because the lemma is turned into lower case); but if they don't check tolower, `<Chapter>` will become `<Milestone>` and `<chapter>` will not.
If I understand you correctly, this is really a misuse of the lemma function, which is meant for consolidating variant tokens that should be counted as a single semantic unit. If I were doing this, I'd probably wish to use the consolidation function. An example might be to consolidate `<div1>`, `<div2>`, etc. as `<div>`. But even this is just using Lexos for search and replace. Somebody might try to do this. My first thought is caveat emptor, but I'll try to think of how we might handle it.

On the other hand, the example you give touches on a function I want to develop eventually. If you have structural tags like `<chapter>` or `<div type="chapter">`, it should be possible to convert those into milestones for use by Rolling Windows and other tools.
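Just to make the idea concrete, here is a hypothetical sketch of such a converter (nothing like this exists in Lexos yet; the function name and milestone string are placeholders, and attribute-based tags like `<div type="chapter">` would need extra handling):

```python
import re

def structural_tags_to_milestones(text, tag='chapter', milestone='#MILESTONE#'):
    """Replace each closing tag of a structural unit with a milestone string
    that Rolling Windows or the Cutter could split on, and drop the opening
    tags entirely."""
    text = re.sub(r'</%s\s*>' % re.escape(tag), ' ' + milestone + ' ', text)
    text = re.sub(r'<%s(\s[^>]*)?>' % re.escape(tag), '', text)
    return text
```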
cheng: this is a very nice fix; i haven't reviewed the code yet, but just a shout out for your very good work;
RE: lemma and consolidations -- hmm, i need to think about this; returning to Cheng's example:

> so if they put in a lemma Chapter -> Milestone, then if they check tolower, `<chapter>` will become `<milestone>` and `<Chapter>` will not (because the lemma is turned into lower case)

[actually, with lowercase ON, `<Chapter>` would have been converted to `<chapter>`, so both chapter tags will be converted to the milestone tag]

> but if they don't check tolower, `<Chapter>` will become `<Milestone>` but `<chapter>` will not

[which is the "correct" usage, that is, as they requested: use the upper-case lemma only]
Because reimplementing the function as a lowercase_exclude_tag variant would require a huge amount of duplicated code (I don't like duplicated code), and the current way we scrub is inefficient, I wrote a new function that keeps the case of the tags (and, of course, the whitespace, digits, and punctuation inside them) and is significantly faster than the original way we did it.
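In sketch form the idea is roughly this (illustrative names and Python 3 syntax, not the committed code):

```python
import re

TAG = re.compile(r'(<[^>]*>)')  # the capture group keeps the tags in the split output

def scrub_keeping_tags(text, transform):
    """Apply `transform` (e.g. str.lower, or a punctuation-stripping function)
    to the text between tags, leaving every tag byte-for-byte intact."""
    return ''.join(seg if i % 2 else transform(seg)
                   for i, seg in enumerate(TAG.split(text)))

# e.g. scrub_keeping_tags('<P>Hello WORLD</P>', str.lower)
# returns '<P>hello world</P>' -- the tags keep their case.
```

Because the text is split only once and each segment is handled with built-in string operations, this avoids rescanning the whole passage character by character.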
Here is the commit: https://github.com/chantisnake/Lexos/commit/58d989e0618f8c0886a5ea5bd5ab5f2f4d37e66c I haven't done the refactoring and other documentation yet.
Here are some tests (timing the whole scrub function):

Gen AB (a file with tons of tags):
- the new function scrubs in 0.0969998836517 sec
- the old function scrubs in 0.163000106812 sec

Moby Dick (a large file):
- the new function scrubs in 0.344000101089 sec
- the old function scrubs in 0.534999847412 sec

All of which is significantly faster than the original. And there is not that much change in the new function...
I pushed a new change to handle lemmas and consolidations: https://github.com/chantisnake/Lexos/commit/c5cdc2d055a4bcb6275ef1c224d97265104adb19
See the comment I added there for why I did it that way.
This is fully fixed. We need a new function to lemmatize and consolidate tags next year.
Please give us some of the following information; that will be really helpful for tracking the bug:

**Which options did I check?**
Remove All Punctuation and Remove Tags (which probably should be called "handle tags")

**Which file did I use?**
Gen AB

**What does the bug look like?**
Remove All Punctuation removes punctuation within the tags, which damages the information in them (a minimal repro follows below):
- `</a>` turns into `<a>` after scrubbing
- `<a name='test'>` turns into `<a nametest>` after scrubbing
- `<tei.2>` turns into `<tei>` after scrubbing
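A minimal repro of that behaviour, assuming the old code effectively ran a single translate over the whole string with `<` and `>` excluded from the removal set (an assumption based on the output above, not the actual code):

```python
import string

# remove all punctuation except the angle brackets, plus all digits
remove = string.punctuation.replace('<', '').replace('>', '') + string.digits
table = str.maketrans('', '', remove)

print("</a>".translate(table))             # -> <a>
print("<a name='test'>".translate(table))  # -> <a nametest>
print("<tei.2>".translate(table))          # -> <tei>
```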