Closed: czhang03 closed this issue 8 years ago
This is gonna be a huge change if we really need to fix this... I will do this in my fork later.
I think "Scrub Tags" is the best name for the option.
In terms of the main issue, I'm not sure there is anything we can do about it. Whenever users use XML or SGML, they had better be familiar with the structural markup of their texts. We can provide warnings in ITM. Amongst these would be "don't use other scrubbing options". The proper way to use tag handling is to scrub the tags first and then perform regular scrubbing in a second pass. This is not a guarantee that any tags they leave will not be broken, but it's the workflow you have to use when dealing with marked-up texts.
We could have tag handling skip the other scrubbing functions by default, but I'm not sure it's worth it.
Oops! Crossing GitHub posts. Don't embark on anything major without reading my reply above.
I thought of a way to fix this without a major change to the code.
But I don't really understand your post above?
My way is to create a function that iterates through the passage and, at each character, detects whether we are inside a tag (this can easily be done with a variable that keeps track of whether we have encountered an equal number of `<` and `>` characters). If we are not in a tag, we do the replace; otherwise we just pass through.
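A minimal sketch of that idea in Python (illustrative names, not the actual Lexos code; a boolean flag plays the role of the `<`/`>` counter):

```python
import string

def scrub_char_by_char(text, remove=string.punctuation + string.digits):
    """Walk the text character by character, tracking whether we are inside
    a tag, and only scrub the characters that fall outside tags."""
    out = []
    in_tag = False
    for ch in text:
        if ch == '<':
            in_tag = True
        if in_tag or ch not in remove:
            out.append(ch)  # keep everything inside tags, scrub outside
        if ch == '>':
            in_tag = False
    return ''.join(out)
```

As the timings further down show, a pure-Python per-character loop like this turns out to be far too slow in practice.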
The best way to do that would be to use an xml parser. But my point is that it is easiest to advise the user to scrub in two steps. First, remove tags without any other scrubbing options. Then scrub again with options like removing punctuation and converting to lower case. They should really be separate processes. It's possible that the user could leave in markup that would be affected by the subsequent scrubbing, but it's unlikely if they are forewarned.
Separating the processes will not help. For example, say I have the text:

```xml
<tei.2>
<body name='chapter1'>
content
</body>
</tei.2>
```

We first handle the tags, choosing to keep all of them:

```xml
<tei.2>
<body name='chapter1'>
content
</body>
</tei.2>
```

and then when we remove punctuation and digits we get:

```xml
<tei>
<body namechapter>
content
<body>
<tei>
```

This is no longer valid XML; if the user wants to take it elsewhere, that will not be helpful.
And if people want to set the end of each `<chapter>` tag (that is, `</chapter>`) as a milestone in the Cutter or Rolling Windows, they cannot do that.
I tried my method, and it is too slow: it slows the process down about 70 times (the original Gen AB scrub takes 0.01 sec; this method takes 0.7 to 1.0 seconds).
I will try to think of another method.
I improved the function to be 10 times faster than the old version!!! It now handles Gen AB in 0.06 sec!
Cheers!
But there may be bugs = =, I will test this more thoroughly tonight.
The beauty of this method is that for a large file without tags, the speed is about the same as `string.transform`: Moby Dick takes only 0.167999982834 sec (`string.transform` uses 0.144000053406 sec). And it keeps all the punctuation and digits inside the tags for files that do have tags.
https://github.com/chantisnake/Lexos/commit/1aec74473b743ea48812c8395e7829736d63607c (this is on my fork, so if you guys decide that this is good, I can push it to master @scottkleinman @mleblanc321).
This is the commit that fixes this issue.
There is no major change to the code (only two lines in scrub.py and a small function in the general functions file), and it is fast and it works!
Well, that's promising. Did you use libxml to parse the text? I probably won't be able to pull until late tonight my time, but I'm looking forward to seeing what you've done.
Oops! That means go ahead and push it.
Yeah, in fact I did not use any parsing; it is just a simple regex.
I will push it now (it is just two lines, so if there is something wrong, you guys can always change it back).
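Something along these lines, presumably (an approximation in Python 3, not the exact committed code): split the text on a tag pattern and translate only the non-tag segments.

```python
import re
import string

TAG = re.compile(r'(<[^>]*>)')  # the capture group keeps the tags in the split output

def scrub_with_regex(text, remove=string.punctuation + string.digits):
    table = str.maketrans('', '', remove)
    parts = TAG.split(text)
    # even indices are plain text, odd indices are the captured tags
    return ''.join(p if i % 2 else p.translate(table)
                   for i, p in enumerate(parts))
```

With this, the punctuation and digits inside `<a name='test'>` survive, while the same characters in the running text are scrubbed.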
If we can, we should prevent tags from being made lower case. XML is case sensitive, so `<p>` and `<P>` are not the same thing.
Okay, I can add that; it is easy.
A comment on this: for now, removing punctuation and digits is handled, but what about lemmas and consolidations? How should they be handled?
For example, if I put in a lemma chapter -> milestone and we encounter a chapter tag, what should we do?

Also, tolower interacts with lemmas and consolidations: they are turned to lower case along with the passage. So if they put in a lemma Chapter -> Milestone, then if they check tolower, `<chapter>` will become `<milestone>` and `<Chapter>` will not (because the lemma is turned into lower case); but if they don't check tolower, `<Chapter>` will become `<Milestone>` and `<chapter>` will not.
If I understand you correctly, this is really a misuse of the lemma function, which is meant for consolidating variant tokens that should be counted as a single semantic unit. If I were doing this, I'd probably wish to use the consolidation function. An example might be to consolidate `<div1>`, `<div2>`, etc. as `<div>`. But even this is just using Lexos for search and replace. Somebody might try to do this. My first thought is caveat emptor, but I'll try to think of how we might handle it.

On the other hand, the example you give touches on a function I want to develop eventually. If you have structural tags like `<chapter>` or `<div type="chapter">`, it should be possible to convert those into milestones for use by Rolling Windows and other tools.
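Just to make the idea concrete, here is a hypothetical sketch of such a converter (nothing like this exists in Lexos yet; the function name and milestone string are placeholders, and attribute-based tags like `<div type="chapter">` would need extra handling):

```python
import re

def structural_tags_to_milestones(text, tag='chapter', milestone='#MILESTONE#'):
    """Replace each closing tag of a structural unit with a milestone string
    that Rolling Windows or the Cutter could split on, and drop the opening
    tags entirely."""
    text = re.sub(r'</%s\s*>' % re.escape(tag), ' ' + milestone + ' ', text)
    text = re.sub(r'<%s(\s[^>]*)?>' % re.escape(tag), '', text)
    return text
```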
cheng: this is a very nice fix; i haven't reviewed the code yet, but just a shout out for your very good work;
RE: lemma and consolidations -- hmm, i need to think about this; returning to Cheng's example:

> so if they put in a lemma Chapter -> Milestone, then if they check tolower, `<chapter>` will become `<milestone>` and `<Chapter>` will not (because the lemma is turned into lower case)

[actually, with lowercase ON, `<Chapter>` would have been converted to `<chapter>`, so both chapter tags will be converted to the milestone tag]

> but if they don't check tolower, `<Chapter>` will become `<Milestone>` but `<chapter>` will not

[which is the "correct" usage, that is, as they requested: use the upper-case lemma only]
Because reimplementing the function as a lowercase_exclude_tag variant would require a huge amount of duplicated code (I don't like duplicated code), and the current way we scrub is inefficient, I wrote a new function that keeps the case of the tags (and, of course, the whitespace, digits, and punctuation inside them) and is significantly faster than the original way we did it.
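In sketch form the idea is roughly this (illustrative names and Python 3 syntax, not the committed code):

```python
import re

TAG = re.compile(r'(<[^>]*>)')  # the capture group keeps the tags in the split output

def scrub_keeping_tags(text, transform):
    """Apply `transform` (e.g. str.lower, or a punctuation-stripping function)
    to the text between tags, leaving every tag byte-for-byte intact."""
    return ''.join(seg if i % 2 else transform(seg)
                   for i, seg in enumerate(TAG.split(text)))

# e.g. scrub_keeping_tags('<P>Hello WORLD</P>', str.lower)
# returns '<P>hello world</P>' -- the tags keep their case.
```

Because the text is split only once and each segment is handled with built-in string operations, this avoids rescanning the whole passage character by character.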
Here is the commit: https://github.com/chantisnake/Lexos/commit/58d989e0618f8c0886a5ea5bd5ab5f2f4d37e66c I haven't done the refactoring and other documentation yet.
Here are some tests (timing the whole scrub function):

Gen AB (a file with tons of tags):
- the new function scrubs in 0.0969998836517 sec
- the old function scrubs in 0.163000106812 sec

Moby Dick (a large file):
- the new function scrubs in 0.344000101089 sec
- the old function scrubs in 0.534999847412 sec

All of which is significantly faster than the original. And there is not that much change in the new function...
I pushed a new change to handle lemmas and consolidations: https://github.com/chantisnake/Lexos/commit/c5cdc2d055a4bcb6275ef1c224d97265104adb19
See the comment I added there for why I did it that way.
This is fully fixed. We need a new function to lemmatize and consolidate tags next year.
Please give us some of the following information; that will be really helpful for tracking the bug:

**Which options did I check?**
Remove All Punctuation and Remove Tags (which probably should be called "handle tags")

**Which file did I use?**
Gen AB

**What does the bug look like?**
Remove All Punctuation removes punctuation within the tags, which damages the information in them (a minimal repro follows below):
- `</a>` turns into `<a>` after scrubbing
- `<a name='test'>` turns into `<a nametest>` after scrubbing
- `<tei.2>` turns into `<tei>` after scrubbing
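A minimal repro of that behaviour, assuming the old code effectively ran a single translate over the whole string with `<` and `>` excluded from the removal set (an assumption based on the output above, not the actual code):

```python
import string

# remove all punctuation except the angle brackets, plus all digits
remove = string.punctuation.replace('<', '').replace('>', '') + string.digits
table = str.maketrans('', '', remove)

print("</a>".translate(table))             # -> <a>
print("<a name='test'>".translate(table))  # -> <a nametest>
print("<tei.2>".translate(table))          # -> <tei>
```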