Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Escape characters break MySQL ingest behavior #55

Closed organisciak closed 9 years ago

organisciak commented 9 years ago

When your wordlist includes words with '\', MySQL treats it as an escape character and gives you a warning. Missing those words isn't important, because in all likelihood they're bad OCR; the bigger concern is circumstances where the slash is followed by an n, t, single quote, double quote, etc., and MySQL recognizes it as valid input.

I'm currently running some processing but will add more information when I go back to recreating the warning. Assigning to myself.

Do we want to sanitize all words with '\'? Specifically, are there valid uses where we'd ever want special characters like newlines in words?

bmschmidt commented 9 years ago

Changing the title because I think the problem is broader: if I have an article whose title is "from \n to \r and back", the behavior may be very strange indeed.

Currently, for word ingest, the only affected word will be the single character \, but I can imagine wanting to extend to a tokenization that allows \n to be a token as well (for example, with a browser that has tokenization rules appropriate for historic source code).

Solution 1

One solution, as I glean it from here, is to alter all LOAD DATA LOCAL INFILE statements in the sourcebase to include the clause FIELDS ESCAPED BY ""
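As a sketch of what Solution 1 would look like, assuming the ingest code builds its LOAD DATA statements as strings (the function name, file path, and table name below are hypothetical, not the actual BookwormDB ones):

```python
def build_load_statement(infile, table, disable_escaping=True):
    """Build a LOAD DATA LOCAL INFILE statement for a tab-separated file."""
    # FIELDS ESCAPED BY '' tells MySQL to treat backslashes in the input
    # as ordinary bytes; the server default escape character is backslash.
    escaped_by = "''" if disable_escaping else "'\\\\'"
    return (
        "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
        "FIELDS TERMINATED BY '\\t' ESCAPED BY %s "
        "LINES TERMINATED BY '\\n'"
    ) % (infile, table, escaped_by)
```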

The docs warn against it because you might well have tabs, newlines, or null characters in your input. I think I might be OK with those rendering as their ASCII equivalents, because the docs explicitly say they shouldn't be in your JSON catalogs. But maybe there's a case for allowing them?

For output, if the FIELDS ESCAPED BY character is not empty, it is used to prefix the following characters on output:

- The FIELDS ESCAPED BY character

- The FIELDS [OPTIONALLY] ENCLOSED BY character

- The first character of the FIELDS TERMINATED BY and LINES TERMINATED BY values

- ASCII 0 (what is actually written following the escape character is ASCII "0", not a zero-valued byte)

If the FIELDS ESCAPED BY character is empty, no characters are escaped and NULL is output as NULL, not \N. It is probably not a good idea to specify an empty escape character, particularly if field values in your data contain any of the characters in the list just given.

Solution 2

Just manually escape every backslash to \\, so it will read in properly.

For the words field, this is far and away the easiest. I only shy away because, for other metadata fields, the overhead of a universal find-and-replace before writing isn't going to be zero, and might cause unanticipated problems elsewhere.
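As a sketch, Solution 2 is a one-liner on the writing side (the function name here is illustrative, not from the codebase):

```python
def escape_backslashes(token):
    # Double every backslash so that MySQL's LOAD DATA reads it back as a
    # literal backslash rather than the start of an escape sequence.
    return token.replace("\\", "\\\\")
```

Tokens without a backslash pass through unchanged, which is why a containment check before the replace is attractive.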

organisciak commented 9 years ago

In the initial issue I was going to make the same comment about Solution 2, that I don't want the overhead of find-replace on every single line. However, I ran a timeit for Python's character-in-string match and it's trivial. Running 10 million '\' in 'string' searches consistently takes ~2.3 seconds; run against randomized six-character strings it takes 2.1-3.5 seconds; and running on 10 million randomized 2-100 character strings takes 5.4s.

Running a check for backslashes and only escaping \ to \\ when necessary shouldn't add much overhead. Effectively, I expect it's the same as Solution 1 though.

These solutions are for supporting raw input, but that isn't the only possible approach. You can also:

What do you think? I'm leaning toward keeping input as is, implementing a check for unexpected backslashes (anything not in Table 9.1), and alerting people when their input doesn't seem properly escaped. This wouldn't warn somebody about "from \n to \r and back", but if they have a title like "From /usr/bin to C:\Program Files" and they missed the documentation about proper input, we can let them know.
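A rough sketch of that check (the set of recognized sequences below is my reading of MySQL's Table 9.1; the names are hypothetical):

```python
import re

# Escape sequences MySQL's parser recognizes (Table 9.1 in the manual):
# \0 \' \" \b \n \r \t \Z \\ \% \_
_VALID_ESCAPES = re.compile(r"\\[0'\"bnrtZ\\%_]")

def unexpected_backslashes(field):
    """Return the backslash sequences MySQL would NOT treat as valid escapes."""
    stripped = _VALID_ESCAPES.sub("", field)
    return re.findall(r"\\.?", stripped)
```

This flags "From /usr/bin to C:\Program Files" (the \P is not a recognized escape) but stays quiet on \n and \r, matching the behavior described above.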

bmschmidt commented 9 years ago

I realize I spoke too soon--actually, if you have a book whose title is "From /usr/bin to C:\Program Files," the only way to legitimately encode that as json for ingest is

In [6]: print json.dumps(r"From /usr/bin to C:\Program Files,")
"From /usr/bin to C:\\Program Files,"

which the current pipeline through MySQL ingest handles properly. I just tested on a fork of the federalist with author names like "\HAMILTON\". My Solution 1 above would break this behavior. I think it's best to just follow the JSON spec for when and how to handle control characters in metadata, which seems easy enough, and to scope this problem as being just about tokens for now.

So I think that this is specific to the tokens alone, and that we just need to str.replace("\\","\\\\") every str in the wordids. No need to check what the following character is, because the token \n should print literally, not as a newline. Any control characters besides newlines will display as tokens in the actual representation. (On a number of bookworms, this is actually already happening with the \f character--it populates straight through as a formfeed.) The only problem is that you can't have an actual newline (as opposed to the character sequence \n) be a token in a bookworm--but I'm OK with that.
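A sketch of that round trip, where mysql_unescape is my stand-in for what MySQL's default LOAD DATA escape handling does on read, not real library code:

```python
def escape_for_ingest(token):
    # The proposed fix: double every backslash before writing.
    return token.replace("\\", "\\\\")

def mysql_unescape(field):
    # Left-to-right scan, the way MySQL reads an escaped field: \\ becomes a
    # single backslash, \n a newline, and an unrecognized \x is read as x.
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            out.append({"n": "\n", "t": "\t", "\\": "\\"}.get(nxt, nxt))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)
```

With the escape in place, the two-character token \n survives ingest as-is; without it, MySQL would hand back an actual newline.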

I just pushed a patch to a new branch (https://github.com/Bookworm-project/BookwormDB/commit/abcd10908d150ad6fc75cdff7362621a6fed06f4), but won't merge it to master immediately b/c maybe I'm missing something?

organisciak commented 9 years ago

Let me time whether str.replace("\\", "\\\\") is comparable in performance to if "\\" in str: str.replace("\\", "\\\\"). You never know how these things are optimized; I know character operations are much quicker than string operations (e.g. the blazing fast speed of translate vs. replace).


organisciak commented 9 years ago

Okay, trivial difference (~400%; I was worried about orders of magnitude), so on that front the patch looks good to me.

import timeit
import string
import random

def rstring(n):
    return ''.join([rchar() for i in range(n)])

def rchar():
    # Weight letters heavily so backslashes (from string.punctuation) stay
    # rare, as they would be in real tokens.
    return random.choice(20 * string.ascii_uppercase + 50 * string.ascii_lowercase + string.punctuation)

def manystrings(n):
    return [rstring(random.choice(range(2, 25))) for i in range(n)]

prep = "from __main__ import manystrings; import random; s=manystrings(10000);"

# Baseline: just the cost of picking a random string.
just_rand = timeit.Timer("st = random.choice(s)", setup=prep)
# Unconditionally replace backslashes on every string.
always_replace = timeit.Timer("st = random.choice(s); st.replace('\\\\', '\\\\\\\\')", setup=prep)
# Check for a backslash first; replace only when one is present.
check_then_replace = timeit.Timer('''st = random.choice(s)
if '\\\\' in st:
    st.replace('\\\\', '\\\\\\\\')
''', setup=prep)

a = just_rand.timeit(1000000)
b = always_replace.timeit(1000000)
print b - a
>> 8.84245181084
c = check_then_replace.timeit(1000000)
print c - a
>> 2.15423703194

bmschmidt commented 9 years ago

Merging and closing issue.