Peratham / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

Escaping tokens containing pipe in text VectorStore #59

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Build an index over a Lucene index that has tokens containing the pipe ("|") 
character (or consisting of just the pipe itself).
2. Convert the VectorStore to text format. 
3. Try to read it with any Semantic Vectors class.

Tokens containing the pipe character should be escaped or quoted somehow in the 
text file, so they can be easily parsed later. Alternatively, the text-reading 
functions could be adapted to handle these cases.

I'm using Semantic Vectors version 3.2 on Mint Linux.

Original issue reported on code.google.com by erickrfo...@gmail.com on 12 Jul 2012 at 6:51

GoogleCodeExporter commented 9 years ago
You're right, this would happen with the text format for vector stores, since 
the pipe character is used as a delimiter.

Do you have a format in mind that would work? E.g., something that could 
simply, readably, and reliably deal with the line:

"<bra|ket>|1.0|0.0|0.0\n"

Original comment by dwidd...@gmail.com on 13 Jul 2012 at 7:00
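
To make the ambiguity concrete, here is a minimal sketch of a reader that splits each line on every pipe. It is not the actual Semantic Vectors reading code; the function name `parse_naive` and the assumption that the first field is the token and the remaining fields are coordinates are hypothetical, chosen only to illustrate the example line above.

```python
# The problem line from the comment above: the token itself contains a pipe.
line = "<bra|ket>|1.0|0.0|0.0\n"


def parse_naive(line):
    """Hypothetical reader: split on every pipe, first field is the token."""
    fields = line.rstrip("\n").split("|")
    token = fields[0]
    # The token "<bra|ket>" has been cut in two, so the second field is
    # "ket>" rather than a number, and float() raises ValueError on it.
    coords = [float(x) for x in fields[1:]]
    return token, coords


try:
    result = parse_naive(line)
except ValueError as err:
    result = None
    error = str(err)
```

A token that is a lone pipe, or one whose second half happens to look like a number, would fail differently: the line would parse without an error but yield the wrong token and a vector with one coordinate too many.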

GoogleCodeExporter commented 9 years ago
I mentioned quoting the tokens, but now I think putting the token and its 
vector on separate lines would be better. So you would have:

<bra|ket>\n
1.0|0.0|0.0\n

That way you don't have to worry about special treatment for quoting characters.

Original comment by erickrfo...@gmail.com on 13 Jul 2012 at 5:21
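
The two-line layout proposed above can be sketched as follows. This is only an illustration of the idea, not the format Semantic Vectors ended up with: the helper names `write_store` and `read_store` are hypothetical, and the sketch assumes the file has no header line and that tokens never contain a newline.

```python
import io


def write_store(out, vectors):
    """Write each entry as two lines: the raw token, then its coordinates."""
    for token, coords in vectors:
        out.write(token + "\n")
        out.write("|".join(repr(float(c)) for c in coords) + "\n")


def read_store(inp):
    """Read lines in pairs; only the vector line is ever split on pipes."""
    entries = []
    while True:
        token_line = inp.readline()
        if token_line == "":  # end of file
            break
        token = token_line.rstrip("\n")
        vector_line = inp.readline()
        if vector_line == "":
            raise ValueError("token %r has no vector line" % token)
        coords = [float(x) for x in vector_line.rstrip("\n").split("|")]
        entries.append((token, coords))
    return entries


# Round trip with a token containing a pipe and a token that is only a pipe.
vectors = [("<bra|ket>", [1.0, 0.0, 0.0]), ("|", [0.0, 1.0, 0.0])]
buf = io.StringIO()
write_store(buf, vectors)
buf.seek(0)
round_trip = read_store(buf)
```

Because the token line is never split, a pipe in a token needs no escaping. The cost is that a token containing a newline would now break the pairing, so the newline takes over the delimiter role the pipe used to play.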

GoogleCodeExporter commented 9 years ago
Good idea, I like that a lot. Much better than introducing a new escape 
character to deal with the escape character.

Let's consider this to be the plan of record, and I'll implement and test it in 
the next week or so. If your need is more urgent than that and you're willing 
to try coding it up, feel free to give it a try - but if so, please drop me a 
line first because I should outline some of the expected "gotcha" dependencies 
before wasting anyone's time.

Original comment by dwidd...@gmail.com on 13 Jul 2012 at 5:49