erip opened this issue 1 year ago
It seems like my `WSTok` is the issue and that it doesn't meet the expected interface (`__call__` should return `(tok, start, stop)`). If I use `tp.utils.SimpleTokenizer(pattern="\w+")`, it seems to be OK... this is somewhat unexpected, though, so maybe the documentation could be slightly improved.
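For reference, here is a minimal sketch of a whitespace tokenizer that does satisfy the `(tok, start, stop)` interface described above (not tested against tomotopy itself; the tuple convention is taken from the error report, and the class name is mine):

```python
import re

class SpanWSTok:
    """Whitespace tokenizer returning (token, start, stop) tuples.

    re.finditer reports each match's span, so character offsets into
    `raw` come for free.
    """
    def __call__(self, raw, **kwargs):
        return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", raw)]

print(SpanWSTok()("this is test"))
# [('this', 0, 4), ('is', 5, 7), ('test', 8, 12)]
```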
Hi @erip, could you share some lines of the file `10_line_pretokenized_corpus.tsv` so I can reproduce this? I can't reproduce a similar error with the sample text I have, so it is hard to determine the cause. If you can share a file where the problem occurs, it will be a great help in finding it.
@bab2min are you using the `WSTok` here? It should cause the error.
Oops, sorry @erip, I forgot this thread entirely.
Yes, I used `WSTok` and it worked well. Since I don't have `tm_model.bin` and `10_line_pretokenized_corpus.tsv`, I ran the code, modified like this:
```python
import tomotopy as tp

class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

docs = ["this is test text", "this is another text", "somewhat long text...."]
corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
corpus.process(doc for doc in docs)
for doc in corpus:
    print(doc)

# it will print
# <tomotopy.Document with words="this is test text">
# <tomotopy.Document with words="this is another text">
# <tomotopy.Document with words="somewhat long text....">
```
I suspect that some lines in `10_line_pretokenized_corpus.tsv` corrupt the inner C++ code.
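One way to test that suspicion (a generic debugging sketch, not part of tomotopy; the helper name and the child-process approach are mine) is to feed the corpus lines to a child interpreter one at a time, so a segfault in the C++ layer kills only the child and the offending line index survives:

```python
import subprocess
import sys

def find_bad_lines(lines, snippet):
    """Return 1-based indices of lines for which `snippet`, run in a child
    Python process with the line passed as sys.argv[1], exits non-zero.
    A segfault in a C extension surfaces as a negative return code, so the
    driver process itself survives and can keep scanning."""
    bad = []
    for i, line in enumerate(lines, 1):
        result = subprocess.run([sys.executable, "-c", snippet, line.rstrip("\n")])
        if result.returncode != 0:
            bad.append(i)
    return bad
```

With tomotopy, `snippet` would contain the `WSTok` + `Corpus.process` code from above, reading the single line from `sys.argv[1]`.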
I am migrating away from `model.make_doc` to `tp.utils.Corpus` and am finding that using `Corpus` segfaults. My tiny repro is here:

When I run this, I see:

Running this with `catchsegv` shows these relevant lines:

which seems to point here... maybe `d.get()` is null?
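If the null `d.get()` really does come from the tokenizer returning plain strings where `(tok, start, stop)` tuples are expected, a defensive wrapper on the Python side could normalize the output before it ever reaches the C++ layer (a sketch; `normalize` is a hypothetical helper, not part of tomotopy):

```python
def normalize(tokens, raw):
    """Turn a mixed list of plain-string tokens and (token, start, stop)
    tuples into tuples only. Offsets for plain strings are recovered by
    scanning `raw` left to right, so repeated tokens map to successive
    occurrences; existing tuples pass through unchanged."""
    out = []
    pos = 0
    for t in tokens:
        if isinstance(t, tuple):
            out.append(t)
        else:
            start = raw.index(t, pos)
            out.append((t, start, start + len(t)))
            pos = start + len(t)
    return out

print(normalize("a bb a".split(), "a bb a"))
# [('a', 0, 1), ('bb', 2, 4), ('a', 5, 6)]
```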