hugochan opened this issue 7 years ago
Check if it's related to string literals (tabs etc.). A tab in your content may break it if you don't escape it properly.
Hi zian92,
Can you elaborate your explanation with an example? Many of us would benefit from it.
Thanks, Bala
In my experience, DD is a little sloppy with encoding (this may be related to Python 2.7). E.g., the following UDF
```python
@tsv_extractor
@returns(lambda
        doc_id = "text",
    : [])
def extract(
        id = "text",
    ):
    yield "\t"
```
produces "ERROR: extra data after last expected column" as the string is not encoded properly and DD detects a 2nd column (which it doesn't expect). It took me some time to get this.
I am sure I have run into this problem too, but I am unable to reproduce it. @hugochan, can you identify the text that produces the error? And at which step?
Maybe I was a little wrong here. I have the following UDF:
```python
import sys

@tsv_extractor
@returns(lambda
        doc_id  = "text",
        feature = "text",
    : [])
def extract(
        doc_id  = "text",
        feature = "text",
        counter = "int",
    ):
    # (1)
    print sys.stderr, doc_id, feature, counter
    for _ in range(counter):
        yield [doc_id, feature]
    # (2)
    yield [doc_id, feature + " " + str(counter)]
```
If (1) is used (with (2) commented out), the UDF fails with: ERROR: missing data for column "feature"
If (2) is used, it works. I don't know why that is and don't see a difference.
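One guess, not verified: in variant (1) the print statement is missing the >> redirection, so it writes to stdout instead of stderr, and any stray stdout output gets mixed into the TSV rows DeepDive reads back, which could surface as a missing or extra column. A sketch of variant (1) with the redirection fixed, assuming the usual "from deepdive import *" import used by the tutorial UDFs:

```python
#!/usr/bin/env python
from deepdive import *   # assumed: the import used by the tutorial UDFs
import sys

@tsv_extractor
@returns(lambda
        doc_id  = "text",
        feature = "text",
    : [])
def extract(
        doc_id  = "text",
        feature = "text",
        counter = "int",
    ):
    # debug output goes to stderr; a plain "print sys.stderr, ..." would
    # write to stdout and corrupt the TSV stream that DeepDive parses
    print >> sys.stderr, doc_id, feature, counter
    for _ in range(counter):
        yield [doc_id, feature]
```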
Hi Zian92,
Thanks for your explanation.
I have one more issue, with Python encoding of a character with a BOM.
My Python code extracts the contents from the documents and writes them to a TSV file; at this stage everything goes fine.
But when I process the same (TSV) file with DeepDive, DeepDive picks up the character (1yQ11CQEAP1X) from the TSV file and it causes a failure in deepdive do sentences. I am not sure, though, that this special character is what causes the issue.
Could you please help me get rid of this issue?
@Balachandar-R: I don't see a relation to the original topic of this ticket ;)
It may be necessary to decode the rows from the database and to encode your results to be stored in the db. I am not familiar with BOM, but the web should provide an answer to your problem.
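In case it helps, here is a minimal sketch of what I mean; the UTF-8 codec, the "replace" error handling, and the BOM handling are assumptions on my part, since I don't know what your corpus actually contains:

```python
import codecs

def clean_text(raw):
    """Decode a raw byte string, drop a leading UTF-8 BOM if present,
    and re-encode it as UTF-8 before writing it to the TSV / the db."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8", "replace")   # decode what comes out of the file
    return text.encode("utf-8")             # encode again before storing
```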
Hi Zian92,
Thanks for your answers.
I get the following issue while running deepdive do sentences:
```
user@Azmachine:~/pedia$ deepdive do sentences
‘run/RUNNING’ -> ‘20170817/042914.419451517’
2017-08-17 04:29:14.710491 process/ext_sentences_by_nlp_markup/run.sh
unloading: 0:00:00 715KiB [2.29MiB/s] ([2.29MiB/s])
unloading: 0:00:00 2 [6.55 /s] ([6.55 /s])
loading dd_tmp_sentences: 1:33:29 277 B [50.6miB/s] ([ 0 B/s])
loading dd_tmp_sentences: 1:33:29 5 [ 891u/s] ([ 0 /s])
```
along with
```
2017-08-17 04:31:10.411673 Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ...
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000072b980000, 113770496, 0) failed; error='Cannot allocate memory' (errno=12)
```
I am using Deepdive 0.8 stable version.
Thanks in advance, Bala
@Balachandar-R can you try the following? In run.sh in udf/bazaar/parser, try changing the -Xmx4g to -Xmx2g. Basically, you are telling Stanford CoreNLP to use a maximum of 2 GB of RAM instead of 4. Maybe this will help fix your problem.
Hi,
I am doing the Adding NLP markups step on my own corpus, following the spouse example in the tutorials. However, after running the program for quite a while I got an error: missing data for column "sentence_index". I guess the DeepDive parser might have had trouble parsing one of the documents in my corpus, but I don't know the exact reason. I checked my corpus and found nothing special in it.
P.S. I have successfully run it on another corpus without hitting this issue.
Any help would be highly appreciated!