HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

ERROR: missing data for column "sentence_index" #647

Open hugochan opened 7 years ago

hugochan commented 7 years ago

Hi,

I am doing the "Adding NLP markups" step on my own corpus, following the spouse example in the tutorials. However, after running for quite a while I got an error: missing data for column "sentence_index". I guess the DeepDive parser might have had trouble parsing one of the documents in my corpus, but I don't know the exact reason. I checked my corpus and found nothing special in it.

P.S., I have successfully run it on another corpus without having this issue.

Any help would be highly appreciated!

zian92 commented 7 years ago

Check if it's related to string literals (tabs, etc.). A tab in your content may break it if you don't escape it properly.
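
For example, a quick way to check (my own sketch, not part of the original comment) is to scan the input tsv for rows whose tab-split field count is off; the expected_columns default of 2 is just an assumption based on the spouse example's articles table (doc_id, content):

import sys

def find_suspicious_rows(path, expected_columns=2):
    # flag any line whose tab-split field count differs from what the
    # table expects; such lines usually contain an unescaped tab
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != expected_columns:
                print >> sys.stderr, "line %d has %d fields" % (lineno, len(fields))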

Balachandar-R commented 6 years ago

Hi zian92,

Can you elaborate with an example? Many of us would benefit from it.

Thanks, Bala

zian92 commented 6 years ago

I have found DeepDive to be a little sloppy with encoding (maybe related to Python 2.7). E.g.

from deepdive import *

@tsv_extractor
@returns(lambda
        doc_id = "text",
    : [])
def extract(
        id = "text",
    ):
    # the unescaped tab in the yielded string is read as an extra column
    yield "\t"

produces "ERROR: extra data after last expected column" as the string is not encoded properly and DD detects a 2nd column (which it doesn't expect). It took me some time to get this.

I am sure I have run into this problem before but am unable to reproduce it. @hugochan, can you identify the text that produces the error? And at which step does it happen?

zian92 commented 6 years ago

Maybe I was a little wrong here. I have the following UDF:

from deepdive import *
import sys

@tsv_extractor
@returns(lambda
        doc_id  = "text",
        feature = "text",
    : [])
def extract(
        doc_id  = "text",
        feature = "text",
        counter = "int",
    ):
    # (1)
    print sys.stderr, doc_id, feature, counter
    for _ in range(counter):
        yield [doc_id, feature]
    # (2)
    yield [doc_id, feature + " " + str(counter)]

If (1) is used (with (2) commented out), the UDF fails with: ERROR: missing data for column "feature"

If (2) is used, it works. I don't know why that is and don't see a difference.
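
A likely explanation (my own note, not something confirmed in this thread): in Python 2, print sys.stderr, doc_id, feature, counter does not write to stderr at all; it prints the repr of the sys.stderr object followed by the values to stdout. Because a @tsv_extractor UDF emits its result rows on stdout as TSV, that stray debug line is read back as a malformed row, which would produce exactly this "missing data for column" error. Writing debug output to stderr explicitly should avoid it; a minimal helper sketch (debug is a hypothetical name, not a DeepDive API):

import sys

def debug(*values):
    # write debug output to stderr so it never mixes with the TSV rows
    # that DeepDive reads from the extractor's stdout
    sys.stderr.write(" ".join(str(v) for v in values) + "\n")

Calling debug(doc_id, feature, counter) at (1) instead of the print statement should keep stdout clean.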

Balachandar-R commented 6 years ago

Hi Zian92,

Thanks for your explanation.

I have one more issue, with Python encoding of a character with a BOM (byte order mark).

I have Python code that extracts the contents from the documents and writes them to a tsv file, and at this stage everything goes fine.

But while I am processing the same (tsv) file with DeepDive, DeepDive picks up the character (1yQ11CQEAP1X) from the tsv file and it causes a failure in deepdive do sentences. I am not sure, though, whether this special character is what causes the issue.

Could you please help me to get rid of this issue?

zian92 commented 6 years ago

@Balachandar-R: I don't see a relation to the original topic of this ticket ;)

It may be necessary to decode the rows coming from the database and to encode your results before storing them in the db. I am not familiar with BOMs, but the web should provide an answer to your problem.
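
For the BOM case specifically, here is a minimal sketch (my addition; strip_utf8_bom is a hypothetical helper, and UTF-8 input is an assumption) of removing a UTF-8 byte order mark from raw bytes before writing them to the tsv file:

import codecs

def strip_utf8_bom(raw):
    # codecs.BOM_UTF8 is the three-byte marker '\xef\xbb\xbf' that some
    # editors prepend to UTF-8 files; drop it if present, then decode
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")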

Balachandar-R commented 6 years ago

Hi Zian92,

Thanks for your answers.

I get the following issue while running "deepdive do sentences":

user@Azmachine:~/pedia$ deepdive do sentences
'run/RUNNING' -> '20170817/042914.419451517'
2017-08-17 04:29:14.710491 process/ext_sentences_by_nlp_markup/run.sh
unloading: 0:00:00 715KiB [2.29MiB/s] ([2.29MiB/s])
unloading: 0:00:00 2 [6.55 /s] ([6.55 /s])
loading dd_tmp_sentences: 1:33:29 277 B [50.6miB/s] ([ 0 B/s])
loading dd_tmp_sentences: 1:33:29 5 [ 891u/s] ([ 0 /s])

along with

2017-08-17 04:31:10.411673 Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ...
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000072b980000, 113770496, 0) failed; error='Cannot allocate memory' (errno=12)

I am using Deepdive 0.8 stable version.

Thanks in advance, Bala

MahmoudYounes commented 6 years ago

@Balachandar-R Can you try the following? In run.sh under udf/bazaar/parser, try changing -Xmx4g to -Xmx2g. Basically, you are telling Stanford CoreNLP to use a maximum of 2 GB of RAM instead of 4. Maybe this will fix your problem.