HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

"deepdive do sentences" - Taking more time without creating any rows in the sentences table #655

Open Balachandar-R opened 6 years ago

Balachandar-R commented 6 years ago

Hi Team,

I got the following final output when I manually stopped the process with Ctrl+C:

2017-08-21 13:26:34.376376 ++ dirname process/ext_sentences_by_nlp_markup/run.sh
2017-08-21 13:26:34.376384 + cd process/ext_sentences_by_nlp_markup
2017-08-21 13:26:34.376392 + : ddtmp ddold
2017-08-21 13:26:34.376400 + export DEEPDIVE_CURRENT_PROCESS_NAME=process/ext_sentences_by_nlp_markup
2017-08-21 13:26:34.376407 + DEEPDIVE_CURRENT_PROCESS_NAME=process/ext_sentences_by_nlp_markup
2017-08-21 13:26:34.376415 + export DEEPDIVE_LOAD_FORMAT=tsj
2017-08-21 13:26:34.376423 + DEEPDIVE_LOAD_FORMAT=tsj
2017-08-21 13:26:34.376431 + output_relation=sentences
2017-08-21 13:26:34.376439 + output_relation_tmp=dd_tmp_sentences
2017-08-21 13:26:34.376447 + output_relation_old=dd_old_sentences
2017-08-21 13:26:34.376455 + deepdive create table-if-not-exists sentences
2017-08-21 13:26:35.423450 + deepdive create table dd_tmp_sentences like sentences
2017-08-21 13:26:38.080108 CREATE TABLE
2017-08-21 13:26:38.082247 + deepdive compute execute 'input_sql=
2017-08-21 13:26:38.082294
2017-08-21 13:26:38.082306 SELECT R0.id AS column_0
2017-08-21 13:26:38.082315 , R0.content AS column_1
2017-08-21 13:26:38.082323 FROM articles R0
2017-08-21 13:26:38.082331
2017-08-21 13:26:38.082339 ' 'command=cd "$DEEPDIVE_APP" && udf/nlp_markup.sh' output_relation=dd_tmp_sentences
2017-08-21 13:26:38.478602 Executing with the following configuration:
2017-08-21 13:26:38.478671 DEEPDIVE_NUM_PROCESSES=7
2017-08-21 13:26:38.478688 DEEPDIVE_NUM_PARALLEL_UNLOADS=1
2017-08-21 13:26:38.478703 DEEPDIVE_NUM_PARALLEL_LOADS=1
2017-08-21 13:26:38.478713 DEEPDIVE_NAMED_PIPES_DIR=/home/XXXXX/deepdive/examples/spouse/run/process/ext_sentences_by_nlp_markup
2017-08-21 13:26:38.820593 unloading to /home/XXXXX/deepdive/examples/spouse/run/process/ext_sentences_by_nlp_markup/deepdive-compute-execute.KD5A3y4/feed_processes-1: '
2017-08-21 13:26:38.820675
2017-08-21 13:26:38.820688 SELECT R0.id AS column_0
2017-08-21 13:26:38.820696 , R0.content AS column_1
2017-08-21 13:26:38.820704 FROM articles R0
2017-08-21 13:26:38.820712
2017-08-21 13:26:38.820720 '
2017-08-21 13:26:39.986708 Loading dd_tmp_sentences from /home/XXXXX/deepdive/examples/spouse/run/process/ext_sentences_by_nlp_markup/deepdive-compute-execute.KD5A3y4/output_computed-1 (tsj format)

Any suggestions?

Thanks, Bala

mcavdar commented 6 years ago

Hi, it's normal for this step to take much longer than the others. Did it create the "dd_tmp_sentences" table or not? And which dataset are you using? I mean, how big is it: number of sentences, etc.?
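You can check this from your app directory, for example (a quick sketch; it assumes the default Postgres setup of the spouse example):

# count the rows loaded into the temporary table so far
deepdive sql "SELECT COUNT(*) FROM dd_tmp_sentences;"

# and the final table, once the step has completed
deepdive sql "SELECT COUNT(*) FROM sentences;"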

Balachandar-R commented 6 years ago

Hi mcavdar,

Thanks for your reply.

I have extracted the contents of various documents from a repository (nearly 700 documents: PPT, PDF, DOCX, and web pages). I have created an article.tsv file that holds all the contents along with a content_id.

I can execute deepdive do articles, which created a table with all of this content.

But when I execute the deepdive do sentences command, which should create the dd_tmp_sentences and sentences tables, the CoreNLP processes start and then finish with an error. I am using DeepDive version 0.8. I have pasted my logs here for your reference.

The last few lines, for your reference:

2017-08-24 04:29:04.853776 INFO: Ignoring inactive rule: null
2017-08-24 04:29:04.854532 Aug 24, 2017 4:29:04 AM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
2017-08-24 04:29:04.854563 INFO: Ignoring inactive rule: temporal-composite-8:ranges
2017-08-24 04:29:04.854789 Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
loading dd_tmp_sentences: 0:01:06 1 [15.1m/s] ([ 0 /s])
2017-08-24 04:29:40.044975 Loading parser from serial
loading dd_tmp_sentences: 0:01:07 1 [14.9m/s] ([ 0 /s])
2017-08-24 04:29:40.415900 Warning: skipped malformed
loading dd_tmp_sentences: 0:01:07 1 [14.9m/s] ([14.9m/s])B/s])ns, this is my first sentence"}
2017-08-24 04:29:40.458075 ERROR: missing data for column "sentence_index"
2017-08-24 04:29:40.458141 CONTEXT: COPY dd_tmp_sentences, line 1: "#"
2017-08-24 04:29:40.460150 /home/u5/local/util/compute-driver/local/compute-execute: line 140: kill: (50040) - No such process
2017-08-24 04:29:40.460183 /home/u5/local/util/compute-driver/local/compute-execute: line 140: kill: (50044) - No such process
2017-08-24 04:29:40.460196 /home/u5/local/util/compute-driver/local/compute-execute: line 140: kill: (50045) - No such process
2017-08-24 04:29:40.460204 /home/u5/local/util/compute-driver/local/compute-execute: line 140: kill: (50051) - No such process
2017-08-24 04:29:40.460213 /home/u5/local/util/compute-driver/local/compute-execute: line 140: kill: (50052) - No such process
'run/ABORTED' -> '20170824/042830.108135291'

I also want to know the actual content of udf/nlp_markup.sh.

Please help me resolve it.

Thanks, Bala

mcavdar commented 6 years ago

Hi @Balachandar-R,

2017-08-24 04:29:40.458075 ERROR: missing data for column "sentence_index"

According to this, DeepDive isn't getting a clean result/output from CoreNLP.

I suggest running this command from the terminal while your CoreNLP server is running:

wget --post-data "Hello friend!" 'http://localhost:24688/?properties={"annotators": "tokenize, ssplit, pos, lemma,ner","outputFormat": "json"}' -O -

The expected output is something like this, with sentences and index tags:

{"sentences":[{"index":0,"tokens":[{"index":1,"word":"Hello","originalText":"Hello","lemma":"hello","characterOffsetBegin":0,"characterOffsetEnd":5,"pos":"PROPN","ner":"O"},{"index":2,"word":"friend","originalText":"friend","lemma":"friend","characterOffsetBegin":6,"characterOffsetEnd":12,"pos":"VERB","ner":"O"},{"index":3,"word":"!","originalText":"!","lemma":"!","characterO-

Is it the same for you? And are you sure that DeepDive created the articles table and loaded the contents properly?
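You can verify that too, for example (again a sketch, assuming Postgres and the spouse-example schema with id and content columns):

# row count and a peek at the first few articles
deepdive sql "SELECT COUNT(*) FROM articles;"
deepdive sql "SELECT id, LEFT(content, 80) FROM articles LIMIT 3;"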

mcavdar commented 6 years ago

To start CoreNLP, run:

export CORENLP_JAVAOPTS=-Xmx4g
deepdive corenlp start

If you haven't installed CoreNLP yet, install it before starting: deepdive corenlp install

Balachandar-R commented 6 years ago

Hi @mcavdar,

Thanks for your responses.

Since I am using DeepDive 0.8, the Bazaar parser does the document parsing, so I could not start CoreNLP as a service in this version. After exporting the variable DEEPDIVE_NUM_PROCESSES=1, the parser ran successfully on all the documents without any errors. The commands I used are sketched below.
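For anyone hitting the same problem, what I ran from the app directory was roughly this (rerunning the step after forcing a single worker process):

# force a single extractor process, then rerun the sentences step
export DEEPDIVE_NUM_PROCESSES=1
deepdive redo sentences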

Thanks @mcavdar

Balachandar-R commented 6 years ago

@mcavdar

I need to know how to customize the NER to identify certain DOMAINS in the given document content (for example, whether my docs contain any keywords like Banking, Healthcare, Education, etc.).

I have followed the official CoreNLP site to test a customized NER in the parser by coding. But I don't know where exactly we need to plug this into DeepDive.

It would be great if you could help me out here.

Thanks, Balachandar

mcavdar commented 6 years ago

Hi @Balachandar-R,

I tried to run the spouse example on DeepDive 0.8. It actually uses an old Shift-Reduce Constituency Parser model (srparser-2014-10-23-models), so I'm not sure, but it doesn't seem to find the NER tags.

I strongly suggest you update DeepDive. Then you can use CoreNLP either with the NER functionality or with the RegexNER functionality. See the CoreNLP documentation for training your own NER model, and also see the RegexNER example. If you are only interested in some predefined keywords, RegexNER is the easy way to do it; a sketch follows.
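As a rough sketch (the file name domains.txt and the DOMAIN label are placeholders, and the annotator properties may vary between CoreNLP versions, so check the docs), the idea is a tab-separated mapping file fed to the regexner annotator:

# create a tab-separated mapping file: pattern <TAB> NER class
printf 'Banking\tDOMAIN\nHealthcare\tDOMAIN\nEducation\tDOMAIN\n' > domains.txt

# run CoreNLP with regexner pointed at the mapping
java -cp "stanford-corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -annotators tokenize,ssplit,pos,lemma,ner,regexner \
  -regexner.mapping domains.txt \
  -file input.txt -outputFormat json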

Balachandar-R commented 6 years ago

Hi @mcavdar,

Thanks, man. I have updated DeepDive and now I can create the sentences table as well.

But I have one more question.

1. I have extracted the content from various repositories and I am writing it into a single .tsv file using Python. But the file format of this .tsv file is

UTF-8 Unicode (with BOM) text, with very long lines

But the file format of the files in the spouse example is something like

UTF-8 Unicode text, with very long lines

And the Postgres database also uses the UTF8 encoding.

How does this look from your perspective? Any suggestions? A possible workaround I'm considering is sketched below.
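What I am considering (untested; article.tsv is my file from above) is stripping the leading UTF-8 byte order mark before loading, for example with GNU sed:

# remove a UTF-8 BOM (bytes EF BB BF) from the start of the file, in place
sed -i '1s/^\xEF\xBB\xBF//' article.tsv

Alternatively, the BOM can be avoided at the source by opening the output file in Python with the plain "utf-8" encoding rather than "utf-8-sig".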

Thanks, Balachandar

romizc commented 6 years ago

Good morning, I'm having the same issue. I've downloaded the stable version of DeepDive and the spouse example. Unfortunately I can't get past the nlp_markup process; it takes too much time and memory, and the process is aborted by DeepDive itself. I've tried to update my version, because I do not have the deepdive corenlp commands, with no luck. Would you be so kind as to tell me exactly how to update DeepDive, please? Thanks in advance. My DeepDive version output is:

deepdive v0.8.0-79-g28a58de (Linux x86_64)
Information on this build of deepdive follows.

mcavdar commented 6 years ago

To build and install the latest version from source code, see: link. If you want to install from a specific commit, run git checkout <commit_hash> after git clone https://github.com/HazyResearch/deepdive.git. A sketch of the full sequence is below.
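Roughly (a sketch; the make install target and PREFIX come from DeepDive's developer documentation, so double-check there for your version):

# get the source
git clone https://github.com/HazyResearch/deepdive.git
cd deepdive

# optionally pin a specific commit
git checkout <commit_hash>

# build and install under ~/local, then put it on PATH
make install PREFIX=~/local
export PATH=~/local/bin:"$PATH"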