HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

Sentences table schema doesn't match #138

Closed senwu closed 9 years ago

senwu commented 9 years ago

I am a little confused about the schema of the sentences table.

In the tutorial, Step 3 says "Create the input data schema by running: ./util/create_input_schema.sh (NOTE that this will drop any input data already loaded into e.g. sentences or sentences_input tables)", and the schema is:

CREATE TABLE sentences (
  doc_id text,
  section_id text,
  sent_id int,
  ref_doc_id text,
  words text[],
  lemmas text[],
  poses text[],
  ners text[],
  dep_paths text[],
  dep_parents int[]
) DISTRIBUTED BY (doc_id, section_id);

But in Step 5, the schema for the sentences table is:

CREATE TABLE ${TABLE} (
  doc_id TEXT,
  section_id TEXT,
  ref_doc_id TEXT,
  sent_id TEXT,
  words TEXT[],
  lemmas TEXT[],
  poses TEXT[],
  ners TEXT[],
  dep_paths TEXT[],
  dep_parents INT[]
)

They are not the same: sent_id is int in one and TEXT in the other, and the ref_doc_id and sent_id columns come in a different order. Can you clarify this?

In application.conf, we create sentences_input from sentences, so we don't need to define its schema in Step 3, right?
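For illustration, I'd expect the derivation to be something roughly like this (just my guess; the actual statement is whatever application.conf runs, and the column list here is copied from the Step 3 schema above):

-- Hypothetical sketch only: the real derivation lives in application.conf.
-- Column list guessed from the Step 3 sentences schema above.
DROP TABLE IF EXISTS sentences_input;
CREATE TABLE sentences_input AS
SELECT doc_id, section_id, sent_id,
       words, lemmas, poses, ners, dep_paths, dep_parents
FROM sentences;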

Colossus commented 9 years ago

EDIT: PLEASE IGNORE THIS COMMENT; SEE THE COMMENT BELOW

Hey Sen,

please disregard the create_input_schema.sh and create_schema.sh scripts --- these were created for a few less technically informed members a long while ago. Please use psql -f util/schema.sql and psql -f util/input_schema.sql to create the schema and input schema. (WARNING: THESE SCRIPTS AUTOMATICALLY DROP THE TABLES BEFORE THEY ARE CREATED AGAIN. IF YOU JUST WANT ONE PARTICULAR TABLE, COPY-PASTE IT MANUALLY.)
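For instance, if you only want the sentences table recreated, the piece to copy-paste would look roughly like this (an assumed excerpt of util/schema.sql; the column list matches the Step 3 schema quoted above, and the DROP is the destructive part):

-- Assumed excerpt for just the sentences table; note the destructive DROP.
DROP TABLE IF EXISTS sentences CASCADE;
CREATE TABLE sentences (
  doc_id text,
  section_id text,
  sent_id int,
  ref_doc_id text,
  words text[],
  lemmas text[],
  poses text[],
  ners text[],
  dep_paths text[],
  dep_parents int[]
) DISTRIBUTED BY (doc_id, section_id);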

I'm going to remove the shell scripts.

Johannes


Colossus commented 9 years ago

OK, actually I'm confused. Now I see where you got the second schema from: the load_sentences script apparently hasn't been updated. Usually only the first person after parsing uses the last command from that script to fill the sentences table; everybody afterwards copies from that person's table.

If you want the full dataset, please do

psql -p 6432 -U senwu -h raiders2 -f /dfs/scratch0/jbirgmei/huge_sentences_input.sql <yourdatabasename>

ATTENTION: This will drop sentences_input in your current database.

This will save you a huge amount of trouble. The raw input data currently contains duplicate pubmed IDs and other crap. Please use the command in this comment to load your sentences_input table.
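(If you ever do load from the raw input anyway, a duplicate check roughly like this is worth running first; the table and column names are just taken from the schemas above, and I'm assuming doc_id holds the pubmed ID:)

-- Hypothetical sanity check for duplicated pubmed IDs in the raw load.
SELECT doc_id, section_id, sent_id, COUNT(*) AS n
FROM sentences_input
GROUP BY doc_id, section_id, sent_id
HAVING COUNT(*) > 1;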

DO NOT run ./run.sh preprocess. This will delete sentences_input and attempt to refill it from sentences, which takes a long time and only the first person after parsing needs to do it.

Colossus commented 9 years ago

BTW, I'm still in the process of copying huge_sentences_input.sql to my DFS scratch directory, so please don't run it right away.

Colossus commented 9 years ago

OK, copying is done. Please go ahead now with:

psql -p 6432 -U senwu -h raiders2 -f /dfs/scratch0/jbirgmei/huge_sentences_input.sql <yourdatabasename>
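(Just as an illustrative sanity check that the restore populated the table; I'm not claiming a specific expected count:)

-- Quick check that the restore populated sentences_input.
SELECT COUNT(*) FROM sentences_input;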

ajratner commented 9 years ago

See the scripts in the parser directory (see parser/README) for the latest / most accurate way to load the sentences table.

I guess we should delete the step in the main documentation that talks about the input schema, sorry about that!
