Closed senwu closed 9 years ago
EDIT: PLEASE IGNORE THIS COMMENT; SEE THE COMMENT BELOW.
Hey Sen,
please disregard the create_input_schema.sh and create_schema.sh scripts --- these were created for a few less technically informed members a long while ago. Please use psql -f util/schema.sql and psql -f util/input_schema.sql to create the schema and input schema. (WARNING: THESE SCRIPTS AUTOMATICALLY DROP THE TABLES BEFORE THEY ARE CREATED AGAIN. IF YOU JUST WANT ONE PARTICULAR TABLE, COPY-PASTE ITS STATEMENTS MANUALLY.)
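For the copy-paste-one-table route, a minimal sketch of what that looks like; the column list is the one given in the tutorial's Step 3, and the leading DROP is an assumption about how schema.sql is laid out:

```sql
-- Sketch: recreate only the sentences table by copy-pasting its
-- statements from util/schema.sql (columns per the tutorial's Step 3).
DROP TABLE IF EXISTS sentences;
CREATE TABLE sentences (
  doc_id text,
  section_id text,
  sent_id int,
  ref_doc_id text,
  words text[],
  lemmas text[],
  poses text[],
  ners text[],
  dep_paths text[],
  dep_parents int[]
) DISTRIBUTED BY (doc_id, section_id);
```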
I'm going to remove the shell scripts.
Johannes
On 8/22/15 1:39 PM, SenWu wrote:
I am a little confused about the schema of the sentences table.
In the tutorial, Step 3 says "Create the input data schema by running: ./util/create_input_schema.sh (NOTE that this will drop any input data already loaded into e.g. sentences or sentences_input tables)", and the schema is CREATE TABLE sentences ( doc_id text, section_id text, sent_id int, ref_doc_id text, words text[], lemmas text[], poses text[], ners text[], dep_paths text[], dep_parents int[] ) DISTRIBUTED BY (doc_id, section_id);
But in Step 5, the schema for sentences table is: CREATE TABLE ${TABLE} ( doc_id TEXT, section_id TEXT, ref_doc_id TEXT, sent_id TEXT, words TEXT[], lemmas TEXT[], poses TEXT[], ners TEXT[], dep_paths TEXT[], dep_parents INT[] )
They are not the same; can you clarify?
In the application.conf, we create sentences_input from sentences. So we don't need to define its schema in Step 3, right?
OK, actually I'm confused. Now I see where you got the second schema from. The load_sentences script apparently hasn't been updated. Usually, only the first person after parsing uses the last command from that script to fill the sentences table; everybody afterwards copies from that person's table.
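For the "everybody afterwards copies" path, a hypothetical sketch of what that copy looks like; the source table name here is a placeholder, not an actual project object:

```sql
-- Sketch: instead of re-running the parser's load step, copy an
-- already-filled sentences table from a colleague's schema.
-- someone_elses_schema is a placeholder name.
CREATE TABLE sentences AS
SELECT * FROM someone_elses_schema.sentences
DISTRIBUTED BY (doc_id, section_id);
```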
If you want the full dataset, please do
psql -p 6432 -U senwu -h raiders2 -f /dfs/scratch0/jbirgmei/huge_sentences_input.sql <yourdatabasename>
ATTENTION: This will drop sentences_input in your current database.
This will save you a huge amount of trouble. The raw input data currently contains duplicate pubmed IDs and other crap. Please use the command in this comment to load your sentences_input table.
DO NOT run ./run.sh preprocess. This will delete sentences_input and attempt to refill it from sentences, which takes a long time; only the first person after parsing needs to do it.
BTW, I'm still in the process of copying huge_sentences_input.sql to my DFS scratch directory, so please don't do it right away.
OK, copying is done, please go ahead now with psql -p 6432 -U senwu -h raiders2 -f /dfs/scratch0/jbirgmei/huge_sentences_input.sql <yourdatabasename>
See the scripts in parser (see parser/README) for the latest / most accurate way to load the sentences table.
I guess we should delete the step in the main documentation that talks about the input schema, sorry about that!