MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.
GNU General Public License v3.0

Importing the data to the database #219

Closed pfent closed 7 years ago

pfent commented 7 years ago

Hey @MusicConnectionMachine/group-2,

In your run, you just dumped the data as JSON. So @kordianbruck and I started to bulk load the data into the DB. We first loaded everything into a single table, import.final, with a single data column. From there we wrote some teeny tiny SQL, exploiting Postgres' JSON capabilities 🎉.
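
For reference, the staging side is just that one table with one JSON column that the dump gets bulk loaded into. A rough sketch of what that looks like, assuming a JSONB column and a newline-delimited dump file (the import.final name and data column are the ones used below; the file name and the \copy invocation are only illustrative):

-- Staging area: one row per dumped record, everything in a single JSON column
CREATE SCHEMA IF NOT EXISTS import;
CREATE TABLE import.final (
    data JSONB NOT NULL
);

-- Bulk load the dump, one JSON object per line (file name is hypothetical;
-- assumes the JSON contains no backslashes or tabs that COPY's text format would reinterpret)
\copy import.final (data) FROM 'dump.ndjson'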

Maybe you want to have this somewhere under version control…

-- Drop everything that references old data
TRUNCATE TABLE websites, contains CASCADE;

-- Add id (gen_random_uuid() needs the pgcrypto extension on Postgres older than 13)
ALTER TABLE import.final
ADD COLUMN id TEXT NOT NULL DEFAULT gen_random_uuid()::TEXT;

-- populate websites table
INSERT INTO websites
    SELECT CAST(id AS UUID), 
           data->>'url' AS url,
           data->>'bloburl' AS blob_url,
           current_timestamp AS createdAt,
           current_timestamp AS updatedAt
    FROM import.final;

-- populate contains table
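-- Each element of data->'occurrences' is expected to look roughly like this
-- (shape inferred from the accessors below; the exact field set is an assumption):
--   {"term": {"value": "...", "entityId": "<uuid>"}, "positions": [...]}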
WITH occurences AS (
    SELECT id, data->>'occurrences' AS occ
    FROM import.final
), elems AS (
    SELECT id, elems
    FROM occurences, json_array_elements(CAST(occ AS json)) AS elems
)
INSERT INTO contains
    SELECT gen_random_uuid() AS id,
           json_build_object(
                'term', e.elems->'term'->'value', 
                'positions', e.elems->'positions'
           ) AS occurences,
           current_timestamp AS createdAt,
           current_timestamp AS updatedAt,
           CAST(e.id AS UUID) AS websiteId,
           CAST(e.elems->'term'->>'entityId' AS UUID) AS entityId
    FROM elems e;
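
If you want a quick sanity check after running this, comparing row counts is the simplest test. A minimal query, using only the table names from above (websites should match import.final one to one; contains will usually be larger since it gets one row per occurrence element):

SELECT (SELECT count(*) FROM import.final) AS staged_rows,
       (SELECT count(*) FROM websites)     AS website_rows,
       (SELECT count(*) FROM contains)     AS contains_rows;
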
felixschorer commented 7 years ago

I could probably add it to the README

felixschorer commented 7 years ago

Added in #220

kordianbruck commented 7 years ago

Great! Done.