JuliaDatabases / PostgreSQL.jl

DEPRECATED: use LibPQ.jl instead
https://github.com/invenia/LibPQ.jl

Updates are slower #25

Closed · bmharsha closed this 8 years ago

bmharsha commented 8 years ago

This is just an FYI.

I executed 100,000 UPDATE statements on a table through the PostgreSQL.jl interface and tracked the performance with the @time macro. Here is its output:

1543.098305791 seconds (25583704 bytes allocated)

All other operations (INSERT, SELECT) on the same dataset were much faster; only the UPDATE was unusually slow.
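
A minimal sketch of how such a loop can be timed with PostgreSQL.jl, assuming its DBI-style connect/prepare/execute API; the connection parameters and the update_statements vector are placeholders:

using DBI
using PostgreSQL

# Placeholder connection parameters; adjust for your own server.
conn = connect(Postgres, "localhost", "user", "password", "testdb", 5432)

# update_statements is a placeholder Vector{String} of UPDATE queries
# (a generator script appears later in the thread).
@time for sql in update_statements
    stmt = prepare(conn, sql)
    execute(stmt)
    finish(stmt)
end

disconnect(conn)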

The structure of the table is as follows:

Column1 bigserial,
Column2 VARCHAR(4000),
Column3 decimal,
Column4 DATE,
Column5 smallint,
Column6 VARCHAR(20) CHECK( JobType IN ('HR', 'Management', 'Accounts')),
Column7 integer,
Column8 bigint,
Column9 bigint,
Column10 FLOAT(25),
Column11 FLOAT(52),
Column12 CHAR,
Column13 integer

I can share the dataset, if required.

iamed2 commented 8 years ago

It would help if you shared the query. Could you run the same query using a different interface (e.g. Python's psycopg2) and compare the relative times?

I would expect an update to be slower than an insert or select, but not too much slower unless the UPDATE conditions are complicated.

BTW, FLOAT(25) and FLOAT(52) both just alias to double precision in PostgreSQL.
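
This aliasing is easy to verify from Julia with the standard pg_typeof function; a sketch reusing the connection from the loop above:

# Both results report 'double precision': PostgreSQL maps FLOAT(p)
# with 25 <= p <= 53 onto double precision (and p <= 24 onto real).
stmt = prepare(conn, "SELECT pg_typeof(1::float(25)), pg_typeof(1::float(52))")
result = execute(stmt)
finish(stmt)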

bmharsha commented 8 years ago

Thanks for the double precision tip. Here is the code I used to generate the queries; it produces a CREATE TABLE statement, 100,000 INSERT statements, and 100,000 UPDATE statements.

using Faker
using JLD, HDF5
using Random  # randstring is provided by the Random stdlib on Julia 1.x
number_of_datasets = 100000

function create_queries(number_of_datasets=100001)
  # TINYINT
  tint = [rand(-128:127) for _ in 1:number_of_datasets]
  # SMALLINT
  sint = [rand(-32768:32767) for _ in 1:number_of_datasets]
  # MEDIUMINT
  mint = [rand(-8388608:8388607) for _ in 1:number_of_datasets]
  # INT and INTEGER
  rint = [rand(-2147483648:2147483647) for _ in 1:number_of_datasets]
  # BIGINT
  bint = [rand(-9223372036854775808:9223372036854775807) for _ in 1:number_of_datasets]
  # FLOAT
  rfloat = [rand(Float16) for _ in 1:number_of_datasets]
  # DOUBLE
  dfloat = [rand(Float32) for _ in 1:number_of_datasets]
  # DOUBLE PRECISION, REAL, DECIMAL
  dpfloat = [rand(Float64) for _ in 1:number_of_datasets]
  # DATETIME, TIMESTAMP, DATE
  datetime = [Faker.date_time_ad() for _ in 1:number_of_datasets]
  # CHAR
  chara = [randstring(1) for _ in 1:number_of_datasets]
  # VARCHAR, VARCHAR2, TINYTEXT, TEXT
  varcha = [randstring(rand(1:4000)) for _ in 1:number_of_datasets]
  # ENUM
  JobType = ["HR", "Management", "Accounts"]
  enume = [JobType[rand(1:3)] for _ in 1:number_of_datasets]
  return varcha, rfloat, datetime, tint, enume, mint, rint, bint, dfloat, dpfloat, chara, sint
end

function insert_queries(number_of_datasets=100001)
  # Slot 1 holds the CREATE TABLE statement; slots 2:end hold the INSERTs.
  Insert_Query = Vector{String}(undef, number_of_datasets)
  Insert_Query[1] = "CREATE TABLE Employee(ID bigserial, Name VARCHAR(4000), Salary decimal, LastLogin DATE, OfficeNo smallint, JobType VARCHAR(20) CHECK( JobType IN ('HR', 'Management', 'Accounts')), h integer, n bigint, z bigint, z1 double precision, z2 double precision, cha CHAR, empno integer)"
  varcha, rfloat, datetime, tint, enume, mint, rint, bint, dfloat, dpfloat, chara, sint = create_queries(number_of_datasets)
  for i = 2:number_of_datasets
    Insert_Query[i] = "INSERT INTO Employee (ID, Name, Salary, LastLogin, OfficeNo, JobType, h, n, z, z1, z2, cha, empno) VALUES ($(i-1), '$(varcha[i])', $(rfloat[i]), '$(datetime[i])', $(tint[i]), '$(enume[i])', $(mint[i]), $(rint[i]), $(bint[i]), $(dfloat[i]), $(dpfloat[i]), '$(chara[i])', $(sint[i]))"
  end
  return Insert_Query
end

function update_queries(number_of_datasets=100001)
    # One fewer query than insert_queries: there is no CREATE TABLE slot here.
    number_of_datasets = number_of_datasets - 1
    Update_Query = Vector{String}(undef, number_of_datasets)
    varcha, rfloat, datetime, tint, enume, mint, rint, bint, dfloat, dpfloat, chara, sint = create_queries(number_of_datasets)
    for i = 1:number_of_datasets
      Update_Query[i] = "UPDATE Employee SET Name='$(varcha[i])', Salary=$(rfloat[i]), LastLogin='$(datetime[i])', OfficeNo=$(tint[i]), JobType='$(enume[i])', h=$(mint[i]), n=$(rint[i]), z=$(bint[i]), z1=$(dfloat[i]), z2=$(dpfloat[i]), cha='$(chara[i])', empno=$(sint[i]) WHERE ID = $i"
    end
    return Update_Query
end

number_of_datasets = number_of_datasets + 1  # one extra slot for the CREATE TABLE statement
Your_Create_Insert_Queries = insert_queries(number_of_datasets)
Your_Update_Queries = update_queries(number_of_datasets)
#save("Postgres_Insert_Queries.jld", "Insert", insert_queries(number_of_datasets))
#save("update_queries.jld", "Update", update_queries(number_of_datasets))
bmharsha commented 8 years ago

I found the issue: my candidate key (ID) was not indexed in my test table. After indexing it, the updates were much faster.
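
That matches the symptom: without an index on ID, each "UPDATE ... WHERE ID = $i" forces a sequential scan of the whole table, so 100,000 updates scan on the order of 10^10 rows in total, while the INSERTs never need to search. A sketch of the fix, assuming the same DBI-style API as above (the index name is made up):

# A unique index on ID turns each WHERE ID = ... lookup into an
# index scan instead of a full-table sequential scan.
stmt = prepare(conn, "CREATE UNIQUE INDEX employee_id_idx ON Employee (ID)")
execute(stmt)
finish(stmt)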