chrthomsen / pygrametl

Official repository for pygrametl - ETL programming in Python
http://pygrametl.org
BSD 2-Clause "Simplified" License

How to insert multiple rows at the same time using sql #19

Closed qianxuanyon closed 4 years ago

qianxuanyon commented 4 years ago

I want to achieve this:


> INSERT INTO `userTable` (`user_id`, `user_name`) VALUES
> (1, 'dsf'),
> (2, 'fgy'),
> (3, 'faad');

I hope you can provide a simple example.
chrthomsen commented 4 years ago

Do you just want to be able to execute the shown SQL, or are you looking for ways to make pygrametl Dimensions/FactTables do batch inserts?

qianxuanyon commented 4 years ago

> Do you just want to be able to execute the shown SQL, or are you looking for ways to make pygrametl Dimensions/FactTables do batch inserts?

I want pygrametl Dimensions/FactTables to do batch inserts. I expected it to insert multiple rows at once, like in my example.


I expected it to execute like this, but it does not:


> INSERT INTO `userTable` (`user_id`, `user_name`) VALUES
> (1, 'dsf'),
> (2, 'fgy'),
> (3, 'faad');
chrthomsen commented 4 years ago

That probably depends on your PEP 249 module. BatchFactTable uses the PEP 249 cursor's executemany method. According to the PEP 249 documentation:

> Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.
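For reference, the standard library's sqlite3 module is a PEP 249 connector whose cursor provides executemany(); the following minimal sketch reuses the table and rows from the example above:

```python
import sqlite3

# Minimal PEP 249 sketch using the stdlib sqlite3 connector. Whether
# executemany() loops over execute() or processes the sequence in one
# call is left to the driver, exactly as the quoted passage says.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE userTable (user_id INTEGER, user_name TEXT)")

rows = [(1, "dsf"), (2, "fgy"), (3, "faad")]
cur.executemany(
    "INSERT INTO userTable (user_id, user_name) VALUES (?, ?)", rows)
conn.commit()
```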

qianxuanyon commented 4 years ago

> That probably depends on your PEP 249 module. BatchFactTable uses the PEP 249 cursor's executemany method. According to the PEP 249 documentation:
>
> > Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.

I am using PyHive's Presto connector, and its executemany method can be used.

So is there any way to make the insertion faster?

I just want an easy way to batch insert when no bulk loading method is available.


skejserjensen commented 4 years ago

I have looked at the PEP 249 connectors listed on Presto's homepage, and PyHive seems to be using a loop with multiple calls to execute(), while presto-python-client simply raises an exception if executemany() is called. A quick look through PEP 249 connectors for other systems indicates that the problem is not exclusive to Presto, so adding support for inserting batches using one insert statement seems like a useful addition to pygrametl. As a quick and dirty fix right now, I have created this alternative implementation of BatchFactTable.__insertnow().

def __insertnow(self):
    # Format each row as a parenthesized value list, quoting strings,
    # e.g., {'user_id': 1, 'user_name': 'dsf'} becomes (1,'dsf')
    rowToValue = lambda row: '(' + ','.join(map(lambda c: "'" + row[c]
        + "'" if type(row[c]) is str else str(row[c]), self.all)) + ')'
    values = map(rowToValue, self.__batch)
    # Keep the INSERT INTO table(...) VALUES prefix of the prepared
    # statement and append all rows as one multi-row VALUES clause
    basesql = self.insertsql[:self.insertsql.find(' (') + 1]
    insertsql = basesql + ','.join(values)
    self.targetconnection.execute(insertsql)

If you decide to try using this version of __insertnow(), then please report any changes in the performance of your program compared to the version using executemany().
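The string building in the replacement can also be illustrated in isolation. The following standalone sketch uses an illustrative helper, build_insert, which is not part of the pygrametl API; note that this kind of naive quoting does not escape values, so it is unsafe for untrusted input:

```python
def build_insert(table, columns, rows):
    # Build a single multi-row INSERT statement, quoting strings naively
    # like the __insertnow() above. Unsafe for untrusted input: values
    # are not escaped, so a quote in the data breaks the statement.
    def to_value(row):
        return '(' + ','.join(
            "'" + row[c] + "'" if isinstance(row[c], str) else str(row[c])
            for c in columns) + ')'
    return ('INSERT INTO {}({}) VALUES '.format(table, ','.join(columns))
            + ','.join(map(to_value, rows)))

rows = [{'user_id': 1, 'user_name': 'dsf'},
        {'user_id': 2, 'user_name': 'fgy'}]
print(build_insert('userTable', ['user_id', 'user_name'], rows))
# INSERT INTO userTable(user_id,user_name) VALUES (1,'dsf'),(2,'fgy')
```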

qianxuanyon commented 4 years ago

> I have looked at the PEP 249 connectors listed on Presto's homepage, and PyHive seems to be using a loop with multiple calls to execute(), while presto-python-client simply raises an exception if executemany() is called. A quick look through PEP 249 connectors for other systems indicates that the problem is not exclusive to Presto, so adding support for inserting batches using one insert statement seems like a useful addition to pygrametl. As a quick and dirty fix right now, I have created this alternative implementation of BatchFactTable.__insertnow().
>
>     def __insertnow(self):
>         rowToValue = lambda row: '(' + ','.join(map(lambda c: "'" + row[c]
>             + "'" if type(row[c]) is str else str(row[c]), self.all)) + ')'
>         values = map(rowToValue, self.__batch)
>         basesql = self.insertsql[:self.insertsql.find(' (') + 1]
>         insertsql = basesql + ','.join(values)
>         self.targetconnection.execute(insertsql)
>
> If you decide to try using this version of __insertnow(), then please report any changes in the performance of your program compared to the version using executemany().

I tried your method, but the rows are not committed automatically.


skejserjensen commented 4 years ago

Unfortunately, that seems to be my fault, as the method I posted is missing a key feature: it does not clear the current batch after it is inserted. Since BatchFactTable determines when to load a batch by checking whether the batch has reached batchsize, only the first batch was loaded automatically. The problem should be fixed in this version, although the last batch should of course still be loaded by executing either ConnectionWrapper.commit() or BatchFactTable.endload(), as it might be smaller than batchsize.

def __insertnow(self):
    if self.__batch:
        # Format each row as a parenthesized value list, quoting strings
        rowToValue = lambda row: '(' + \
            ','.join(map(lambda c: "'" + row[c] + "'" if type(row[c])
                         is str else str(row[c]), self.all)) + ')'
        values = map(rowToValue, self.__batch)
        basesql = self.insertsql[:self.insertsql.find(' (') + 1]
        insertsql = basesql + ','.join(values)
        self.targetconnection.execute(insertsql)
        # Clear the batch so the next batchsize rows trigger a new insert
        self.__batch = []
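Why the reset matters can be shown without pygrametl. The following MiniBatcher is an illustrative stand-in, not pygrametl code, mimicking size-triggered flushing with an explicit final flush like endload()/commit():

```python
class MiniBatcher:
    """Minimal sketch of size-triggered batching: rows accumulate in a
    buffer that is flushed whenever it reaches batchsize."""

    def __init__(self, batchsize):
        self.batchsize = batchsize
        self.batch = []
        self.flushed = []  # stands in for rows written to the database

    def insert(self, row):
        self.batch.append(row)
        if len(self.batch) == self.batchsize:
            self.flush()

    def flush(self):
        if self.batch:
            self.flushed.append(list(self.batch))
            # Without this reset, len(self.batch) == self.batchsize
            # never holds again, so only the first batch is flushed
            self.batch = []

b = MiniBatcher(batchsize=2)
for i in range(5):
    b.insert(i)
b.flush()  # load the final, possibly smaller, batch explicitly
print(b.flushed)  # [[0, 1], [2, 3], [4]]
```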