qianxuanyon closed this issue 4 years ago
Do you just want to be able to execute the shown SQL or if you are looking for ways to make pygrametl Dimensions/FactTables do batch inserts?
> pygrametl Dimensions/FactTables do batch inserts

I expected it to insert multiple rows at once, like in my example. I expected it to execute something like this, but it does not:
> INSERT INTO `userTable` (`user_id`, `user_name`) VALUES
> (1, 'dsf'),
> (2, 'fgy'),
> (3, 'faad');
That probably depends on your PEP 249 module. BatchFactTable uses the PEP 249 cursor's executemany method. According to the PEP 249 documentation:

> Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.
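To make the PEP 249 behavior concrete, here is a minimal sketch of what BatchFactTable effectively does: it buffers rows and hands the whole batch to the cursor's executemany(). The sqlite3 module is used purely for illustration (an assumption, not the Presto setup in this issue); whether the driver turns this into one multi-row INSERT or a loop of single-row INSERTs is the driver's choice, exactly as the quoted passage says.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE userTable (user_id INTEGER, user_name TEXT)")

# A buffered batch of rows, as BatchFactTable would accumulate them.
batch = [(1, "dsf"), (2, "fgy"), (3, "faad")]

# One executemany() call; PEP 249 allows the driver to implement this
# either as a single array operation or as repeated execute() calls.
cur.executemany(
    "INSERT INTO userTable (user_id, user_name) VALUES (?, ?)", batch)
conn.commit()
```

So even when executemany() is called once, the SQL the server sees depends entirely on the PEP 249 module in use.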
I am using PyHive's Presto driver; its executemany method can be used. So is there any way to make the inserts faster? I just want an easy way to batch insert when there is no bulk loading method.
I have looked at the PEP 249 connectors listed on Presto's homepage: PyHive seems to implement executemany() as a loop with multiple calls to execute(), while presto-python-client simply raises an exception if executemany() is called. A quick look through PEP 249 connectors for other systems indicates that the problem is not exclusive to Presto, so adding support for inserting batches using one INSERT statement seems like a useful addition to pygrametl. As a quick and dirty fix right now, I have created this alternative implementation of BatchFactTable.__insertnow():
def __insertnow(self):
    rowToValue = lambda row: '(' + ','.join(map(
        lambda c: "'" + row[c] + "'" if type(row[c]) is str
        else str(row[c]), self.all)) + ')'
    values = map(rowToValue, self.__batch)
    basesql = self.insertsql[:self.insertsql.find(' (') + 1]
    insertsql = basesql + ','.join(values)
    self.targetconnection.execute(insertsql)
If you decide to try this version of __insertnow(), then please report any changes in the performance of your program compared to the version of __insertnow() that uses executemany().
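To show what SQL the method above actually produces, here is a standalone sketch of the same string building extracted into a plain function. The helper name build_multirow_insert is hypothetical (it is not part of pygrametl), and like the original it quotes strings naively, so real code should additionally escape quotes in the values or use driver-side parameters.

```python
def build_multirow_insert(table, columns, batch):
    """Build one multi-row INSERT statement for a batch of row dicts.

    Hypothetical helper mirroring the string building in the
    __insertnow() shown above; values are interpolated naively.
    """
    def row_to_value(row):
        return "(" + ",".join(
            "'" + row[c] + "'" if isinstance(row[c], str) else str(row[c])
            for c in columns
        ) + ")"

    basesql = "INSERT INTO %s (%s) VALUES " % (table, ", ".join(columns))
    return basesql + ",".join(row_to_value(r) for r in batch)

sql = build_multirow_insert(
    "userTable", ["user_id", "user_name"],
    [{"user_id": 1, "user_name": "dsf"},
     {"user_id": 2, "user_name": "fgy"}],
)
print(sql)
# INSERT INTO userTable (user_id, user_name) VALUES (1,'dsf'),(2,'fgy')
```

This is exactly the multi-row form requested at the start of the issue, sent through a single execute() call instead of executemany().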
I tried your method, but the result is not committed automatically.
Unfortunately, that seems to be my fault, as the method I posted is missing a key feature: it does not delete the current batch after it is inserted. As BatchFactTable determines when to load a batch using equality, only the first batch was loaded automatically. The problem should be fixed in this version, although the last batch should of course still be loaded by executing either ConnectionWrapper.commit() or BatchFactTable.endload(), as it might be smaller than batchsize.
def __insertnow(self):
    if self.__batch:
        rowToValue = lambda row: '(' + \
            ','.join(map(lambda c: "'" + row[c] + "'" if type(row[c])
                         is str else str(row[c]), self.all)) + ')'
        values = map(rowToValue, self.__batch)
        basesql = self.insertsql[:self.insertsql.find(' (') + 1]
        insertsql = basesql + ','.join(values)
        self.targetconnection.execute(insertsql)
        self.__batch = []
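To illustrate why clearing the batch matters, here is a small self-contained mock (an assumed simplification, not pygrametl itself) that buffers rows and flushes a multi-row INSERT whenever the buffer reaches the batch size. Without the final reset of the buffer, the size check would never trigger again after the first flush, which is exactly the bug described above.

```python
class MockBatchTable:
    """Toy stand-in for BatchFactTable's batching behavior."""

    def __init__(self, columns, batchsize):
        self.columns = columns
        self.batchsize = batchsize
        self.batch = []
        self.executed = []  # stands in for targetconnection.execute

    def insert(self, row):
        self.batch.append(row)
        if len(self.batch) == self.batchsize:
            self.insertnow()

    def insertnow(self):
        if self.batch:
            values = ",".join(
                "(" + ",".join(
                    "'" + r[c] + "'" if isinstance(r[c], str) else str(r[c])
                    for c in self.columns
                ) + ")"
                for r in self.batch
            )
            self.executed.append(
                "INSERT INTO t (%s) VALUES " % ",".join(self.columns)
                + values)
            self.batch = []  # the previously missing step: reset the batch

t = MockBatchTable(["id", "name"], 2)
for row in [{"id": 1, "name": "a"}, {"id": 2, "name": "b"},
            {"id": 3, "name": "c"}, {"id": 4, "name": "d"}]:
    t.insert(row)
print(len(t.executed))  # 2 statements, one per full batch
```

With the reset in place, each full batch produces its own INSERT; a trailing partial batch would still need an explicit flush, matching the note above about commit() or endload().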
This is what I wanted to achieve.