18F / rdbms-subsetter

Generates a subset of a relational database that respects foreign key constraints
Creative Commons Zero v1.0 Universal
313 stars 30 forks source link

Support MS SQL Server / MSSQL #11

Open apelade opened 9 years ago

apelade commented 9 years ago

MS SQL Server has been fairly popular in government, as well as the small businesses around government. It would be nice to support that.

apelade commented 9 years ago

So far I have tried a combination of drivers:

mssql+adodbapi: did not try - requires sqlalchemy @ version .6 latest is .9 mssql+pyodbc: requires python 3.3, test and mssql random func do insert data, error on test cases https://github.com/18F/rdbms-subsetter/pull/12 mssql+pymssql: trying this now, compatible with latest versions of dependencies. It says pysmssql is treating Decimal datatypes as float.

Both have a memory error when run on our real database, at least with some run parameters supplied.

apelade commented 9 years ago

pymssql:

Running with the mssql random function name as "rand()", https://github.com/apelade/rdbms-subsetter/commit/566d70b32ba2a471d99e229cef014635913f16c2 on our real 250 GB database, getting memory error with both pymssql and pyodbc. Getting an estimated number of rows might help?

Command: rdbms-subsetter -l mssql+pymssql://sa:password1@localhost:1433/source_dev mssql+pymssql://sa:password1@localhost:1433/dest_dev_sm 0.02

Error: ... Create 1 rows from 920634 in .be_dstrb_parm_data Create 1 rows from 921077 in .be_web_notices Create 1 rows from 923883 in .be_dstrb Create 1 rows from 9743 in .be_parm_data_h Create 1 rows from 994071 in .be_barn_estmt_h Proceed? (Y/n) y INFO:root:lowest completeness score (in tp_415_apye_over_limit_2012) at 0.000000 C:\Python34\lib\site-packages\sqlalchemy\dialects\mssql\pymssql.py:31: SAWarning: Dialect mssql+pymssql does not support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage. return sqltypes.Numeric.resultprocessor(self, dialect, type) INFO:root:lowest completeness score (in be_stat_rsn_ref_h) at 0.000000 INFO:root:lowest completeness score (in tp_415_apye_over_limit_2013) at 0.000000 INFO:root:lowest completeness score (in be_prcs_h) at 0.000000 Traceback (most recent call last): File "C:\Python34\Scripts\rdbms-subsetter-script.py", line 9, in load_entry_point('rdbms-subsetter==0.2.1', 'console_scripts', 'rdbms-subsetter')() File "C:\Python34\lib\site-packages\subsetter.py", line 398, in generate source.create_subset_in(target) File "C:\Python34\lib\site-packages\subsetter.py", line 318, in create_subset_in (source_row, prioritized) = target.source.next_row() File "C:\Python34\lib\site-packages\subsetter.py", line 132, in _next_row return (next(self.random_rows), False) # not prioritized File "C:\Python34\lib\site-packages\subsetter.py", line 114, in _random_row_gen_fn results = self.db.conn.execute(qry).fetchall() File "C:\Python34\lib\site-packages\sqlalchemy\engine\result.py", line 787, in fetchall self.cursor, self.context) File "C:\Python34\lib\site-packages\sqlalchemy\engine\base.py", line 1162, in _handle_dbapi_exception util.reraise(*exc_info) File "C:\Python34\lib\site-packages\sqlalchemy\util\compat.py", line 182, in reraise raise value File "C:\Python34\lib\site-packages\sqlalchemy\engine\result.py", line 781, in fetchall l = self.process_rows(self._fetchall_impl()) File "C:\Python34\lib\site-packages\sqlalchemy\engine\result.py", line 748, in _fetchall_impl return self.cursor.fetchall() MemoryError

apelade commented 9 years ago

_find_n_rows() doesn't try to handle mssql drivers, and later in update_sequence() there is a comment about only pg being supported, but not sure if sql server would need that

Try smaller databases