Closed stfran22 closed 10 months ago
@sjmelly I've updated the proposed NETS database design document. I'll bring it up to discuss at our Tuesday meeting. I'm thinking about setting up a meeting with you, Ran, and Matt to converge our ideas about an Azure-hosted SQL database and about using DBT for documentation/transparency/organization of common-use SQL scripts.
\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Sample\NETS_2019SampleData\ProposedNETS2019DrexelDatabaseDesign20230609.docx
@stfran22 I found "accu" as integer in \files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Documentation\NETS Database Schemas 20221108.xlsx. How does this file relate to ProposedNETS2019DrexelDatabaseDesign20230609.docx? Is it superseded by the new file? Is it still useful?
I believe this is just the current schema for the sample database, which can be found in the database itself, and has been superseded by ProposedNETS2019DrexelDatabaseDesign20230609.docx.
@stfran22 to meet with Matt and upload tables to Azure to include output from #13 and #4
Output from tract/zcta level measures will be added as new tables in NETS database.
Tract table columns:
- tract: CHAR(11)
- year: SMALLINT
- category: CHAR(3)
- category_c: BIGINT
- category_d: FLOAT
- category_type: VARCHAR(20)

ZCTA table columns:
- zcta: CHAR(5)
- year: SMALLINT
- category: CHAR(3)
- category_c: BIGINT
- category_d: FLOAT
- category_type: VARCHAR(20)
@sjmelly @jannettim We're going to need to add a few columns in the DunsLocation table to the database schema. Let me know if I'm missing anything from this sql statement to add the columns:
ALTER TABLE NETS.DunsLocation
ADD ct10_distance REAL,
    ZCTA5CE10 CHAR(5),
    ZCTA_AreaLand REAL,
    ZCTA_TotalArea REAL;
@sjmelly just want to confirm that we are actually not including ZCTA for the NETS2019 database, per our meeting with Kari last Thursday.
posting error from BusinessInfo bulk import:
```
100000 rows sent to SQL Server. Total sent: 136200000
SQLState = 25000, NativeError = 3908
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Could not run BEGIN TRANSACTION in database 'cd3088c6-0b50-42cf-a8b3-47418f609254' because the database is in emergency mode or is damaged and must be restarted.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 9001.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 9001.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 3314.
SQLState = S1000, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Unspecified error occurred on SQL Server. Connection may have been terminated by the server.

BCP copy in failed
```
@stfran22 that is an issue.
The docs say that error 40197 is triggered when "...the service is down due to software or hardware upgrades, hardware failures, or any other failover problems. The error code (%d) embedded within the message of error 40197 provides additional information about the kind of failure or failover that occurred. Some examples of the error codes are embedded within the message of error 40197 are 40020, 40143, 40166, and 40540."
So it's possible that there was some hardware issue at Azure's data center. I don't see any error messages or warnings on the Azure side now, so it may have been resolved.
Right now I don't see anything wrong with it (though I don't have login permissions for the server). Have you tried signing in via SQL Server Management Studio to see if you can access it that way?
Microsoft also has a script to test connectivity if you are still having issues:
Here's the code in text so you can copy and paste:
```powershell
$parameters = @{
    Server = '.database.windows.net'
    Database = ''  # Set the name of the database to test. 'master' is the default.
    User = ''      # Set the sign-in username. 'AzSQLConnCheckerUser' is the default.
    Password = ''  # Set the sign-in password. 'AzSQLConnCheckerPassword' is the default.

    ## Optional parameters (default values are used if parameters are omitted.)
    SendAnonymousUsageData = $true  # Set as $true (default) or $false.
    RunAdvancedConnectivityPolicyTests = $true  # Set as $true (default) or $false.
        # This parameter loads the connectivity test library from Microsoft's GitHub repository.
    CollectNetworkTrace = $true  # Set as $true (default) or $false.
    #EncryptionProtocol = ''  # Supported values: 'Tls 1.0', 'Tls 1.1', 'Tls 1.2'.
        # Without this parameter, the operating system chooses the protocol.
}

$ProgressPreference = "SilentlyContinue";
$scriptUrlBase = 'https://raw.githubusercontent.com/Azure/SQL-Connectivity-Checker/master'
Invoke-Command -ScriptBlock ([Scriptblock]::Create(
    (iwr ($scriptUrlBase + '/AzureSQLConnectivityChecker.ps1')).Content
)) -ArgumentList $parameters
#end
```
Sweet, thanks, I'll try that out. It did load nearly half of the records it was supposed to; can I set a starting point to skip those and add the rest?
Yes, provided the data is actually in the database, you should be able to set the first row to load using the -F option of the bcp utility. You just have to find the number of the first row that was not inserted and supply it as the argument. Alternatively, you could create a new file that removes all of the already-inserted rows and load that file instead.
Was thrown this error when trying to load the rest of this table. I looked up the number of rows in the table (136,200,000) and set -F to 136200001. How is it trying to insert a duplicate key? Is there a way to figure out the last record inserted into a table? I may have to split this file in Python or just start over from the top.
```
C:\Users\stf45>bcp NETS.dbo.BusinessInfo in "\\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\BusinessInfoDB20230803.txt" -S netsdata.database.windows.net -e error_file.txt -t \t -G -U stf45@drexel.edu -c -C 65001 -F 136200001 -b 100000
Starting copy...
SQLState = 23000, NativeError = 2627
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Violation of PRIMARY KEY constraint 'PK__Business__0A4F72463A4588B9'. Cannot insert duplicate key in object 'dbo.BusinessInfo'. The duplicate key value is (136705659_2009).
SQLState = 01000, NativeError = 3621
Warning = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The statement has been terminated.
BCP copy in failed
```
Results from:
SELECT *
FROM BusinessInfo
WHERE DunsNumber = '136705477'
OR DunsNumber = '136705659'
| DunsYear | DunsNumber | Year | Company | TradeName | Emp | Sales | SIC |
|---|---|---|---|---|---|---|---|
| 136705477_2004 | 136705477 | 2004 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 50000 | 72210000 |
| 136705477_2005 | 136705477 | 2005 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 50000 | 72210000 |
| 136705477_2006 | 136705477 | 2006 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 50000 | 72210000 |
| 136705477_2007 | 136705477 | 2007 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 50000 | 72210000 |
| 136705477_2008 | 136705477 | 2008 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 50000 | 72210000 |
| 136705477_2009 | 136705477 | 2009 | CREATIVE PHOTOGRAPHY INC | NULL | 1 | 47000 | 72210000 |
| 136705659_2004 | 136705659 | 2004 | CRUMB S FAST TAX | NULL | 2 | 70000 | 72910000 |
| 136705659_2005 | 136705659 | 2005 | CRUMB S FAST TAX | NULL | 2 | 66700 | 72910000 |
| 136705659_2006 | 136705659 | 2006 | CRUMB S FAST TAX | NULL | 2 | 65000 | 72910000 |
| 136705659_2007 | 136705659 | 2007 | CRUMB S FAST TAX | NULL | 2 | 63900 | 72910000 |
| 136705659_2008 | 136705659 | 2008 | CRUMB S FAST TAX | NULL | 2 | 64000 | 72910000 |
| 136705659_2009 | 136705659 | 2009 | CRUMB S FAST TAX | NULL | 2 | 66000 | 72910000 |
Ended up creating a subset of the full file where records are not loaded in database.
Grab DunsYears in db:
SELECT DunsYear
FROM BusinessInfo
Load the full file in Python, then subset for DunsYears not in the partial set (the records already in the database):
fullfile.loc[~fullfile['DunsYear'].isin(partial['DunsYear'])]
Then export file and resume bcp upload. Will check number of records/unique DunsYears once load is complete.
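The steps above can be sketched in pandas (a sketch with made-up sample rows; the real workflow reads the exported DunsYear list and the full extract from disk, and the tab-delimited output matches the bcp commands in this thread):

```python
import pandas as pd

# DunsYears already in the database, e.g. the export of
# SELECT DunsYear FROM BusinessInfo (made-up sample rows)
partial = pd.DataFrame({"DunsYear": ["136705477_2004", "136705477_2005"]})

# The full extract that bcp was loading (made-up sample rows)
fullfile = pd.DataFrame({
    "DunsYear": ["136705477_2004", "136705477_2005", "136705659_2004"],
    "Emp": [1, 1, 2],
})

# Anti-join: keep only the rows not already present in the database
remaining = fullfile.loc[~fullfile["DunsYear"].isin(partial["DunsYear"])]

# Export tab-delimited; with the header row kept, bcp can skip it via -F 2
remaining.to_csv("BusinessInfoRemaining.txt", sep="\t", index=False)
```

This avoids guessing a row offset for -F, which fails when the committed batches are not a contiguous prefix of the file.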
bcp upload was interrupted, probably by the machine being restarted over the weekend. I subset in the same way as described above and restarted, then got the error below. Will subset and restart again before EOD today.
```
C:\Users\stf45>bcp NETS.dbo.BusinessInfo in "D:\NETS\NETS_2019\ProcessedData\BusinessInfoRemaining20230814.txt" -S netsdata.database.windows.net -e error_file.txt -t \t -G -U stf45@drexel.edu -c -C 65001 -F 2 -b 100000
Starting copy...
100000 rows sent to SQL Server. Total sent: 100000
100000 rows sent to SQL Server. Total sent: 200000
100000 rows sent to SQL Server. Total sent: 300000
100000 rows sent to SQL Server. Total sent: 400000
100000 rows sent to SQL Server. Total sent: 500000
100000 rows sent to SQL Server. Total sent: 600000
100000 rows sent to SQL Server. Total sent: 700000
100000 rows sent to SQL Server. Total sent: 800000
100000 rows sent to SQL Server. Total sent: 900000
100000 rows sent to SQL Server. Total sent: 1000000
100000 rows sent to SQL Server. Total sent: 1100000
100000 rows sent to SQL Server. Total sent: 1200000
100000 rows sent to SQL Server. Total sent: 1300000
SQLState = 08S01, NativeError = 10054
Error = [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: An existing connection was forcibly closed by the remote host.
SQLState = 08S01, NativeError = 10054
Error = [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure
SQLState = 08S01, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure
(the "Communication link failure" pair repeated eight more times)

1300000 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 9129563 Average : (142.39 rows per sec.)
```
@jannettim any way around this that isn't too costly?
SQLState = 37000, NativeError = 40544 Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The database 'NETS' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
Well, essentially you have hit 32 GB, which is the default max size of the database. The only way around this (especially because we don't have indexes yet and there isn't any data you would want to delete) is to increase the size of the database.
The most cost-effective way to do this is to increase the amount of space just enough to cover what you need. Each additional GB of data costs an additional 12 cents (increasing the size of the database also increases the size of the log). Right now storage alone costs about $5 per month; increasing it to 50 GB will cost about $8 per month (plus whatever usage billing you have in that time period).
How much data do you have left to upload?
The formula for total cost of storage is:
(x + 0.3x) × $0.12
Where x is the total number of GB of the database.
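As a sanity check, the formula reproduces the dollar figures quoted in this thread. A minimal sketch (assuming $0.12 per GB per month, with log space billed at 30% of data size):

```python
def monthly_storage_cost(data_gb: float, rate_per_gb: float = 0.12) -> float:
    """Monthly Azure SQL storage cost: data plus a 30% log allowance."""
    return (data_gb + 0.3 * data_gb) * rate_per_gb

print(round(monthly_storage_cost(32), 2))  # 4.99, the ~$5/month at the 32 GB default
print(round(monthly_storage_cost(50), 2))  # 7.8, the ~$8/month after scaling to 50 GB
print(round(monthly_storage_cost(15), 2))  # 2.34, the cost of 15 additional GB
```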
I'd say about 20 GB. Let me know if there's anything I can do to help move it forward!
OK database is scaled to 52 GB, you should be able to insert again now. Let me know if you need anything else
I need to add the tract level measures to the database. With that, I need to add the tract level variables to the CategoryDescriptions table so we can filter by domain, etc.
@jannettim we also reached database capacity again. Would you mind increasing capacity by 15 GB?
@stfran22 I can do that. But before I do: the cost of the first month of the NETS data was more than expected, almost certainly due to the compute time for the load. Depending on how much is left to load into the database, this month should be significantly cheaper. I just want to make sure the team is aware of this before we increase the costs any more.
Also going to notify @kam642 about it.
Ok thanks for letting me know. Is Jana's grant picking up the tab for this? Or Sharrelle's?
@stfran22 @jannettim R&D core is covering the cost of NETS on Azure. Would like to keep costs down as much as possible.
@kam642 I will say that storage is relatively cheap. An additional 15 GB of data will cost another $2.34. The expenses for storage will total $12.79/month after this increase.
So far, as of this morning, for about half the month, the NETS database has cost us a total of $9.67. Now, the load that @stfran22 will be doing will increase that depending on how long the load takes.
@jannettim Do you need to do something to give me access to the new NETS database? When I log into Azure I see: System Databases; GeospatialTest; NETSSample; SQLWorkshop; UHC_Admin; Integration Services Catalog
@sjmelly the NETS database is on a different server, here is the address: netsdata.database.windows.net.
You will probably also need to have your account added to the database and be given permissions on the tables. However, since I am not allowed to see the full NETS 2019 data, I am not an admin on the database so I cannot do this. @stfran22 will need to give you access.
@stfran22 if you need help with doing this let me know.
@sjmelly Matt just helped me add you as an owner - let me know if it works now.
@jannettim @stfran22 I'm getting this message when connecting from my Drexel desktop on campus
Oh right I need to add your computer to the firewall. Give me a moment.
OK, @sjmelly, your work machine should be added to the firewall whitelist. Try again.
Now I'm getting this message
It could be that the database had paused. Sometimes it takes a moment for the database to turn on (though Stevie was just on it, so it shouldn't have been down). But try once more and see if it works.
Still not working. I clicked on show details
@sjmelly I just tried something Matt left me earlier. Try again!
@stfran22 I don't think this is a database issue, I think it's a server problem. Usually that message just means that there was a lag in the server turning on so it couldn't find it. But it's weird, it's either shutting off super quick or not letting Steve on. @sjmelly if you still can't access it after Stevie's fix, do you have time to meet tomorrow? Maybe if I see what you are doing I can debug it a little more effectively? I should be free from 9-12 and from 1-3.
It's still not working @jannettim. Either of those times works for me.
@jannettim @stfran22 Ok to increase storage.
@stfran22 @kam642 size of the database has increased
@jannettim short on space again. We now have 227,500,000 of 252,088,222 records of the DunsMove table (second to last) uploaded.
How much do you think you need?
cc: @kam642
@jannettim well that leaves around 10% of that file, which should be about 1.3 GB, plus 6.75 GB for the tract-level file. Since there seems to be some other overhead, I'd just round up to 10 GB.
@jannettim @stfran22 Ok to increase size.
I am looking into adding BEDDN tract-level measures as a view (possibly a materialized view) in the NETS database. This way, we won't have to import the entire tract-level dataset into the database because it is derived from data already in the database.
@stfran22 do DisplayX and DisplayY in DunsLocation only have 5 decimal places? That's how they appeared in my query. I usually recommend at least 6 decimal places for latitudes and longitudes.
Ah, I see, that's a limitation of the current data type for that column. We need to change it from REAL to FLOAT(53). I'll try that out. https://learn.microsoft.com/en-us/sql/t-sql/data-types/float-and-real-transact-sql?view=sql-server-ver16
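The rounding is easy to reproduce (a sketch with a made-up coordinate; SQL Server's REAL is a 32-bit float, while FLOAT(53) is a 64-bit double, the same as a Python float):

```python
import struct

lon = -75.1652215  # made-up longitude with 7 decimal places

# Round-trip through 32 bits: what a REAL column actually stores
as_real = struct.unpack("f", struct.pack("f", lon))[0]

print(f"{as_real:.7f}")  # the trailing decimals drift
print(f"{lon:.7f}")      # -75.1652215, intact as a 64-bit double

# Near |value| = 75, adjacent 32-bit floats are ~7.6e-6 apart, so only
# about 5 decimal places of a coordinate are reliable in a REAL column.
```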
We can use the geodatabase feature class saved to the server, or the X and Y coordinates, for this version, but it will be a good change to make in future versions.
Tried to change the data type of one of the columns, and the query took over an hour (canceled). We'll have to remember to make this change when creating the schema for the NETS 2022 database.
Closing this as we are focusing on NETS2022.
This issue will be used to discuss the final NETS data warehousing solution.