Drexel-UHC / NETS

Files used to wrangle National Establishment Time Series datasets for neighborhood public health research

NETS Database #7

Closed stfran22 closed 10 months ago

stfran22 commented 1 year ago

This issue will be used to discuss the final NETS data warehousing solution.

stfran22 commented 1 year ago

@sjmelly I've updated the proposed NETS database design document. I'll bring it up for discussion at our Tuesday meeting. I'm thinking about having a meeting with you, Ran, and Matt about converging our ideas on an Azure-hosted SQL database and using dbt for documentation/transparency/organization of common-use SQL scripts.

\\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Sample\NETS_2019SampleData\ProposedNETS2019DrexelDatabaseDesign20230609.docx

sjmelly commented 1 year ago

@stfran22 I noticed that "accu" is listed as integer in \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Documentation\NETS Database Schemas 20221108.xlsx. How does this file relate to ProposedNETS2019DrexelDatabaseDesign20230609.docx? Is it superseded by the new file? Is it still useful?

stfran22 commented 1 year ago

I believe this is just the current schema for the sample database, which can be found in the database itself, and has been superseded by ProposedNETS2019DrexelDatabaseDesign20230609.docx.

sjmelly commented 1 year ago

@stfran22 to meet with Matt and upload tables to Azure to include output from #13 and #4

stfran22 commented 1 year ago

Output from tract/zcta level measures will be added as new tables in NETS database.

Tract table columns: tract CHAR(11), year SMALLINT, category CHAR(3), category_c BIGINT, category_d FLOAT, category_type VARCHAR(20)

ZCTA table columns: zcta CHAR(5), year SMALLINT, category CHAR(3), category_c BIGINT, category_d FLOAT, category_type VARCHAR(20)
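
A minimal DDL sketch for the tract table (the table name and primary key here are assumptions, not the final design; the ZCTA table would be analogous):

CREATE TABLE dbo.TractMeasures (
    tract          CHAR(11)    NOT NULL,
    year           SMALLINT    NOT NULL,
    category       CHAR(3)     NOT NULL,
    category_c     BIGINT      NULL,
    category_d     FLOAT       NULL,
    category_type  VARCHAR(20) NULL,
    CONSTRAINT PK_TractMeasures PRIMARY KEY (tract, year, category)  -- key columns are an assumption
);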

stfran22 commented 1 year ago

@sjmelly @jannettim We're going to need to add a few columns in the DunsLocation table to the database schema. Let me know if I'm missing anything from this sql statement to add the columns:

ALTER TABLE NETS.DunsLocation ADD ct10_distance REAL, ZCTA5CE10 CHAR(5), ZCTA_AreaLand REAL, ZCTA_TotalArea REAL;

stfran22 commented 1 year ago

@sjmelly just want to confirm that we are actually not including ZCTA in the NETS2019 database, as per our meeting with Kari last Thursday.

stfran22 commented 1 year ago

posting error from BusinessInfo bulk import:

100000 rows sent to SQL Server. Total sent: 136200000
SQLState = 25000, NativeError = 3908
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Could not run BEGIN TRANSACTION in database 'cd3088c6-0b50-42cf-a8b3-47418f609254' because the database is in emergency mode or is damaged and must be restarted.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 9001.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 9001.
SQLState = S1000, NativeError = 40197
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The service has encountered an error processing your request. Please try again. Error code 3314.
SQLState = S1000, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Unspecified error occurred on SQL Server. Connection may have been terminated by the server.
BCP copy in failed

jannettim commented 1 year ago

@stfran22 that is an issue.

The docs say that error 40197 is triggered when "...the service is down due to software or hardware upgrades, hardware failures, or any other failover problems. The error code (%d) embedded within the message of error 40197 provides additional information about the kind of failure or failover that occurred. Some examples of the error codes embedded within the message of error 40197 are 40020, 40143, 40166, and 40540."

So it's possible that there was some hardware issue at Azure's data center. I don't see any error messages or warnings on the Azure side now, so it may have been resolved.

Right now I don't see anything wrong with it (though I don't have login permissions for the server). Have you tried signing in via SQL Server Management Studio and seeing if you can access it that way?

Microsoft also has a script to test connectivity if you are still having issues:


Here's the code in text so you can copy and paste:

$parameters = @{
    Server = '.database.windows.net'  # Set the server name (for example, yourserver.database.windows.net).
    Database = ''  # Set the name of the database to test. 'master' is the default.
    User = ''  # Set the sign-in username. 'AzSQLConnCheckerUser' is the default.
    Password = ''  # Set the sign-in password. 'AzSQLConnCheckerPassword' is the default.

    ## Optional parameters (default values are used if parameters are omitted.)
    SendAnonymousUsageData = $true  # Set as $true (default) or $false.
    RunAdvancedConnectivityPolicyTests = $true  # Set as $true (default) or $false.
        # This parameter loads the connectivity test library from Microsoft's GitHub repository.
    CollectNetworkTrace = $true  # Set as $true (default) or $false.
    #EncryptionProtocol = ''  # Supported values: 'Tls 1.0', 'Tls 1.1', 'Tls 1.2'
        # Without this parameter, the operating system chooses the protocol.
}
$ProgressPreference = "SilentlyContinue"
$scriptUrlBase = 'https://raw.githubusercontent.com/Azure/SQL-Connectivity-Checker/master'
Invoke-Command -ScriptBlock ([Scriptblock]::Create(
    (iwr ($scriptUrlBase + '/AzureSQLConnectivityChecker.ps1')).Content
)) -ArgumentList $parameters
#end

stfran22 commented 1 year ago

Sweet, thanks, I'll try that out. It did load nearly half of the records it was supposed to; can I set a starting point to skip those and add the rest?

jannettim commented 1 year ago

Yes, provided the data is actually in the database, you should be able to set a starting point for the first row to load using the -F option in the bcp utility. You just have to find the number of the first row not inserted and supply that row as an argument. Alternatively you could create a new file that removes all of the already inserted rows and just load that file instead.
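
For example (a minimal sketch, assuming the committed rows form an unbroken prefix of the input file), you can count the rows already in the table and pass the next row number to -F:

SELECT COUNT(*) AS rows_loaded
FROM dbo.BusinessInfo;
-- if rows_loaded is 136200000, restart the load with -F 136200001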

stfran22 commented 1 year ago

Got this error when trying to load the rest of this table. I looked up the number of rows in the table (136200000) and set -F to 136200001. How is it trying to insert a duplicate key? Is there a way to figure out the last record inserted into a table? I may have to split this file in Python or just start over from the top.

C:\Users\stf45>bcp NETS.dbo.BusinessInfo in "\\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\BusinessInfoDB20230803.txt" -S netsdata.database.windows.net -e error_file.txt -t \t -G -U stf45@drexel.edu -c -C 65001 -F 136200001 -b 100000

Starting copy...
SQLState = 23000, NativeError = 2627
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Violation of PRIMARY KEY constraint 'PK__Business__0A4F72463A4588B9'. Cannot insert duplicate key in object 'dbo.BusinessInfo'. The duplicate key value is (136705659_2009).
SQLState = 01000, NativeError = 3621
Warning = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The statement has been terminated.

BCP copy in failed

Results from the following query:

SELECT *
FROM BusinessInfo
WHERE DunsNumber = '136705477' 
OR DunsNumber = '136705659'

DunsYear        DunsNumber  Year  Company                   TradeName  Emp  Sales  SIC
136705477_2004  136705477   2004  CREATIVE PHOTOGRAPHY INC  NULL       1    50000  72210000
136705477_2005  136705477   2005  CREATIVE PHOTOGRAPHY INC  NULL       1    50000  72210000
136705477_2006  136705477   2006  CREATIVE PHOTOGRAPHY INC  NULL       1    50000  72210000
136705477_2007  136705477   2007  CREATIVE PHOTOGRAPHY INC  NULL       1    50000  72210000
136705477_2008  136705477   2008  CREATIVE PHOTOGRAPHY INC  NULL       1    50000  72210000
136705477_2009  136705477   2009  CREATIVE PHOTOGRAPHY INC  NULL       1    47000  72210000
136705659_2004  136705659   2004  CRUMB S FAST TAX          NULL       2    70000  72910000
136705659_2005  136705659   2005  CRUMB S FAST TAX          NULL       2    66700  72910000
136705659_2006  136705659   2006  CRUMB S FAST TAX          NULL       2    65000  72910000
136705659_2007  136705659   2007  CRUMB S FAST TAX          NULL       2    63900  72910000
136705659_2008  136705659   2008  CRUMB S FAST TAX          NULL       2    64000  72910000
136705659_2009  136705659   2009  CRUMB S FAST TAX          NULL       2    66000  72910000

stfran22 commented 1 year ago

Ended up creating a subset of the full file containing only the records not already loaded in the database.

Grab DunsYears in db:

SELECT DunsYear 
FROM BusinessInfo 

Load the full file in Python, then subset to the DunsYears not in the partial set (i.e., not among the records already in the database): fullfile.loc[~fullfile['DunsYear'].isin(partial['DunsYear'])]

Then export file and resume bcp upload. Will check number of records/unique DunsYears once load is complete.
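
A minimal sanity check for once the load finishes (a sketch): the row count and the distinct DunsYear count should match if nothing was duplicated or skipped.

SELECT COUNT(*) AS n_rows, COUNT(DISTINCT DunsYear) AS n_dunsyears
FROM dbo.BusinessInfo;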

stfran22 commented 1 year ago

The bcp upload was interrupted, probably by the machine being restarted over the weekend. I subset the file in the same way as described above and restarted, then got the error below. Will subset and restart again before EOD today.

C:\Users\stf45>bcp NETS.dbo.BusinessInfo in "D:\NETS\NETS_2019\ProcessedData\BusinessInfoRemaining20230814.txt" -S netsdata.database.windows.net -e error_file.txt -t \t -G -U stf45@drexel.edu -c -C 65001 -F 2 -b 100000

Starting copy...
100000 rows sent to SQL Server. Total sent: 100000
100000 rows sent to SQL Server. Total sent: 200000
100000 rows sent to SQL Server. Total sent: 300000
100000 rows sent to SQL Server. Total sent: 400000
100000 rows sent to SQL Server. Total sent: 500000
100000 rows sent to SQL Server. Total sent: 600000
100000 rows sent to SQL Server. Total sent: 700000
100000 rows sent to SQL Server. Total sent: 800000
100000 rows sent to SQL Server. Total sent: 900000
100000 rows sent to SQL Server. Total sent: 1000000
100000 rows sent to SQL Server. Total sent: 1100000
100000 rows sent to SQL Server. Total sent: 1200000
100000 rows sent to SQL Server. Total sent: 1300000
SQLState = 08S01, NativeError = 10054
Error = [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: An existing connection was forcibly closed by the remote host.
SQLState = 08S01, NativeError = 10054
Error = [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure
SQLState = 08S01, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure
(the "Communication link failure" error with NativeError = 0 then repeats several more times)

1300000 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 9129563  Average : (142.39 rows per sec.)

stfran22 commented 1 year ago

@jannettim any way around this that isn't too costly?

SQLState = 37000, NativeError = 40544 Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The database 'NETS' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.

jannettim commented 1 year ago

Well essentially you have hit 32 GB which is the default max size of the database. The only way around this (especially because we don't have indexes yet and there isn't any data that you would want to delete) is to increase the size of the database.

The most cost-effective way to do this is to increase the amount of space just enough to cover what you need. Each additional GB of data will cost an additional 12 cents (increasing the size of the database also increases the size of the log). Right now storage alone costs about $5 per month; increasing it to 50 GB will cost about $8 per month (plus whatever usage billing you have in that time period).
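
For reference, the resize itself is a one-line statement (a sketch; it can also be done from the Azure portal, and 50 GB is just the example size above):

ALTER DATABASE NETS MODIFY (MAXSIZE = 50 GB);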

How much data do you have left to upload?

jannettim commented 1 year ago

The formula for the total monthly cost of storage is:

(x + 0.3x) × $0.12

where x is the size of the database in GB (the extra 0.3x accounts for the log). For example, an additional 15 GB works out to 15 × 1.3 × $0.12 ≈ $2.34 per month.

stfran22 commented 1 year ago

I'd say about 20 GB. Let me know if there's anything I can do to help move it forward!

jannettim commented 1 year ago

OK, the database is scaled to 52 GB; you should be able to insert again now. Let me know if you need anything else.

stfran22 commented 1 year ago

I need to add the tract-level measures to the database. With that, I need to add the tract-level variables to the CategoryDescriptions table so we can filter by domain, etc.
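
As a sketch of the intended use (the tract measures table name and a domain column on CategoryDescriptions are assumptions here), this is the kind of filtering it would enable:

SELECT t.tract, t.year, t.category, t.category_c, t.category_d
FROM TractMeasures AS t
JOIN CategoryDescriptions AS c
    ON c.category = t.category
WHERE c.domain = 'food';  -- 'domain' and its value are illustrative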

@jannettim we also reached database capacity again. Would you mind increasing capacity by 15 GB?

jannettim commented 1 year ago

@stfran22 I can do that. But before I do: the cost of the first month of the NETS data was higher than expected, almost certainly due to the compute time for the load. Depending on how much is left to load into the database, this month should be significantly cheaper. So I just want to make sure the team is aware of this before we increase the costs any more.

Also going to notify @kam642 about it.

stfran22 commented 1 year ago

Ok thanks for letting me know. Is Jana's grant picking up the tab for this? Or Sharrelle's?

kam642 commented 1 year ago

@stfran22 @jannettim R&D core is covering the cost of NETS on Azure. Would like to keep costs down as much as possible.

jannettim commented 1 year ago

@kam642 I will say that storage is relatively cheap. An additional 15 GB of data will cost another $2.34. The expenses for storage will total $12.79/month after this increase.

So far, as of this morning, for about half the month, the NETS database has cost us a total of $9.67. Now, the load that @stfran22 will be doing will increase that depending on how long the load takes.

sjmelly commented 1 year ago

@jannettim Do you need to do something to give me access to the new NETS database? When I log into Azure I see: System Databases; GeospatialTest; NETSSample; SQLWorkshop; UHC_Admin; Integration Services Catalog

jannettim commented 1 year ago

@sjmelly the NETS database is on a different server, here is the address: netsdata.database.windows.net.

You will probably also need to have your account added to the database and be given permissions on the tables. However, since I am not allowed to see the full NETS 2019 data, I am not an admin on the database so I cannot do this. @stfran22 will need to give you access.

@stfran22 if you need help with doing this let me know.
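
For reference, granting access looks roughly like this (a sketch, run inside the NETS database by an admin; the account name is a placeholder):

CREATE USER [someone@drexel.edu] FROM EXTERNAL PROVIDER;
ALTER ROLE db_owner ADD MEMBER [someone@drexel.edu];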

stfran22 commented 1 year ago

@sjmelly Matt just helped me add you as an owner - let me know if it works now.

sjmelly commented 1 year ago

@jannettim @stfran22 I'm getting this message when connecting from my Drexel desktop on campus

(screenshot of the connection error message)

jannettim commented 1 year ago

Oh right I need to add your computer to the firewall. Give me a moment.

jannettim commented 1 year ago

OK, @sjmelly, your work machine should be added to the firewall whitelist. Try again.
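
For reference, the rule can be added in the portal or with T-SQL against the master database (a sketch; the rule name and IP are placeholders):

EXECUTE sp_set_firewall_rule
    @name = N'sjmelly-campus-desktop',
    @start_ip_address = '203.0.113.7',
    @end_ip_address = '203.0.113.7';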

sjmelly commented 1 year ago

Now I'm getting this message

(screenshot of the new error message)

jannettim commented 1 year ago

It could be that the database had paused. Sometimes it takes a moment for the database to turn on (though Stevie was just on it, so it shouldn't have been down). But try once more and see if it works.

sjmelly commented 1 year ago

Still not working. I clicked on show details

(screenshot of the error details)

stfran22 commented 1 year ago

@sjmelly I just tried something Matt left me earlier. Try again!

jannettim commented 1 year ago

@stfran22 I don't think this is a database issue, I think it's a server problem. Usually that message just means that there was a lag in the server turning on so it couldn't find it. But it's weird, it's either shutting off super quick or not letting Steve on. @sjmelly if you still can't access it after Stevie's fix, do you have time to meet tomorrow? Maybe if I see what you are doing I can debug it a little more effectively? I should be free from 9-12 and from 1-3.

sjmelly commented 1 year ago

It's still not working, @jannettim. Either of those times works for me.

kam642 commented 1 year ago

@jannettim @stfran22 Ok to increase storage.

jannettim commented 1 year ago

@stfran22 @kam642 size of the database has increased

stfran22 commented 1 year ago

@jannettim short on space again. We now have 227,500,000 out of 252,088,222 records of the DunsMove (second to last) table uploaded.

jannettim commented 1 year ago

How much do you think you need?

cc: @kam642

stfran22 commented 1 year ago

@jannettim well, that leaves around 10% of that file, which should be about 1.3 GB, plus 6.75 GB for the tract-level file. Since there seems to be some other overhead, I'd just round up to 10 GB.

kam642 commented 1 year ago

@jannettim @stfran22 Ok to increase size.

stfran22 commented 1 year ago

I am looking into adding the BEDDN tract-level measures as a view (possibly a materialized view) in the NETS database. That way we won't have to import the entire tract-level dataset, since it is derived from data already in the database.
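
A rough sketch of what that view could look like (table and column names here are illustrative, the BEDDN category logic is omitted, and the join key is an assumption):

CREATE VIEW dbo.TractLevelMeasures AS
SELECT dl.ct10 AS tract, bi.Year AS year, COUNT_BIG(*) AS establishment_c
FROM dbo.BusinessInfo AS bi
JOIN dbo.DunsLocation AS dl
    ON dl.DunsYear = bi.DunsYear
GROUP BY dl.ct10, bi.Year;
-- Materializing it as an indexed view would additionally require WITH SCHEMABINDING
-- and a unique clustered index on the view.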

sjmelly commented 1 year ago

@stfran22 do DisplayX and DisplayY in DunsLocation only have 5 decimal places? That's how they appeared in my query. I usually recommend at least 6 decimal places for latitudes and longitudes.

stfran22 commented 1 year ago

Ah, I see, that's a limitation of the current data type for that column (REAL only keeps about 7 significant digits). We need to change it from REAL to FLOAT(53). I'll try that out. https://learn.microsoft.com/en-us/sql/t-sql/data-types/float-and-real-transact-sql?view=sql-server-ver16
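
The change I have in mind is roughly this (a sketch; a type change like this can be slow on a large table):

ALTER TABLE DunsLocation ALTER COLUMN DisplayX FLOAT(53);
ALTER TABLE DunsLocation ALTER COLUMN DisplayY FLOAT(53);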

sjmelly commented 1 year ago

We can use the geodatabase feature class saved to the server or the X and Y coordinates for this version, but it will be a good change to make in future versions

stfran22 commented 1 year ago

Tried to change the data type of one of the columns and the query took over an hour (canceled). We'll have to remember to make this change when creating the schema for the NETS 2022 database.

stfran22 commented 10 months ago

Closing this as we are focusing on NETS2022.