dataplat / dbatools

🚀 SQL Server automation and instance migrations have never been safer, faster or freer
https://dbatools.io
MIT License

Import-DbaCsv: Allow static column mappings for storing/tagging metadata of files imported. #6676

Open sk280 opened 4 years ago

sk280 commented 4 years ago

Summary of new feature

  1. When bulk importing multiple CSV files into tables, we need a way to store metadata for each file being loaded. Currently there seems to be no way to do this.

Example: Given a CSV file like so:

    TradeID, TradeDate, Symbol
    1234, 20200712, EURUSD      ;Loaded from 20200723_trades.csv
    4567, 20200703, AUDUSD      ;Loaded from 20200710_trades.csv

and a target table like so:

    TradeID, TradeDate, Symbol, FileDate, FileName
    1234, '2020-07-12', 'EURUSD', '2020-07-23', '20200723_trades.csv'
    4567, '2020-07-03', 'AUDUSD', '2020-07-10', '20200710_trades.csv'

In the above example, I'd like to be able to extract file metadata and set it in columns that do not exist in the CSV file (e.g. set in the PowerShell script invoking Import-DbaCsv).

Proposed technical details (if applicable)

The syntax would be similar to:

    .PARAMETER StaticColumnMap
    Key/value pairs for inserting static data into every row.

Example:

    PS C:\> $columns = @{
        FileName = $csvFile.FileName
        FileDate = $csvFile.FileName.Split('_')[0]
    }
    PS C:\> Import-DbaCsv -Path c:\temp\supersmall.csv -SqlInstance sql2016 -Database tempdb -StaticColumnMap $columns -Table $csvFile.TableName

In the above example, the data for the FileName and FileDate columns comes from the script and NOT from the CSV file.

Latest version of dbatools as of writing

The latest version available from PSGallery as of 2020-07-23 (commit e0fb7d4).

MarkIannucci commented 4 years ago

@sk280 In thinking about how we might implement this functionality: if one takes an exclusive table lock, we could change the default values on the columns in question to the static values specified in the map, load the data, and then change the defaults back to what they were before the load.

That approach feels like a bit of a hack, especially as I think about the problems with various data types, but is one that would work for our use case.
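
For concreteness, a minimal sketch of that default-swap idea, assuming a dbo.Trades target table and using Invoke-DbaQuery for the ALTER statements; the table, column, and constraint names are illustrative only, and the original defaults would of course need to be restored rather than simply dropped if they existed before:

    # Sketch only: add a temporary default, bulk load, then remove it again
    $target = @{ SqlInstance = 'sql2016'; Database = 'tempdb' }

    # Rows inserted without a FileName value will pick up this default
    Invoke-DbaQuery @target -Query "ALTER TABLE dbo.Trades ADD CONSTRAINT DF_Trades_FileName DEFAULT ('20200723_trades.csv') FOR FileName"

    # Bulk load; FileName is not in the CSV, so the default fills it in
    Import-DbaCsv -Path C:\temp\20200723_trades.csv @target -Table Trades

    # Remove the temporary default
    Invoke-DbaQuery @target -Query "ALTER TABLE dbo.Trades DROP CONSTRAINT DF_Trades_FileName"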

@potatoqualitee, would you accept a PR to this cmdlet which utilizes that approach? If so, I think this is work we could take on.

nvarscar commented 4 years ago

Altering the table does seem like a hack, and I'm not sure it's acceptable for the function to modify the definition of the table. There is something that can be done here, though, I think. If I understand correctly, the need is to add custom columns that use calculated values. That reminds me of the Select-Object functionality where you can specify custom column definitions using a hashtable:

Select-Object *, @{n='NewColumn'; e={$_.Name.Split('.')[0]}}

https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/select-object?view=powershell-7#example-10--create-calculated-properties-for-each-inputobject The parameter (DerivedColumn? CalculatedColumn?) would accept an array of hashtables that would consequently be passed on to Select-Object, which receives data from the CSV object. That way you don't need to modify the table or reinvent an already existing piece of functionality.
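
As a rough illustration of that calculated-column shape (this uses plain Import-Csv rather than the cmdlet's internal reader, so it only shows the Select-Object idea, not how Import-DbaCsv works today):

    # Derive static columns from the file itself and append them to every row
    $csvFile = Get-Item C:\temp\20200723_trades.csv
    Import-Csv -Path $csvFile.FullName |
        Select-Object *,
            @{ n = 'FileName'; e = { $csvFile.Name } },
            @{ n = 'FileDate'; e = { $csvFile.Name.Split('_')[0] } }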

MarkIannucci commented 4 years ago

@nvarscar I'm glad I'm not the only person who thinks my suggestion is a hack! Thank you for taking a look and offering another approach. The Select-Object approach is a superior one, if we can get it to work.

Unfortunately, I'm having trouble seeing how to implement your approach (which I gladly would) with the way the code is currently written because it uses LumenWorks.Framework.IO.Csv.CsvReader and I'm not sure how we can merge the Select-Object functionality you suggest into it.

That said, it is certainly worth exploration. I'll do a bit more analysis and see if I can get it to work and respond with my findings.

nvarscar commented 4 years ago

I should've checked the source code before making suggestions. I never expected a third-party framework to take care of the import. So scratch that, because it seems there's another way. I'm not sure which version of LumenWorks we use here, but this particular fork supports adding custom column definitions to the CSV object: https://github.com/phatcher/CsvReader. If you check the example with custom columns, you can see that the columns there are defined with custom names. More importantly, the column object definition suggests that columns can have a DefaultValue, and this is how those default values are returned. Perhaps that could help to build additional column definitions here. AutoCreateTable would need some love, though.

sk280 commented 4 years ago

Thanks all. I did have a look at the source code before creating the feature request.

As Mark alluded to, the third-party framework is tightly coupled to the implementation, with no room to pass metadata. In my view, relying on a forked version of a third-party framework is no less of a hack: it makes the module brittle and adds more coupling to external dependencies.

I also explored dynamic properties in PowerShell so I could transform the file before invoking this module. However, this was neither easy nor efficient: the new columns with the static data had to become regular cells in the file pre-import (creating a new file on the fly before invoking this module is inefficient and slow when importing huge files with millions of rows), thus defeating the purpose of using this module: simplicity.

Given the nature of SqlBulkCopy, we'll need to either use some streaming capability or change to a different, more feature-rich CSV framework that is less tightly coupled to this module.

> Unfortunately, I'm having trouble seeing how to implement your approach (which I gladly would) with the way the code is currently written because it uses LumenWorks.Framework.IO.Csv.CsvReader and I'm not sure how we can merge the Select-Object functionality you suggest into it.

nvarscar commented 4 years ago

File metadata can be gathered prior to using Import-DbaCsv and passed as a CalculatedColumn, as long as that functionality is supported. And it just might be, but that would require looking into the LumenWorks CsvReader implementation.
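
For example (the -CalculatedColumn parameter here is purely hypothetical and does not exist in the cmdlet yet):

    foreach ($csvFile in Get-ChildItem C:\temp\*_trades.csv) {
        # Metadata gathered up front from the file itself
        $calculated = @{
            FileName = $csvFile.Name
            FileDate = $csvFile.Name.Split('_')[0]
        }
        # Hypothetical parameter, shown for illustration only:
        # Import-DbaCsv -Path $csvFile.FullName -SqlInstance sql2016 -Database tempdb -Table Trades -CalculatedColumn $calculated
    }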

sk280 commented 4 years ago

Yes, I did try that but came to a stop when I saw the third-party implementation. I'll also look to see whether that framework has a means of accepting additional context, in which case this module can become a pass-through.

sk280 commented 4 years ago

@nvarscar It appears the fork is a drop-in replacement that does support static mapping using default columns. It looks promising to use this version of the CsvReader by passing a StaticColumnMap, something like the below:

    // Fix up the column defaults with the values we need
    reader.UseColumnDefaults = true;
    foreach (KeyValuePair<string, string> kvp in StaticColMap)
    {
        reader.Columns[reader.GetFieldIndex(kvp.Key)].DefaultValue = kvp.Value;
    }
    // ...

    // Now use SQL Bulk Copy to move the data
    using (var sbc = new SqlBulkCopy(connectionString))
    {
        // ...
        sbc.ColumnMappings.Add("Ticker", "Ticker");

        sbc.WriteToServer(reader);
    }

I'll see if I can submit a PR using the forked version of Lumenworks.Framework.IO you linked above.

wsmelton commented 4 years ago

We use the latest release of the package from https://www.nuget.org/packages/LumenWorksCsvReader. That is the same version you will find in the linked repository, phatcher/CsvReader.

sk280 commented 4 years ago

Fantastic, thanks for clarifying @wsmelton. I'll work towards submitting a PR for it then.

wsmelton commented 4 years ago

Curious, but could this not be done from the SqlBulkCopy side of the code? Why not just add a parameter like -Metadata that accepts a hash of column names and default values to apply to the SqlBulkCopy object?

wsmelton commented 4 years ago

I'd note the change also needs to handle creating the table with the metadata columns...not just dealing with a table that already exists. So if the CSV column mapping is passed in along with the additional metadata (default values), and the user includes -AutoCreateTable, then we would need to account for those columns when we create the given target table.

sk280 commented 4 years ago

Sure, I'll explore the option of adding the parameter to the SqlBulkCopy object. Yes, I have considered and accounted for -AutoCreateTable.

sk280 commented 4 years ago

Hi @wsmelton - I attempted this today and found that SqlBulkCopy does not provide an option to set a default column value; the public API does not expose this. I really did not want to use an alternative framework, as it was too much work with no guarantee of the scale that the current implementation promises.

However, inspecting the current code and making changes for AutoCreateTable made me realise that, since a transaction is already in place, all I had to do was an UPDATE after inserting blanks for the new columns. AutoCreateTable would create the new columns in the StaticColumnMap. These columns would NEVER be referenced in the ColumnMap, hence no mapping between the file headers and the target table.

So an UPDATE statement after the SqlBulkCopy WriteToServer, iterating through the columns and setting the default values from the StaticColumnMap, ensured that minimal data was sent over the wire (just the UPDATE statement for the default values). This works perfectly. I still need to do more tests, but initial results seem promising, and the approach is simple as well. When I have had a chance to do more tests, I'll submit a PR. For now, the changes are summarised below for the benefit of others.

New-SqlTable is called like so:

    New-SqlTable -Path $file -StaticColumns $StaticColumnMap.Keys -Delimiter $Delimiter -FirstRowHeader $FirstRowHeader -SqlConn $sqlconn -Transaction $transaction

  1. The new columns are created as part of AutoCreateTable like so:
            # Get SQL datatypes by best guess on first data row
            $sqldatatypes = @();
            #------NEW CODE BEGIN--------
            # Append static columns
            foreach ($column in $StaticColumns) {
                $columns += $column
            }
            #------NEW CODE END--------
            foreach ($column in $Columns) {
                $sqldatatypes += "[$column] varchar(MAX)"
            } 
  2. Then, after WriteToServer, the following code updates the table:

            $reader.Close()
            $reader.Dispose()

            #------NEW CODE BEGIN--------
            # Generate the StaticColumnMap updates
            $sqlColDefaultValues = @()

            if ($StaticColumnMap) {
                foreach ($columnname in $StaticColumnMap) {
                    foreach ($key in $columnname.Keys) {
                        $val = $columnname[$key]
                        $sqlColDefaultValues += $key + " = '" + $val + "'"
                    }
                }

                if ($PSCmdlet.ShouldProcess($instance, "Performing Static column value UPDATE TABLE [$schema].[$table] on $Database")) {
                    $sql = "UPDATE [$schema].[$table] SET $($sqlColDefaultValues -join ' ,')"
                    $sqlcmd = New-Object System.Data.SqlClient.SqlCommand($sql, $sqlconn, $transaction)
                    try {
                        $null = $sqlcmd.ExecuteNonQuery()
                    } catch {
                        $errormessage = $_.Exception.Message.ToString()
                        Stop-Function -Continue -Message "Failed to execute $sql. `nDid you specify the proper delimiter? `n$errormessage"
                    }
                }
            }
            #------NEW CODE END--------
            $completed = $true

nvarscar commented 4 years ago

That assumes that the table was empty prior to running this command. What if it wasn't?

wsmelton commented 4 years ago

That would require documentation that clearly states the functionality.

The main reason is that processes can be running under a user that only has insert rights on a table and not update. So we just need to make it clear that the functionality requires update access on the given table as well.

sk280 commented 4 years ago

@nvarscar and @wsmelton - Sure, I have considered all the above; that's why I have not submitted a PR yet. I need to understand the implications for all current functionality and its switches.

I needed something urgently and SqlBulkCopy mappings were not possible, so I did the above. When I'm satisfied with the results I will submit a PR.

Meanwhile, if the core team considers this issue important and implements it, that'll be fine by me.

sk280 commented 4 years ago

@nvarscar - One of the things I am considering is to separate 1) the -AutoCreateTable add-static-columns functionality from 2) the update-static-column-values functionality.

This can be achieved with two switches to keep the concerns decoupled. It also offers the flexibility of creating a column automatically but updating it externally (say, after the call to Import-DbaCsv).

This will also ensure the update switch can fail gracefully when its permission requirements are not met.

Obviously, appending data to an existing table will break this in the absence of a BatchIdentifier of some kind to selectively update only the rows in the current batch with the static values.

Ideally, if SqlBulkCopy offered a way to set a default column value, these issues would not pose a problem. This requires further investigation, though.
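
Purely as an illustration (and not part of the proposal above), one crude way to scope the post-import UPDATE when the target table already contains rows is to touch only rows whose static columns are still NULL after the bulk copy. Table and column names below are hypothetical, it assumes the same $sqlconn/$transaction context as the snippet above, and it still misbehaves if pre-existing rows contain NULLs in those columns:

    # Only rows inserted by this batch still have NULL static columns
    $sql = "UPDATE [dbo].[Trades] SET FileName = '20200723_trades.csv', FileDate = '2020-07-23' WHERE FileName IS NULL"
    $sqlcmd = New-Object System.Data.SqlClient.SqlCommand($sql, $sqlconn, $transaction)
    $null = $sqlcmd.ExecuteNonQuery()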

wsmelton commented 4 years ago

I have to ask now that it has been brought up: if you are doing an update statement why can't you (as a user) just do that outside of our command within your own process?

Update functionality, to me, is out of scope for an import command.

sk280 commented 4 years ago

@wsmelton - Because the current code already has the full setup initialised and ready; only this functionality is missing. Often I'm required to bulk import multiple large files in a single go and need a way to identify the filename and other associated metadata in the imported data.

Yes, I can do the update externally, but I would have to reinvent the wheel at the time of import.

Like I said earlier, whilst it would be nice to have this feature as part of the core functionality, if it's too much work or not a high enough priority, I can live with my forked code.

The only reason for suggesting this feature is the absolute simplicity of this module: it lets you bulk import and transfer multiple files on the go in the fastest time possible, without having to use DTS or SSIS.

Also, in my searches I can see a number of users requesting this functionality, so it'd be great if it could be part of the core.

wsmelton commented 4 years ago

If users have requested this please add those issue numbers here both for reference and if those users want to join the conversation.

> Yes, I can do the update externally, but I would have to reinvent the wheel at the time of import.

Your initial proposal does an update. So are you going to change this to find a way to do it at import instead of a second call for the update?

sk280 commented 4 years ago

I just meant that a number of users on the internet, in forums and on Stack Overflow, have been wanting this functionality.

I meant that if I have to avoid UPDATE but still retain SqlBulkCopy for its transfer speeds, it either has to be an alternative CSV framework that implements this feature (please see https://github.com/phatcher/CsvReader/issues/70#issuecomment-539126790), an alternative to SqlBulkCopy, or perhaps a DataTable used as a middle man to add derived columns before calling SqlBulkCopy. These were the options I had on my list; the DataTable option is sketched below.
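
A rough sketch of the DataTable option (illustrative table and connection names; note this materialises the whole file in memory, so it gives up the streaming behaviour of the current implementation):

    $csvFile = Get-Item C:\temp\20200723_trades.csv
    $rows = Import-Csv -Path $csvFile.FullName

    # Build a DataTable with the CSV columns plus the derived/static columns
    $dataTable = New-Object System.Data.DataTable
    foreach ($name in $rows[0].PSObject.Properties.Name) { $null = $dataTable.Columns.Add($name) }
    $null = $dataTable.Columns.Add('FileName')
    $null = $dataTable.Columns.Add('FileDate')

    foreach ($row in $rows) {
        $newRow = $dataTable.NewRow()
        foreach ($prop in $row.PSObject.Properties) { $newRow[$prop.Name] = $prop.Value }
        $newRow['FileName'] = $csvFile.Name
        $newRow['FileDate'] = $csvFile.Name.Split('_')[0]
        $dataTable.Rows.Add($newRow)
    }

    # Hand the in-memory table to SqlBulkCopy
    $bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy('Server=sql2016;Database=tempdb;Integrated Security=True')
    $bulkCopy.DestinationTableName = 'dbo.Trades'
    $bulkCopy.WriteToServer($dataTable)
    $bulkCopy.Close()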

At the end of the day, for what I needed urgently, the above changes using an UPDATE solved it for me, and I can see myself using this for the foreseeable future, as this was the only functionality I have been missing. I quite love this module; it has saved me heaps of time whilst being so simple to use.

potatoqualitee commented 3 years ago

This looks really cool, but it's late and I'm not fully grasping what is needed. Can you please check out this framework and let me know whether it supports this? https://www.nuget.org/packages/CsvHelper/

That one is currently maintained, unlike phatcher's, and I'm hoping to switch to it when I've got the time.

wsmelton commented 3 years ago

I use that library in another project; it's very easy to get going, but I haven't gone in-depth with it yet.

https://github.com/thycotic-ps/thycotic.secretserver/blob/291317b288c167bd38ca177dfe77b4f93d008b45/src/Thycotic.SecretServer/cmdlets/logging/StartTssLogCmdlet.cs#L86