Agazoth / PSParquet

Read and write parquet files from the PowerShell commandline
MIT License
12 stars 0 forks source link

Is PsParquet able to mark a CSV column as a datetime object #7

Closed lboening closed 1 month ago

lboening commented 1 month ago

Describe the bug It's not a bug, but a question how to mark a column as a particular type

To Reproduce

  1. Create CSV file from Windows Event Logs (get-winevent)
  2. CSV column includes value called "TimeCreatedUtc"
  3. String value in CSV is "2024-01-01T12:15:00.123Z"
  4. Use Import-CSV to read file. Use PsObject as input for PsParquet
  5. Import Parquet file into Azure Data Explorer.
  6. Azure Data Explorer marks incoming column as string

Expected behavior PSParquet marks the type of the column as a datetime object.

Screenshots None.

PowerShell info

Desktop (please complete the following information):

Additional context None

lboening commented 1 month ago
PS D:\code\Powershell> $dt =get-date -Year 2024 -Month 8 -Day 4 -Hour 12 -Minute 12 -Second 12 -Millisecond 999
PS D:\code\Powershell> import-module -name psparquet
PS D:\code\Powershell> Export-Parquet -FilePath c:\alldata\dt.parquet -InputObject $dt

Operation failed: Object reference not set to an instance of an object.
WARNING: InputObjects contains unsupported values. Transform the data prior to running Export-Parquet.

I get this too: WARNING: InputObjects contains unsupported values. Transform the data prior to running Export-Parquet.

Agazoth commented 1 month ago

Hi Luke,

First off, thank you for trying out my module and thank you for reaching out.

The reason TimeCreatedUtc from Get-WinEvent is a string in your initial question is the export to csv you do. When you export typed data to csv most of the types get stripped off.

This example shows, how types are converted when exporting data to csv:

$data = 1..100 | foreach {
       [pscustomobject]@{
           Date        = (Get-Date).AddHours($_)
           Int32       = $_
           TypedInt    = [int]$_
           IntWithNull = (($_ % 3 -eq 0) ? $null : $_)
           Text        = "Iteration $_"
       }
   }

$data | gm

   TypeName: System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Date        NoteProperty datetime Date=21/09/2024 09.22.02
Int32       NoteProperty System.Int32 Int32=1
IntWithNull NoteProperty System.Int32 IntWithNull=1
Text        NoteProperty string Text=Iteration 1
TypedInt    NoteProperty int TypedInt=1
$data | Export-CSV C:\Temp\data.csv

PS> Import-CSV C:\Temp\data.csv | Get-Member

   TypeName: System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Date        NoteProperty string Date=21/09/2024 09.22.02
Int32       NoteProperty string Int32=1
IntWithNull NoteProperty string IntWithNull=1
Text        NoteProperty string Text=Iteration 1
TypedInt    NoteProperty string TypedInt=1

The same thing happens with the WinEvent object

$WinEvents = Get-WinEvent -LogName 'Windows PowerShell' -MaxEvents 5

PS> $WinEvents | Get-Member

   TypeName: System.Diagnostics.Eventing.Reader.EventLogRecord

Name                 MemberType   Definition
----                 ----------   ----------
Dispose              Method       void Dispose(), void IDisposable.Dispose()
Equals               Method       bool Equals(System.Object obj)
FormatDescription    Method       string FormatDescription(), string FormatDescription(System.Collections.Generic.IEnu…
GetHashCode          Method       int GetHashCode()
GetPropertyValues    Method       System.Collections.Generic.IList[System.Object] GetPropertyValues(System.Diagnostics…
GetType              Method       type GetType()
ToString             Method       string ToString()
ToXml                Method       string ToXml()
Message              NoteProperty string Message=Engine state is changed from None to Available. …
ActivityId           Property     System.Nullable[guid] ActivityId {get;}
Bookmark             Property     System.Diagnostics.Eventing.Reader.EventBookmark Bookmark {get;}
ContainerLog         Property     string ContainerLog {get;}
Id                   Property     int Id {get;}
Keywords             Property     System.Nullable[long] Keywords {get;}
KeywordsDisplayNames Property     System.Collections.Generic.IEnumerable[string] KeywordsDisplayNames {get;}
Level                Property     System.Nullable[byte] Level {get;}
LevelDisplayName     Property     string LevelDisplayName {get;}
LogName              Property     string LogName {get;}
MachineName          Property     string MachineName {get;}
MatchedQueryIds      Property     System.Collections.Generic.IEnumerable[int] MatchedQueryIds {get;}
Opcode               Property     System.Nullable[short] Opcode {get;}
OpcodeDisplayName    Property     string OpcodeDisplayName {get;}
ProcessId            Property     System.Nullable[int] ProcessId {get;}
Properties           Property     System.Collections.Generic.IList[System.Diagnostics.Eventing.Reader.EventProperty] P…
ProviderId           Property     System.Nullable[guid] ProviderId {get;}
ProviderName         Property     string ProviderName {get;}
Qualifiers           Property     System.Nullable[int] Qualifiers {get;}
RecordId             Property     System.Nullable[long] RecordId {get;}
RelatedActivityId    Property     System.Nullable[guid] RelatedActivityId {get;}
Task                 Property     System.Nullable[int] Task {get;}
TaskDisplayName      Property     string TaskDisplayName {get;}
ThreadId             Property     System.Nullable[int] ThreadId {get;}
TimeCreated          Property     System.Nullable[datetime] TimeCreated {get;}
UserId               Property     System.Security.Principal.SecurityIdentifier UserId {get;}
Version              Property     System.Nullable[byte] Version {get;}

The System.Diagnostics.Eventing.Reader.EventLogRecord contains all sorts of funky types, that will not survive the conversion to System.String:

PS> $WinEvents | ConvertTo-Csv | Get-Member

   TypeName: System.String

Name                 MemberType            Definition
----                 ----------            ----------
Clone                Method                System.Object Clone(), System.Object ICloneable.Clone()
CompareTo            Method                int CompareTo(System.Object value), int CompareTo(string strB), int ICompar…
Contains             Method                bool Contains(string value), bool Contains(string value, System.StringCompa…
CopyTo               Method                void CopyTo(int sourceIndex, char[] destination, int destinationIndex, int …
EndsWith             Method                bool EndsWith(string value), bool EndsWith(string value, System.StringCompa…
EnumerateRunes       Method                System.Text.StringRuneEnumerator EnumerateRunes()
Equals               Method                bool Equals(System.Object obj), bool Equals(string value), bool Equals(stri…
GetEnumerator        Method                System.CharEnumerator GetEnumerator(), System.Collections.IEnumerator IEnum…
GetHashCode          Method                int GetHashCode(), int GetHashCode(System.StringComparison comparisonType)
GetPinnableReference Method                System.Char&, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, Pub…
GetType              Method                type GetType()
GetTypeCode          Method                System.TypeCode GetTypeCode(), System.TypeCode IConvertible.GetTypeCode()
IndexOf              Method                int IndexOf(char value), int IndexOf(char value, int startIndex), int Index…
IndexOfAny           Method                int IndexOfAny(char[] anyOf), int IndexOfAny(char[] anyOf, int startIndex),…
Insert               Method                string Insert(int startIndex, string value)
IsNormalized         Method                bool IsNormalized(), bool IsNormalized(System.Text.NormalizationForm normal…
LastIndexOf          Method                int LastIndexOf(char value), int LastIndexOf(char value, int startIndex), i…
LastIndexOfAny       Method                int LastIndexOfAny(char[] anyOf), int LastIndexOfAny(char[] anyOf, int star…
Normalize            Method                string Normalize(), string Normalize(System.Text.NormalizationForm normaliz…
PadLeft              Method                string PadLeft(int totalWidth), string PadLeft(int totalWidth, char padding…
PadRight             Method                string PadRight(int totalWidth), string PadRight(int totalWidth, char paddi…
Remove               Method                string Remove(int startIndex, int count), string Remove(int startIndex)
Replace              Method                string Replace(string oldValue, string newValue, bool ignoreCase, culturein…
ReplaceLineEndings   Method                string ReplaceLineEndings(), string ReplaceLineEndings(string replacementTe…
Split                Method                string[] Split(char separator, System.StringSplitOptions options = System.S…
StartsWith           Method                bool StartsWith(string value), bool StartsWith(string value, System.StringC…
Substring            Method                string Substring(int startIndex), string Substring(int startIndex, int leng…
ToBoolean            Method                bool IConvertible.ToBoolean(System.IFormatProvider provider)
ToByte               Method                byte IConvertible.ToByte(System.IFormatProvider provider)
ToChar               Method                char IConvertible.ToChar(System.IFormatProvider provider)
ToCharArray          Method                char[] ToCharArray(), char[] ToCharArray(int startIndex, int length)
ToDateTime           Method                datetime IConvertible.ToDateTime(System.IFormatProvider provider)
ToDecimal            Method                decimal IConvertible.ToDecimal(System.IFormatProvider provider)
ToDouble             Method                double IConvertible.ToDouble(System.IFormatProvider provider)
ToInt16              Method                short IConvertible.ToInt16(System.IFormatProvider provider)
ToInt32              Method                int IConvertible.ToInt32(System.IFormatProvider provider)
ToInt64              Method                long IConvertible.ToInt64(System.IFormatProvider provider)
ToLower              Method                string ToLower(), string ToLower(cultureinfo culture)
ToLowerInvariant     Method                string ToLowerInvariant()
ToSByte              Method                sbyte IConvertible.ToSByte(System.IFormatProvider provider)
ToSingle             Method                float IConvertible.ToSingle(System.IFormatProvider provider)
ToString             Method                string ToString(), string ToString(System.IFormatProvider provider), string…
ToType               Method                System.Object IConvertible.ToType(type conversionType, System.IFormatProvid…
ToUInt16             Method                ushort IConvertible.ToUInt16(System.IFormatProvider provider)
ToUInt32             Method                uint IConvertible.ToUInt32(System.IFormatProvider provider)
ToUInt64             Method                ulong IConvertible.ToUInt64(System.IFormatProvider provider)
ToUpper              Method                string ToUpper(), string ToUpper(cultureinfo culture)
ToUpperInvariant     Method                string ToUpperInvariant()
Trim                 Method                string Trim(), string Trim(char trimChar), string Trim(Params char[] trimCh…
TrimEnd              Method                string TrimEnd(), string TrimEnd(char trimChar), string TrimEnd(Params char…
TrimStart            Method                string TrimStart(), string TrimStart(char trimChar), string TrimStart(Param…
TryCopyTo            Method                bool TryCopyTo(System.Span[char] destination)
Chars                ParameterizedProperty char Chars(int index) {get;}
Length               Property              int Length {get;}

When the CSV data is converted back to PSObjects (Import-Csv), PowerShell tries to match the nearest probable data type for each field, as can be seen here:

PS> ConvertFrom-Csv $CSVObjects | Get-Member

TypeName: System.Management.Automation.PSCustomObject

Name                 MemberType   Definition
----                 ----------   ----------
Equals               Method       bool Equals(System.Object obj)
GetHashCode          Method       int GetHashCode()
GetType              Method       type GetType()
ToString             Method       string ToString()
ActivityId           NoteProperty string ActivityId=
Bookmark             NoteProperty string Bookmark=System.Diagnostics.Eventing.Reader.EventBookmark
ContainerLog         NoteProperty string ContainerLog=Windows PowerShell
Id                   NoteProperty string Id=400
Keywords             NoteProperty string Keywords=36028797018963968
KeywordsDisplayNames NoteProperty string KeywordsDisplayNames=System.Collections.ObjectModel.ReadOnlyCollection`1[Syst…
Level                NoteProperty string Level=4
LevelDisplayName     NoteProperty string LevelDisplayName=Information
LogName              NoteProperty string LogName=Windows PowerShell
MachineName          NoteProperty string MachineName=Ax-Laptop6
MatchedQueryIds      NoteProperty string MatchedQueryIds=System.UInt32[]
Message              NoteProperty string Message=Engine state is changed from None to Available. …
Opcode               NoteProperty string Opcode=0
OpcodeDisplayName    NoteProperty string OpcodeDisplayName=Info
ProcessId            NoteProperty string ProcessId=20324
Properties           NoteProperty string Properties=System.Collections.Generic.List`1[System.Diagnostics.Eventing.Read…
ProviderId           NoteProperty string ProviderId=
ProviderName         NoteProperty string ProviderName=PowerShell
Qualifiers           NoteProperty string Qualifiers=0
RecordId             NoteProperty string RecordId=3437
RelatedActivityId    NoteProperty string RelatedActivityId=
Task                 NoteProperty string Task=4
TaskDisplayName      NoteProperty string TaskDisplayName=Engine Lifecycle
ThreadId             NoteProperty string ThreadId=0
TimeCreated          NoteProperty string TimeCreated=21/09/2024 07.38.33
UserId               NoteProperty string UserId=
Version              NoteProperty string Version=0

It is all these strings that gets exported to Parquet in your initail scenario.

The solution is to curate the data in another way before exporting them to parquet. Export-Parquet also has to do some type conversion, before data can be stored. I am using best effort on the most common data types, so if you stick to simple types, you should be good.

In the help documentation for Export-Parquet you find this example:

    $File = "C:\Temp\Test.parquet"
    $data = 1..100 | foreach {
        [pscustomobject]@{
            Date        = (Get-Date).AddHours($_)
            Int32       = $_
            TypedInt    = [int]$_
            IntWithNull = (($_ % 3 -eq 0) ? $null : $_)
            Text        = "Iteration $_"
        }
    }
    $data | Export-Parquet -FilePath $File -Force

When you import data from this example, the types will be the same as the exported types.

Getting your System.Diagnostics.Eventing.Reader.EventLogRecord converted to a PSCustomObject that has the correct types for proper storage is a Data Engineering task. You have chosen to store the System.Diagnostics.Eventing.Reader.EventLogRecord in parquet and now you need to turn that object into something that Parquet can handle.

First, you need to evaluate where in the data flow you want to intervene. Exporting the events to csv degredates the data tremendously, so you might want to choose a different format for step number 1. Export-CliXmlis capable of storing much more exact information about the System.Diagnostics.Eventing.Reader.EventLogRecord, but it also takes a lot more diskspace.

If you are allowed to install PowerShell (not Windows PowerShell) on the machine that runs the Get-WinEvent command, you can create the parquet files directly with very little effort.

First you need to get rid of the most irrelevant information from the System.Diagnostics.Eventing.Reader.EventLogRecord object. That can be done by using Select-Object like this

# Get all properties 
$SelectorString = $WinEvents | Get-Member -MemberType Property, NoteProperty | Select-Object -ExpandProperty Name | Join-String -Separator ','
# The selectorstring variable contains all the properties of the System.Diagnostics.Eventing.Reader.EventLogRecord object
PS> $SelectorString
Message,ActivityId,Bookmark,ContainerLog,Id,Keywords,KeywordsDisplayNames,Level,LevelDisplayName,LogName,MachineName,MatchedQueryIds,Opcode,OpcodeDisplayName,ProcessId,Properties,ProviderId,ProviderName,Qualifiers,RecordId,RelatedActivityId,Task,TaskDisplayName,ThreadId,TimeCreated,UserId,Version

If you do not need them all for your further work, you can just leave them out. The properties you are left with, are the properties you need to manitplate to fit the parquet format.

A short exampele shows the work, that needs to be done:

$SelectedParameters = $WinEvents | Select-Object Id,TimeCreated,Bookmark,Message

Now there is 4 data types to work with:

PS> $SelectedParameters | Get-Member

   TypeName: Selected.System.Diagnostics.Eventing.Reader.EventLogRecord

Name        MemberType   Definition
Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Bookmark    NoteProperty EventBookmark Bookmark=System.Diagnostics.Eventing.Reader.EventBookmark
Id          NoteProperty int Id=400
Message     NoteProperty string Message=Engine state is changed from None to Available. …
TimeCreated NoteProperty datetime TimeCreated=21/09/2024 07.38.33

We can try to export this to parquet straight away, but that will give us an error:

$SelectedParameters | Export-Parquet C:\Temp\SelectedParameters.parquet
Operation failed: type System.Object is not supported
WARNING: InputObjects contains unsupported values. Transform the data prior to running Export-Parquet.

And there is the task we need to attend to. Bookmark is an object and needs to be converted to something else before storing it in parquet.

The Data Engineer would first evaluate the contents of the Bookmark property. In this case, it looks like this:

PS> $SelectedParameters.Bookmark.BookmarkXml
<BookmarkList Direction='backward'>
  <Bookmark Channel='Windows PowerShell' RecordId='3437' IsCurrent='true'/>
</BookmarkList>

After having looked at a few more of the Bookmark parameter contents, the Data Engineer would decide, whether to extract some of the data in the object, keep the object type as a string or leave it out entirely.

Let's leave it out here, since the objective is to get the datetime data in parquet

$SelectedParameters = $WinEvents | Select-Object Id,TimeCreated,Message
$SelectedParameters | Export-Parquet C:\Temp\EventData.parquet

PS> Import-Parquet C:\Temp\EventData.parquet | Get-Member

TypeName: System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Id          NoteProperty int Id=400
Message     NoteProperty string Message=Engine state is changed from None to Available. …
TimeCreated NoteProperty datetime TimeCreated=21/09/2024 05.38.33

Now the data can be exported to parquet and re-importing the data gives us the right type

Data Engineering is a valid and sometimes well paid occupation. Remember to tell your employer ;-)

Agazoth commented 1 month ago

@lboening thanks to your question, I found a bug in the module. That has been fixed now. If you update the module, you can do something like this:

$SelectedParameters = $WinEvents | Select-Object Id,TimeCreated,Message,@{n='Bookmark';e={$_.Bookmark | Convertto-json -Compress}}
$SelectedParameters | Export-Parquet C:\Temp\EventData.parquet

It is still required to transform complexe objects to simple objects, but with this fix you can convert almost any object to a json string. This hopefully makes your Data Engineer work easier.

lboening commented 1 month ago

Are you supposed to bump this:

Version number of this module.

ModuleVersion = '0.2.14'

I have been using Microsoft.PowerShell.PsResourceGet

PS C:\> find-psresource -Repository PsGallery -Type Module -name PsParquet

Name      Version Prerelease Repository Description
----      ------- ---------- ---------- -----------
PSParquet 0.2.14             PSGallery  Import and export objects to parquet format
Agazoth commented 1 month ago

Yes, that is the new version: https://www.powershellgallery.com/packages/PSParquet/0.2.14

Agazoth commented 1 month ago

@lboening this runs directly to parquet:

Get-WinEvent -LogName 'Windows PowerShell' | Select  -ExcludeProperty Bookmark, UserId, ActivityId,Properties,RelatedActivityId, ProviderId, KeywordsDisplayNames,MatchedQueryIds *,@{n="Bookmark"; e={$_.Bookmark.ToString()}},@{n="UserId"; e={$_.UserId | ConvertTo-Json}} ,@{n="ActivityId"; e={$_.ActivityId | ConvertTo-Json}},@{n="Properties"; e={$_.Properties | ConvertTo-Json}},@{n="RelatedActivityId"; e={$_.RelatedActivityId | ConvertTo-Json}},@{n="ProviderId"; e={$_.ProviderId | ConvertTo-Json}},@{n="KeywordsDisplayNames"; e={$_.KeywordsDisplayNames.ToString()}},@{n="MatchedQueryIds"; e={$_.MatchedQueryIds.ToString()}} | Export-Parquet c:\temp\test.parquet -Force
lboening commented 1 month ago
PS C:\> Get-WinEvent -LogName 'Windows PowerShell' | Select  -ExcludeProperty Bookmark, UserId, ActivityId,Properties,RelatedActivityId, ProviderId, KeywordsDisplayNames,MatchedQueryIds *,@{n="Bookmark"; e={$_.Bookmark.ToString()}},@{n="UserId"; e={$_.UserId | ConvertTo-Json}} ,@{n="ActivityId"; e={$_.ActivityId | ConvertTo-Json}},@{n="Properties"; e={$_.Properties | ConvertTo-Json}},@{n="RelatedActivityId"; e={$_.RelatedActivityId | ConvertTo-Json}},@{n="ProviderId"; e={$_.ProviderId | ConvertTo-Json}},@{n="KeywordsDisplayNames"; e={$_.KeywordsDisplayNames.ToString()}},@{n="MatchedQueryIds"; e={$_.MatchedQueryIds.ToString()}} | Export-Parquet c:\temp\test.parquet -Force
Operation failed: The input string 'System.Diagnostics.Eventing.Reader.EventBookmark' was not in a correct format.
WARNING: InputObjects contains unsupported values. Transform the data prior to running Export-Parquet.
PS C:\>

It ran better when I ran PowerShell as admin.

lboening commented 1 month ago

An error for the BookMark. Forgive but I have seen the "bookmark" and I do not appreciate it. Is this a valid link to an explanation: https://learn.microsoft.com/en-us/windows/win32/wes/bookmarking-events

Agazoth commented 1 month ago

Bookmark is in XML format. Turning it to a string with ToString() should leave you with the type name. In effect, you will only save the type name to parquet (like when you evport to csv). If you do not need the Bookmark data, you can leave it out entirely. If you do need it, you should write code to extract the relevant information from it.

Agazoth commented 1 month ago
PS C:\> Get-WinEvent -LogName 'Windows PowerShell' | Select  -ExcludeProperty Bookmark, UserId, ActivityId,Properties,RelatedActivityId, ProviderId, KeywordsDisplayNames,MatchedQueryIds *,@{n="Bookmark"; e={$_.Bookmark.ToString()}},@{n="UserId"; e={$_.UserId | ConvertTo-Json}} ,@{n="ActivityId"; e={$_.ActivityId | ConvertTo-Json}},@{n="Properties"; e={$_.Properties | ConvertTo-Json}},@{n="RelatedActivityId"; e={$_.RelatedActivityId | ConvertTo-Json}},@{n="ProviderId"; e={$_.ProviderId | ConvertTo-Json}},@{n="KeywordsDisplayNames"; e={$_.KeywordsDisplayNames.ToString()}},@{n="MatchedQueryIds"; e={$_.MatchedQueryIds.ToString()}} | Export-Parquet c:\temp\test.parquet -Force
Operation failed: The input string 'System.Diagnostics.Eventing.Reader.EventBookmark' was not in a correct format.
WARNING: InputObjects contains unsupported values. Transform the data prior to running Export-Parquet.
PS C:\>

It ran better when I ran PowerShell as admin.

I think the new runspace you got when runnig as admin did it.

Agazoth commented 1 month ago

Closing this. Thanks for trying the module.