billgraziano / xelogstash

Send SQL Server Extended Events to Logstash, Elastic Search, or JSON
Other
26 stars 11 forks source link

invalid value: domain: on sql server running on linux/ubuntu server:2019-GA-ubuntu-16.04 #62

Closed Mohkam closed 3 years ago

Mohkam commented 3 years ago

when trying to get extended event from ubuntu based sql server in kubernetes , i get error invalid value domain...

root@MYPOD_NAME:/mnt/artifacts/sqlxewriter/linux# ./sqlxewriter
INFO[0000] start: version: dev; git: dev; build:
INFO[0000] config file: /mnt/artifacts/sqlxewriter/linux/sqlxewriter.toml
INFO[0000] default poll interval: 60s
INFO[0000] pprof available at http://localhost:6061/debug/pprof
INFO[0000] expvars available at http://localhost:6061/debug/vars
INFO[0000] metrics available at http://localhost:6061/debug/metrics
INFO[0000] sources: 1; default rows: 100
INFO[0000] sink: files: /mnt/artifacts/sqlxewriter/linux/events/sqlevents_YYYYMMDD_HH.events (keep: 24h0m0s)
INFO[0000] sink: elastic: http://10.10.10.1:9200
INFO[0000] 10.10.10.100: polling interval: 60s           source=10.10.10.100
ERRO[0000] skipping duplicate: fqdn: 10.10.10.100; domain: ; server: azwu2v1-dev-rr-: invalid value: domain: ; instance: azwu2v1-dev-rr-;
INFO[0000] metrics: alloc: 5.3mb; sys: 71.1mb; goroutines: 9; uptime: 0m0s; sources: 1
INFO[0000] watching configuration file: /mnt/artifacts/sqlxewriter/linux/sqlxewriter.toml
INFO[0000] stop signal received
INFO[0000] all sinks closed
INFO[0000] application stopping

Here is my sample SQL pod deployment


apiVersion: apps/v1
kind: Deployment
metadata:
  name: mypodname
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mypodname
  template:
    metadata:
     labels:
        app: mypodname
    spec:
      hostname: 
      terminationGracePeriodSeconds: 10
      containers:
      - name: mypodname
        image: mcr.microsoft.com/mssql/server:2019-GA-ubuntu-16.04
        resources:
          limits:
            cpu: "7"
            memory: 48Gi
          requests:
            cpu: "0.2"
            memory: 1Gi
        ports:
        - containerPort: 1433
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          allowPrivilegeEscalation: true
        env:
        - name: MSSQL_PID
          value: "Developer"
        - name: ACCEPT_EULA
          value: "Y"
        - name: MSSQL_SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD
        - name: MSSQL_BACKUP_DIR
          value: "/usr/share/sqlbackup" 
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql
        - name: mssqlbackupdir
          mountPath: /usr/share/sqlbackup 
        - name: reefreview-dev-dapi-artifacts-storage
          mountPath: "/mnt/artifacts"
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: mypodname-pvc
      - name: mssqlbackupdir
        persistentVolumeClaim:
            claimName: rrwu2v1-sqlbackup-pvc
      - name: reefreview-dev-dapi-artifacts-storage
        azureFile:
          secretName: rrwu2v1lcnsa1-fileshare-secret
          shareName: reefreview-drive-api-storage\artifacts
          readOnly: false
      nodeSelector: 
              #"beta.kubernetes.io/os": linux
              "agentpool": sqlpool1
      tolerations:
      - key: "pod"
        operator: "Equal"
        value: "mssql"
        effect: "NoSchedule"        

---

apiVersion: v1
kind: Service
metadata:
  name: mypodname-svc
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "rrwu2v1-kubernetes-cluster-subnet"
spec:
  selector:
    app: mypodname
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer

Would you please guide if there is anyway to override domain name, since sql is running in linux. Or any help on how i can use it for linux based sql servers as well?

Thank you so much for this amazing effort.

Mohkam commented 3 years ago

image image

billgraziano commented 3 years ago

Can you post your sqlxewriter.toml (and please clean up any passwords)?

Can you also post the result of this query:

SELECT  @@SERVERNAME AS [ServerName]
    ,COALESCE(DEFAULT_DOMAIN(), '') AS [DomainName]
    ,COALESCE(CAST(SERVERPROPERTY('MachineName') as nvarchar(128)), @@SERVERNAME) AS [Computer]
    ,CAST(COALESCE(SERVERPROPERTY('ProductLevel'), '') as nvarchar(128)) AS ProductLevel
    ,COALESCE(CAST(SERVERPROPERTY('ProductMajorVersion') as NVARCHAR(128))  + '.' + CAST(SERVERPROPERTY('ProductMinorVersion') as NVARCHAR(128)),'') AS ProductRelease
    ,COALESCE(CAST(SERVERPROPERTY('ProductVersion') AS NVARCHAR(128)), '') as [ProductVersion];
billgraziano commented 3 years ago

Ah, I see the issue. Give me just a minute. I don't have a great Linux test environment.

Mohkam commented 3 years ago

Thank you so much! it returns nothing for domain,

image

Mohkam commented 3 years ago

My TOML:

[elastic]
addresses = ["http://X:X:x:x:9200/"]
username = "xxxx"
password = "xxxxxxx"

auto_create_indexes = true
default_index = "dev-sql-extended-events"

[filesink]
retain_hours = 4000

[app]
http_metrics = true
http_metrics_port = 6061
watch_config = true

# adds = [    "global.host:$(HOST)",  # $(VAR) must be upper case.  the resulting value is lower case.
#             "global.app:$(EXE)" 
#             ]
# copies = [  "@timestamp:newtimestamp" ]
# moves = [ ] 

[defaults]
sessions = ["system_health", "AlwaysOn_health"]
look_back = "1680000h" # one week
rows = 20000 # Process batches of 20000 at a time.  a good compromise.  0 or omitted processes all the rows

timestamp_field_name = "@timestamp"
payload_field_name = "mssql" # all the XE events are under this field

agentjobs = "all" # (all|failed|none) - process SQL Server agent job history
excludedEvents = [
    "connectivity_ring_buffer_recorded",
    "memory_broker_ring_buffer_recorded",
    "scheduler_monitor_system_health_ring_buffer_recorded",
    "security_error_ring_buffer_recorded"
    ]

[[source]]
fqdn = "x:x:x:x"
user = "sa"
password = "xxxxxx"
billgraziano commented 3 years ago

I just pushed a fix. It looks like you are building locally? If not, I can also try to deploy a new binary -- but that may be a little later.

Mohkam commented 3 years ago

I am building locally, i will share the results shortly.

Mohkam commented 3 years ago

I am sorry if I am requesting too much. I am facing same error but on second and onward instances. I believe the reason is that Server name of SQL (as on windows deployment) truncates to 15 characters (NetBIOS name limit). even on Linux :( . Would it be possible to add an optional tie breaker in TOML config and make it part of duplicate detection logic, something like:

[[source]]
fqdn = "x:x:x:x"
UniqueID = "201"

so its unique identity in duplicate logic would be CurrentID(ServerName maybe) +UniqueID

My Tomel (pasting delta wrt previous message) :

[[source]]
fqdn = "x:x:x:201"
user = "sa"
password = "xxxxxx"
[[source]]
fqdn = "x:x:x:202"
user = "sa"
password = "xxxxxx"
[[source]]
fqdn = "x:x:x:203"
user = "sa"
password = "xxxxxx"

LOG:

TRAC[15552] app.program.startpolling...
DEBU[15552] desired config file: D:\GitHub\xelogstash\deploy\windows\sqlxewriter\sqlxewriter.toml
DEBU[15552] checking sources file: D:\GitHub\xelogstash\deploy\windows\sqlxewriter\sqlxewriter_sources.toml
INFO[15552] config file: D:\GitHub\xelogstash\deploy\windows\sqlxewriter\sqlxewriter.toml
INFO[15552] default poll interval: 60s
INFO[15552] pprof available at http://localhost:6061/debug/pprof
INFO[15552] expvars available at http://localhost:6061/debug/vars
INFO[15552] metrics available at http://localhost:6061/debug/metrics
INFO[15552] sources: 4; default rows: 100
INFO[15553] sink: files: D:\GitHub\xelogstash\deploy\windows\sqlxewriter\events\sqlevents_YYYYMMDD_HH.events (keep: 240h0m0s)
INFO[15553] sink: elastic: http://10.10.10.207:9200
TRAC[15554] loop: true
TRAC[15554] poll routine launched: windows-service        source=10.10.10.201
INFO[15554] 10.10.10.201: polling interval: 60s          source=10.10.10.201
TRAC[15554] poll routine launched: windows-service        source=10.10.10.203
TRAC[15554] poll routine launched: windows-service        source=10.10.10.204
TRAC[15554] poll routine launched: windows-service        source=10.10.10.202
TRAC[15560] source: 10.10.10.201; polling (#1)...        source=10.10.10.201
TRAC[15560] entering ProcessSource                        source=10.10.10.201
TRAC[15560] source: 10.10.10.201;  sessions: 3;  events after: 2019-05-30T01:10:32+05:00  source=10.10.10.201
DEBU[15560] user: sa                                      source=10.10.10.201
TRAC[15566] tpsj-dev-sj-mss: sys.messages for login_failed: 37  instance=tpsj-dev-sj-mss source=10.10.10.201
TRAC[15566] 10.10.10.201: starting: session: system_health  instance=tpsj-dev-sj-mss session=system_health source=10.10.10.201
INFO[15568] [0] Source: tpsj-dev-sj-mss (system_health);  'Start At' skipped at least one event
DEBU[15568] tpsj-dev-sj-mss () session: system_health; events: 4; per_second: 1  duration_ms=2241 events=4 events_per_sec=1 instance=tpsj-dev-sj-mss session=system_health source=10.10.10.201
TRAC[15568] 10.10.10.201: starting: session: activity_tracking  instance=tpsj-dev-sj-mss session=activity_tracking source=10.10.10.201
INFO[15569] 10.10.10.202: polling interval: 60s          source=10.10.10.202
TRAC[15569] 10.10.10.201: starting: session: telemetry_xevents  instance=tpsj-dev-sj-mss session=telemetry_xevents source=10.10.10.201
ERRO[15569] source: 10.10.10.201 () - XE - telemetry_xevents : validatesession: no file target  instance=tpsj-dev-sj-mss session=telemetry_xevents source=10.10.10.201
DEBU[15571] tpsj-dev-sj-mss () session: agent_jobs; events: 0  duration_ms=2511 events=0 events_per_sec=0 instance=tpsj-dev-sj-mss session=agent_jobs source=10.10.10.201
ERRO[15571] instance: tpsj-dev-sj-mss; err: errors occurred - see previous  source=10.10.10.201
ERRO[15575] skipping duplicate: fqdn: '10.10.10.202'; domain: ''; server: 'tpsj-dev-sj-mss': duplicate domain-instance
INFO[15584] 10.10.10.203: polling interval: 60s          source=10.10.10.203
ERRO[15590] skipping duplicate: fqdn: '10.10.10.203'; domain: ''; server: 'tpsj-dev-sj-mss': duplicate domain-instance
INFO[15599] 10.10.10.204: polling interval: 60s          source=10.10.10.204
ERRO[15606] skipping duplicate: fqdn: '10.10.10.204'; domain: ''; server: 'tpsj-dev-sj-mss': duplicate domain-instance
INFO[15614] metrics: alloc: 5.0mb; sys: 24.7mb; goroutines: 21; uptime: 4h20m; sources: 4

Details that are specific to my case, i am not sure if i should mention it:

billgraziano commented 3 years ago

Well that is annoying. But good job figuring out what it is doing 👍

I'm not sure it's that simple. I use the results of @@SERVERNAME to populate quite a few fields in there. And without that, it will be hard to separate these boxes. Let's see what data we do have. Can you run the following two queries on one or two of the servers and post the results?

SELECT COALESCE(serverproperty('InstanceName'), '<NULL>') AS InstanceName
    , COALESCE(serverproperty('MachineName') , '<NULL>') AS MachineName
    , COALESCE(serverproperty('ServerName') , '<NULL>') AS ServerName

select [name] from sys.servers where server_id = 0 
Mohkam commented 3 years ago

Please excuse my ignorance of the matter. Here are snapshots of 2 of them. Although all 4 of my test env show the same. image image

billgraziano commented 3 years ago

Your screen shots were perfect. I just upload new code that supports the following. For each source, you can define two additional fields:

I don't think you need to set the domain name override. But the server name override will allow you set the value that SQL Server returns as the server name. That should give you enough flexibility for most any situation. Let me know if that works.

Mohkam commented 3 years ago

Thank you so much - server_name_override has helped me out. Now that i have it running for me. I will go through the extracted events.

billgraziano commented 3 years ago

Great! I'm glad that worked for you.