dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source
Apache License 2.0
50 stars 40 forks source link

Slack `ts` and `thread_ts` inconsistent types #412

Open zilto opened 3 months ago

zilto commented 3 months ago

dlt version

0.4.7

Source name

slack

Describe the problem

Values by get_messages() and get_thread_replies() don't return the same data types for field ts and thread_ts. Values are returned as timestamp for the first and string for the latter.

This is problematic when trying to join tables of messages and replies based on their thread_ts (thread id), which is a very common operation.

This is because get_messages() passes datetime_fields=MSG_DATETIME_FIELDS whereas get_thread_replies() doesn't.

Expected behavior

  1. ts and thread_ts should both receive the same type from MSG_DATETIME_FIELDS

  2. More importantly, according to Slack specs, ts and thread_ts are not timestamps and string is actually the proper type. (see ref)

There are a few additional fields that describe the author (such as user or bot_id), but there's also an additional ts field. The ts value is essentially the ID of the message, guaranteed unique within the context of a channel or conversation.

They look like UNIX/epoch timestamps, hence ts, with specified milliseconds. But they're actually message IDs, even if they're partially composed in seconds-since-the-epoch.

Given ts and thread_ts do not exactly represent a timestamp but rather are unique ids that can be sorted chronologically, I just removing them from the default values of MSG_DATETIME_FIELDS.

This would be a breaking change for the message tables, but not for replies tables, so it would the right time to introduce the change to defaults if accepted.

Steps to reproduce

dlt init slack

How you are using the source?

I run this source for fun.

Operating system

Linux

Runtime environment

Local

Python version

3.10.9

dlt destination

duckdb

Additional information

As a solution, I manually change type of ts and thread_ts of messages from timestamp to string

zilto commented 2 months ago

I can make the PR for the changes, just let me know what direction to go!

VioletM commented 2 months ago

Hey @zilto good point! I think the reason it was done this way -- so we have an incremental field. But string could also be incremental, so I agree with your idea of converting both to strings.

Thanks a lot for the suggestion to create a PR, this would be wonderful! Just put me as a reviewer :)

zilto commented 2 months ago

I started making these changed to open a PR, but I encountered another challenge.

In a few places, we have dlt.sources.incremental() throwing an exception because it can't compare ts of type str and created_at of type DateTime.

 created_at: dlt.sources.incremental[DateTime] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),

line

I envision 2 solutions:

  1. pass MSG_DATETIME_FIELDS to get_thread_replies() to make everything ts and thread_ts consistently a timestamp.
  2. do post-processing (dlt.transformer()?) to change the type of ts and thread_ts to str after comparisons for incremental loading are done.

I could implement solution 1, but would need some additional guidance for solution 2. Let me know how to proceed

VioletM commented 1 month ago

Hi @zilto! I think what we can do -- is to create the variable in str type right away:

created_at: dlt.sources.incremental[str] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),

Or make everything a timestamp, as you've suggested in 1 solution.

If you have some code ready -- could you open a PR? Or if you don't have a time to continue with this, just let me know, we'll assign someone on this issue!