Open zilto opened 3 months ago
I can make the PR for the changes, just let me know what direction to go!
Hey @zilto good point! I think it was done this way so that we have an incremental field. But a string could also be incremental, so I agree with your idea of converting both to strings.
Thanks a lot for the suggestion to create a PR, this would be wonderful! Just put me as a reviewer :)
I started making these changes to open a PR, but I encountered another challenge.
In a few places, we have dlt.sources.incremental() throwing an exception because it can't compare ts of type str and created_at of type DateTime.
created_at: dlt.sources.incremental[DateTime] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),
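To illustrate the kind of failure described above, here is a minimal sketch in plain Python. It does not call dlt; the values `row_ts` and `cursor` and the helper `can_compare` are made up for illustration, and the actual exception raised inside dlt may be worded differently.

```python
from datetime import datetime, timezone

# Hypothetical values: a Slack-style "ts" string, and a datetime cursor
# playing the role of the start_dt passed as initial_value.
row_ts = "1700000000.000100"
cursor = datetime(2023, 11, 1, tzinfo=timezone.utc)


def can_compare(a, b) -> bool:
    """Return True if `a < b` can be evaluated without a TypeError."""
    try:
        a < b
        return True
    except TypeError:
        return False


# A str and a datetime have no defined ordering, so this comparison fails.
print(can_compare(row_ts, cursor))
# Two strings compare fine.
print(can_compare(row_ts, "1698796800.000000"))
```

Whichever type is chosen, the cursor values and the row values need to share it.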
I envision 2 solutions:
1. Pass MSG_DATETIME_FIELDS to get_thread_replies() to make ts and thread_ts consistently a timestamp everywhere.
2. Add a step (dlt.transformer()?) to change the type of ts and thread_ts to str after comparisons for incremental loading are done.
I could implement solution 1, but would need some additional guidance for solution 2. Let me know how to proceed.
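The two directions discussed above can be sketched as a pair of conversions between a Slack-style "seconds.microseconds" string and a timezone-aware datetime. This is a standalone sketch, not the slack source's code: the names `ts_to_datetime` and `datetime_to_ts` are hypothetical, and the string format is assumed to be a Unix timestamp with up to six fractional digits. Integer arithmetic is used on purpose so that the microsecond part, which distinguishes messages sent in the same second, is not subject to float rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ts_to_datetime(ts: str) -> datetime:
    """Solution 1 direction: parse a 'seconds.microseconds' string."""
    seconds, _, fraction = ts.partition(".")
    micros = int((fraction + "000000")[:6])  # pad or truncate to 6 digits
    return EPOCH + timedelta(seconds=int(seconds), microseconds=micros)


def datetime_to_ts(dt: datetime) -> str:
    """Solution 2 direction: render an aware datetime back as a ts string."""
    delta = dt - EPOCH  # requires a timezone-aware datetime
    seconds = delta.days * 86400 + delta.seconds
    return f"{seconds}.{delta.microseconds:06d}"


parsed = ts_to_datetime("1700000000.000100")
print(parsed.isoformat())
print(datetime_to_ts(parsed))
```

Under these assumptions the round trip is lossless, so either representation could be used for the comparison and the other for storage.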
Hi @zilto! I think what we can do is declare the variable with the str type right away:
created_at: dlt.sources.incremental[str] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),
Or make everything a timestamp, as you suggested in solution 1.
If you have some code ready, could you open a PR? Or if you don't have time to continue with this, just let me know and we'll assign someone to this issue!
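As a rough check of the idea that a string cursor can still be incremental, the sketch below uses made-up Slack-style ts values and plain Python string comparison. It assumes the values have the form "seconds.microseconds" with the same number of digits on each side of the dot. With a str-typed cursor, initial_value and end_value would presumably also need to be strings in that format; that is an assumption here, not something confirmed in this thread.

```python
# Hypothetical ts values, deliberately out of order.
ts_values = ["1700000300.000200", "1700000000.000100", "1700000300.000100"]

# With equal-width digit strings, lexicographic order matches numeric order,
# so max() picks the most recent value and sorted() is chronological.
latest = max(ts_values)
in_order = sorted(ts_values)
print(latest)
print(in_order)

# Caveat: the property breaks when the integer parts differ in length.
shorter, longer = "999999999.000000", "1000000000.000000"
string_order_ok = shorter < longer
print(string_order_ok)
```

Unix timestamps have had ten-digit second counts since 2001 and will until 2286, so the caveat should not matter for real Slack data, but it is the reason the equal-width assumption is stated explicitly.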
dlt version
0.4.7
Source name
slack
Describe the problem
The values returned by get_messages() and get_thread_replies() don't have the same data types for the fields ts and thread_ts: they are returned as timestamp by the former and as string by the latter.
This is problematic when trying to join tables of messages and replies based on their thread_ts (thread id), which is a very common operation.
This is because get_messages() passes datetime_fields=MSG_DATETIME_FIELDS whereas get_thread_replies() doesn't.
Expected behavior
ts and thread_ts should both receive the same type from MSG_DATETIME_FIELDS.
More importantly, according to Slack specs, ts and thread_ts are not timestamps and string is actually the proper type. (see ref)
Given that ts and thread_ts do not exactly represent a timestamp but rather are unique ids that can be sorted chronologically, I suggest just removing them from the default values of MSG_DATETIME_FIELDS.
This would be a breaking change for the message tables, but not for the replies tables, so it would be the right time to introduce the change to defaults if accepted.
Steps to reproduce
dlt init slack
How you are using the source?
I run this source for fun.
Operating system
Linux
Runtime environment
Local
Python version
3.10.9
dlt destination
duckdb
Additional information
As a workaround, I manually change the type of ts and thread_ts of messages from timestamp to string.
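The effect of the type mismatch on a join, and of converting the message side to strings, can be sketched with in-memory rows. The rows and the helpers `join_on_thread_ts` and `thread_ts_as_string` are hypothetical and only mimic the situation described in this issue; they are not how the tables are actually converted in the destination.

```python
import calendar
from datetime import datetime, timezone

# Hypothetical rows: messages carry a datetime, replies carry a string.
messages = [
    {"thread_ts": datetime(2023, 11, 14, 22, 13, 20, 100, tzinfo=timezone.utc),
     "text": "parent"},
]
replies = [{"thread_ts": "1700000000.000100", "text": "reply"}]


def join_on_thread_ts(messages, replies):
    """Inner join of replies to messages on thread_ts."""
    by_thread = {m["thread_ts"]: m for m in messages}
    return [
        (by_thread[r["thread_ts"]]["text"], r["text"])
        for r in replies
        if r["thread_ts"] in by_thread
    ]


def thread_ts_as_string(row):
    """Return a copy of the row with thread_ts as 'seconds.microseconds'."""
    dt = row["thread_ts"]
    seconds = calendar.timegm(dt.utctimetuple())  # whole seconds, in UTC
    return {**row, "thread_ts": f"{seconds}.{dt.microsecond:06d}"}


mismatched = join_on_thread_ts(messages, replies)
fixed = join_on_thread_ts([thread_ts_as_string(m) for m in messages], replies)
print(mismatched)
print(fixed)
```

With mixed key types the join finds no matches; once both sides use the string form, the reply is matched to its parent message.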