Closed avirajsingh7 closed 2 weeks ago
@airbytehq/dev-marketplace-contributions can someone take a look at this issue?
I have the same issue. My substream runs for every item on the first page of the parent stream, but not for items on later pages.
Mentioned on Community Slack: https://airbytehq-team.slack.com/archives/C021JANJ6TY/p1720192514494299
We're looking, but no ETA yet.
@Stockotaco @avirajsingh7 I have a hunch — does it matter if the substream OR the parent stream are full refresh or incremental? If incremental is possible and you switch to incremental, does the problem go away?
In my case both parent and sub-streams are full refresh syncs, not incremental.
Incremental won't work in my case.
@natikgadzhi both are full refresh. #40573 is the same issue; it looks like a bug.
Oh, I think we have a fix for this here: https://github.com/airbytehq/airbyte/pull/40671
@ChristoGrab, can you please test Builder locally with CDK with this patch applied, and confirm this works for Builder as well? If yes, also review and approve @brianjlai's pull request, and let's make sure to ship it on Monday.
@natikgadzhi @brianjlai this doesn't seem to be fixed in 0.64.8.
There's a pagination limit and a record filter to help with the issue, but some of the data randomly doesn't get synced. I assume the filter is applied after the pagination?
request_parameters:
  count: '100'
  active: 'true'
  organization: '{{ config["organization_url"] }}'
...
record_filter:
  type: RecordFilter
  condition: '{{ record["profile"]["type"] == "Team" }}'
Also, I have a feeling that the child stream always gets synced first. Not sure how that is possible without syncing the parent stream.
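To make the "filter after pagination" question concrete, here is a minimal plain-Python sketch. It assumes the filter runs client-side on each page after the page is fetched, which is the reading suggested above; it is not CDK code, and `fetch_pages` and `read_filtered` are hypothetical names. Under that assumption, a page requested with `count: '100'` can yield far fewer than 100 records after filtering, while pagination itself continues normally.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Record = Dict[str, object]


def fetch_pages(records: List[Record], page_size: int) -> Iterator[List[Record]]:
    """Hypothetical stand-in for a paginated API: yields fixed-size pages."""
    for start in range(0, len(records), page_size):
        yield records[start:start + page_size]


def read_filtered(
    records: List[Record],
    page_size: int,
    keep: Callable[[Record], bool],
) -> Tuple[List[Record], List[int]]:
    """Filter each page after it is fetched; pagination does not depend on the filter."""
    kept: List[Record] = []
    kept_per_page: List[int] = []
    for page in fetch_pages(records, page_size):
        matches = [r for r in page if keep(r)]
        kept_per_page.append(len(matches))
        kept.extend(matches)
    return kept, kept_per_page


# Fake data: every third record has a "Team" profile.
data = [
    {"id": i, "profile": {"type": "Team" if i % 3 == 0 else "User"}}
    for i in range(250)
]
kept, kept_per_page = read_filtered(
    data, 100, lambda r: r["profile"]["type"] == "Team"
)
print(kept_per_page, len(kept))
```

If the filter were instead applied before pagination (server-side), every page except the last would contain the full page size of matching records, so comparing per-page counts is one way to tell the two cases apart.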
Try again — released fresh Cloud Builder update yesterday AFAIK.
@natikgadzhi thanks a bunch, it works correctly now!
Phew! You’re very welcome ;-)
I'm still seeing this issue with Airbyte version 0.63.9. I have four sub-streams that all pull data from the same parent stream. During synchronization, the sub-streams only retrieve records from the first page of the parent stream. However, when syncing the parent stream directly, it successfully fetches records from all pages. It seems that the sub-streams are limited to retrieving only the first page's worth of data.
I'm not sure if I am missing something in my configurations. Here is my manifest.yml:
version: 3.8.2
type: DeclarativeSource
check:
  type: CheckStream
  stream_names:
    - end_users
definitions:
  streams:
    end_users:
      type: DeclarativeStream
      name: end_users
      primary_key:
        - end_user_id
      retriever:
        type: SimpleRetriever
        requester:
          $ref: '#/definitions/base_requester'
          path: api/end_users
          http_method: GET
          request_parameters:
            limit: '100'
            order_by: last_updated_desc
          error_handler:
            type: CompositeErrorHandler
            error_handlers:
              - type: DefaultErrorHandler
                max_retries: 10
                backoff_strategies:
                  - type: ExponentialBackoffStrategy
                    factor: 2
        record_selector:
          type: RecordSelector
          extractor:
            type: DpathExtractor
            field_path:
              - end_users
        paginator:
          type: DefaultPaginator
          page_token_option:
            type: RequestOption
            inject_into: request_parameter
            field_name: page
          pagination_strategy:
            type: PageIncrement
            start_from_page: 1
      transformations:
        - type: RemoveFields
          field_pointers:
            - - profit_and_loss_layout
      schema_loader:
        type: InlineSchemaLoader
        schema:
          $ref: '#/schemas/end_users'
    profit_and_loss:
      type: DeclarativeStream
      name: profit_and_loss
      primary_key:
        - end_user_id
      retriever:
        type: SimpleRetriever
        requester:
          $ref: '#/definitions/base_requester'
          path: >-
            api/end_users/{{ stream_partition.end_user_id_or_heron_id
            }}/profit_and_loss
          http_method: GET
          error_handler:
            type: CompositeErrorHandler
            error_handlers:
              - type: DefaultErrorHandler
                max_retries: 10
                backoff_strategies:
                  - type: ExponentialBackoffStrategy
                    factor: 2
        record_selector:
          type: RecordSelector
          extractor:
            type: DpathExtractor
            field_path: []
        partition_router:
          type: SubstreamPartitionRouter
          parent_stream_configs:
            - type: ParentStreamConfig
              parent_key: end_user_id
              partition_field: end_user_id_or_heron_id
              stream:
                $ref: '#/definitions/streams/end_users'
      transformations:
        - type: AddFields
          fields:
            - path:
                - end_user_id
              value: '{{ stream_partition.end_user_id_or_heron_id }}'
      schema_loader:
        type: InlineSchemaLoader
        schema:
          $ref: '#/schemas/profit_and_loss'
    bank_statement_summary_by_month:
      type: DeclarativeStream
      name: bank_statement_summary_by_month
      primary_key:
        - end_user_id
      retriever:
        type: SimpleRetriever
        requester:
          $ref: '#/definitions/base_requester'
          path: >-
            api/end_users/{{ stream_partition.end_user_id_or_heron_id
            }}/bank_statement_summary
          http_method: GET
          request_parameters:
            grouping: by_month
          error_handler:
            type: CompositeErrorHandler
            error_handlers:
              - type: DefaultErrorHandler
                max_retries: 10
                backoff_strategies:
                  - type: ExponentialBackoffStrategy
                    factor: 2
        record_selector:
          type: RecordSelector
          extractor:
            type: DpathExtractor
            field_path: []
        partition_router:
          type: SubstreamPartitionRouter
          parent_stream_configs:
            - type: ParentStreamConfig
              parent_key: end_user_id
              partition_field: end_user_id_or_heron_id
              stream:
                $ref: '#/definitions/streams/end_users'
      transformations:
        - type: AddFields
          fields:
            - path:
                - end_user_id
              value: '{{ stream_partition.end_user_id_or_heron_id }}'
        - type: AddFields
          fields:
            - path:
                - grouping
              value: '{{ record[''grouping''] }}'
        - type: AddFields
          fields:
            - path:
                - months
              value: '{{ record[''by_month''] }}'
        - type: RemoveFields
          field_pointers:
            - - average
        - type: RemoveFields
          field_pointers:
            - - total
        - type: RemoveFields
          field_pointers:
            - - by_month
      schema_loader:
        type: InlineSchemaLoader
        schema:
          $ref: '#/schemas/bank_statement_summary_by_month'
    scorecard_metrics:
      type: DeclarativeStream
      name: scorecard_metrics
      retriever:
        type: SimpleRetriever
        requester:
          $ref: '#/definitions/base_requester'
          path: >-
            api/end_users/{{ stream_partition.end_user_id_or_heron_id
            }}/scorecard
          http_method: GET
          error_handler:
            type: CompositeErrorHandler
            error_handlers:
              - type: DefaultErrorHandler
                max_retries: 10
                backoff_strategies:
                  - type: ExponentialBackoffStrategy
                    factor: 2
              - type: DefaultErrorHandler
                response_filters:
                  - type: HttpResponseFilter
                    action: IGNORE
                    predicate: '{{ response.code == 400 }}'
                    http_codes:
                      - 400
                    error_message: End user hasn't been successfully enriched yet, skip
                    error_message_contains: End user hasn't been successfully enriched yet
        record_selector:
          type: RecordSelector
          extractor:
            type: DpathExtractor
            field_path:
              - metrics
        partition_router:
          type: SubstreamPartitionRouter
          parent_stream_configs:
            - type: ParentStreamConfig
              parent_key: end_user_id
              partition_field: end_user_id_or_heron_id
              stream:
                $ref: '#/definitions/streams/end_users'
      transformations:
        - type: AddFields
          fields:
            - path:
                - end_user_id
              value: '{{ stream_partition.end_user_id_or_heron_id }}'
      schema_loader:
        type: InlineSchemaLoader
        schema:
          $ref: '#/schemas/scorecard_metrics'
    bank_statement_summary_by_data_source:
      type: DeclarativeStream
      name: bank_statement_summary_by_data_source
      primary_key:
        - end_user_id
      retriever:
        type: SimpleRetriever
        requester:
          $ref: '#/definitions/base_requester'
          path: >-
            api/end_users/{{ stream_partition.end_user_id_or_heron_id
            }}/bank_statement_summary
          http_method: GET
          request_parameters:
            grouping: by_data_source_account_heron_id
          error_handler:
            type: CompositeErrorHandler
            error_handlers:
              - type: DefaultErrorHandler
                max_retries: 10
                backoff_strategies:
                  - type: ExponentialBackoffStrategy
                    factor: 2
        record_selector:
          type: RecordSelector
          extractor:
            type: DpathExtractor
            field_path: []
        partition_router:
          type: SubstreamPartitionRouter
          parent_stream_configs:
            - type: ParentStreamConfig
              parent_key: end_user_id
              partition_field: end_user_id_or_heron_id
              stream:
                $ref: '#/definitions/streams/end_users'
      transformations:
        - type: AddFields
          fields:
            - path:
                - end_user_id
              value: '{{ stream_partition.end_user_id_or_heron_id }}'
        - type: AddFields
          fields:
            - path:
                - grouping
              value: '{{ record[''grouping''] }}'
        - type: AddFields
          fields:
            - path:
                - data_sources
              value: '{{ record[''by_data_source_account_heron_id''] }}'
        - type: RemoveFields
          field_pointers:
            - - average
        - type: RemoveFields
          field_pointers:
            - - total
        - type: RemoveFields
          field_pointers:
            - - by_data_source_account_heron_id
      schema_loader:
        type: InlineSchemaLoader
        schema:
          $ref: '#/schemas/bank_statement_summary_by_data_source'
  base_requester:
    type: HttpRequester
    url_base: https://app.herondata.io
    authenticator:
      type: BasicHttpAuthenticator
      password: '{{ config["password"] }}'
      username: '{{ config["username"] }}'
streams:
  - $ref: '#/definitions/streams/end_users'
  - $ref: '#/definitions/streams/profit_and_loss'
  - $ref: '#/definitions/streams/bank_statement_summary_by_month'
  - $ref: '#/definitions/streams/scorecard_metrics'
  - $ref: '#/definitions/streams/bank_statement_summary_by_data_source'
spec:
  type: Spec
  connection_specification:
    type: object
    $schema: http://json-schema.org/draft-07/schema#
    required:
      - username
    properties:
      username:
        type: string
        order: 0
        title: Username
      password:
        type: string
        order: 1
        title: Password
        always_show: true
        airbyte_secret: true
    additionalProperties: true
metadata:
  autoImportSchema:
    end_users: false
    profit_and_loss: false
    bank_statement_summary_by_month: false
    scorecard_metrics: true
    bank_statement_summary_by_data_source: false
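For readers following along, the behavior this manifest expects from its parent/substream setup can be sketched in plain Python. This is a toy simulation, not CDK code: `fetch_parent_page`, `read_parent`, and `partitions` are hypothetical names, and the fake data and page size are made up. The idea is that the substream should get one partition per `end_users` record across all pages, whereas the reported bug behaves like the `first_page_only=True` branch.

```python
from typing import Dict, Iterator, List

PAGE_SIZE = 2  # made-up page size for the simulation
FAKE_END_USERS = [{"end_user_id": f"u{i}"} for i in range(5)]


def fetch_parent_page(page: int) -> List[Dict[str, str]]:
    """Hypothetical stand-in for GET api/end_users?page=N (pages start at 1)."""
    start = (page - 1) * PAGE_SIZE
    return FAKE_END_USERS[start:start + PAGE_SIZE]


def read_parent(first_page_only: bool = False) -> Iterator[Dict[str, str]]:
    """Yield parent records page by page until an empty page is returned."""
    page = 1
    while True:
        records = fetch_parent_page(page)
        if not records:
            return
        yield from records
        if first_page_only:
            # Mimics the reported symptom: stop after the first page.
            return
        page += 1


def partitions(first_page_only: bool = False) -> List[Dict[str, str]]:
    """Build one substream partition per parent record (parent_key -> partition_field)."""
    return [
        {"end_user_id_or_heron_id": record["end_user_id"]}
        for record in read_parent(first_page_only)
    ]


expected = partitions()
buggy = partitions(first_page_only=True)
print(len(expected), len(buggy))
```

Each partition then fills the `{{ stream_partition.end_user_id_or_heron_id }}` placeholder in a substream path, so a short partition list directly means missing substream requests.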
I wonder whether re-releasing the connector after the upgrade could help? Somehow it didn't work for me on the first try either.
@killthekitten weird as it is, if you have an OSS / Cloud connector published to your workspace, try republishing, and tell me if that worked.
Re-releasing the connector did the trick! Thank you for the tip!
Topic
Connector-Builder-Ui
Relevant information
We're encountering an issue with source connectors configured through the UI: substreams only make API calls for records fetched from the first page of their parent streams. Even if the parent stream has a large number of records (e.g., 23k in this case), the substream only processes the parent IDs from the first page of the parent stream (e.g., 546 records).
I have verified the records in the DB: the last substream record corresponds to the last parent record fetched by the first API call.
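The check described above can be scripted roughly as follows. This is a minimal sketch with hypothetical helper names; it assumes you can export the parent IDs in the order the parent stream fetched them, along with the parent IDs that appear in the substream's records.

```python
from typing import Hashable, Iterable, List, Sequence


def missing_parent_ids(
    parent_ids: Sequence[Hashable],
    substream_parent_ids: Iterable[Hashable],
) -> List[Hashable]:
    """Return parent IDs that never appear in the substream records."""
    seen = set(substream_parent_ids)
    return [pid for pid in parent_ids if pid not in seen]


def looks_like_first_page_only(
    parent_ids: Sequence[Hashable],
    substream_parent_ids: Iterable[Hashable],
) -> bool:
    """True when the substream covers only a leading prefix of the parent records.

    Assumes parent_ids is in fetch order, so a covered prefix followed by
    nothing but uncovered IDs matches the "first page only" symptom.
    """
    seen = set(substream_parent_ids)
    covered = [pid in seen for pid in parent_ids]
    covered_count = sum(covered)
    return 0 < covered_count < len(parent_ids) and all(covered[:covered_count])


# Toy data: 10 parent records, substream only saw the first 3 (one twice).
parents = list(range(10))
substream = [0, 1, 2, 2]
print(missing_parent_ids(parents, substream))
print(looks_like_first_page_only(parents, substream))
```

With real data, the number of covered IDs should match the parent stream's first-page size if this is the same bug.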
Here's the manifest.yaml file. I have configured it to emit a record for each API call in the substream (for debugging purposes).