Closed bazarnov closed 2 years ago
@alexandr-shegeda Please take a look; this sounds like a minor bug.
We are experiencing the same bug here with BQ destination. It's generating a massive amount of logs which is throttling the MinIO deployment.
Same with Postgres (connectors/destination/postgres).
During the investigation, I found that the current bug is not related to either Python or Java connectors. The code that triggers the issue is located in airbyte-workers: it looks like io/airbyte/validation/json/JsonSchemaValidator.java uses a library that has a regexp pattern for date-time nodes.
The dependency tree is as follows:
airbyte-workers -> airbyte-json-validation -> com.networknt:json-schema-validator:1.0.42
The library is declared here: https://github.com/airbytehq/airbyte/blob/9d1cd42ff9f3118e2312ea9c94ad647f1baaad73/airbyte-json-validation/build.gradle#L6-L6
And this is the line where the JsonSchemaFactory instance is created:
The regexp used by com.networknt:json-schema-validator:1.0.42 is:
private static final Pattern RFC3339_PATTERN = Pattern.compile(
"^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
+ "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T'HH:mm:ss.milliseconds
+ "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");
And the validation code looks as follows:
// Validate the format
if (!matcher.matches()) {
logger.error("Failed to apply RFC3339 pattern on " + string);
return false;
}
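To make the failure concrete, here is a minimal, self-contained sketch (the class and method names are mine, not Airbyte's or the library's) that applies the same regex to the two kinds of strings discussed in this thread:

```java
import java.util.regex.Pattern;

public class Rfc3339PatternDemo {
    // Same pattern as in com.networknt:json-schema-validator:1.0.42 (quoted above).
    static final Pattern RFC3339_PATTERN = Pattern.compile(
        "^(\\d{4})-(\\d{2})-(\\d{2})"                      // yyyy-MM-dd
        + "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?"    // 'T'HH:mm:ss.fraction
        + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");             // offset

    static boolean isLegalDateTime(String s) {
        return RFC3339_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLegalDateTime("2022-05-16T19:13:09.369Z"));   // true
        System.out.println(isLegalDateTime("2022-05-16 19:13:09.369000")); // false: the
        // time group requires a 'T'/'t' separator, so a space-separated value never matches
    }
}
```

Any string with a space between date and time falls into the `!matcher.matches()` branch above and triggers the ERROR log.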
This validation doesn't affect the sync. I was able to successfully sync the following sources and destinations:
The only annoying thing here is this ERROR message, which is misleading from the client's point of view.
The library provides a way to modify the JSON meta schema for our needs: https://github.com/networknt/json-schema-validator/blob/master/doc/validators.md. I believe we can replace the constructor with something like:
public JsonSchemaValidator() {
  this.schemaValidatorsConfig = new SchemaValidatorsConfig();
  final String metaSchemaUri = "https://json-schema.org/draft/2019-09/schema";
  final JsonMetaSchema overrideDateTimeMetaSchema = new JsonMetaSchema.Builder(metaSchemaUri)
      .idKeyword("$id")
      .addFormat(new PatternFormat("date-time",
          "^(\\d{4})-(\\d{2})-(\\d{2}) ((\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?([Zz]|([+-])(\\d{2}):?(\\d{2}))?"))
      .build();
  this.jsonSchemaFactory = new JsonSchemaFactory.Builder()
      .addMetaSchema(overrideDateTimeMetaSchema)
      .build();
}
@grishick what team is responsible for the airbyte-workers module? Could you address the issue, please? Thanks in advance.
CC: @bazarnov, @alexandr-shegeda
In addition to what's already said above: I tested source-amplitude v0.1.7 with base-normalization - and it works as expected. No issues occurred, all target tables are populated correctly, and date-time columns are full of the expected data, but in Airbyte UI > logs we keep seeing this error about Failed to apply RFC3339 pattern on...
Could we try to adapt the existing pattern:
private static final Pattern RFC3339_PATTERN = Pattern.compile(
"^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
+ "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T'HH:mm:ss.milliseconds
+ "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");
to also validate date-time strings in the 2022-05-16 19:13:09.369000 format? Is this going to be a big change?
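For illustration, a minimal sketch of what such an adapted pattern might look like (this is an assumption on my part, not an actual proposed change): broadening the separator class from [Tt] to [Tt ] lets the space-separated variant match while keeping the original 'T'-separated behavior.

```java
import java.util.regex.Pattern;

public class AdaptedPatternSketch {
    // Hypothetical variant of the library's pattern: the separator class
    // [Tt ] additionally accepts a single space between date and time.
    static final Pattern RELAXED_PATTERN = Pattern.compile(
        "^(\\d{4})-(\\d{2})-(\\d{2})"                       // yyyy-MM-dd
        + "([Tt ](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?"    // 'T', 't', or ' ' separator
        + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");              // optional offset

    static boolean matchesRelaxed(String s) {
        return RELAXED_PATTERN.matcher(s).matches();
    }
}
```

With this variant, both 2022-05-16T19:13:09.369Z and 2022-05-16 19:13:09.369000 match; note, though, that later comments in this thread argue against loosening the validator.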
The example date-time string which causes the issue: 2022-05-16 19:13:09.369000. This is the usual RFC3339 format; however, destination S3 handles it incorrectly.
@bazarnov It seems like date-times without a timezone offset are not valid in RFC3339, so I don't think updating the regex would be appropriate. See https://datatracker.ietf.org/doc/html/rfc3339#section-5.6 for the definitions, specifically:
date-time = full-date "T" full-time
full-time = partial-time time-offset
time-offset = "Z" / time-numoffset
While there's a note that the "T" can be replaced by a space character, the time-offset should still be specified. I believe this is the main reason why the Amplitude datetimes are coming back as invalid (no offset).
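The offset requirement can be illustrated with the JDK's java.time types, which enforce the same distinction (the helper below and the example strings are mine, just a sketch):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

public class OffsetCheck {
    // Accepts only date-times that carry a zone offset (e.g. 'Z' or +02:00),
    // mirroring RFC3339's rule that full-time must include time-offset.
    static boolean hasRequiredOffset(String s) {
        try {
            OffsetDateTime.parse(s); // ISO_OFFSET_DATE_TIME: 'T' separator + offset
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasRequiredOffset("2022-05-16T19:13:09.369Z"));   // true
        System.out.println(hasRequiredOffset("2022-05-16T19:13:09.369"));    // false: no offset
        System.out.println(hasRequiredOffset("2022-05-16 19:13:09.369000")); // false: space + no offset
    }
}
```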
https://github.com/airbytehq/airbyte/pull/13351 will make it so that the logs aren't spammed with these messages, but Amplitude should still be addressed.
Looks like there are three issues here
@alexandertsukanov since you already found the place in the code where to update the regex - go ahead and make the change and add me as reviewer to the PR.
As far as I can see from briefly perusing Amplitude docs, we may need to fix how we represent Amplitude schemas:
- in annotations.json the date field is in YYYY-MM-DD format, but our schema says it's date-time
- in average_session_length.json the date field is also YYYY-MM-DD, but we have "format": "date-time"
- in events.json the timestamps are in ISO-8601 format rather than RFC3339, yet we represent them as "format": "date-time", which validates them against RFC3339
@pedroslopez could you please open a bug for the Amplitude connector to represent date and ISO-8601 fields as dates rather than date-time? The solution may be just to remove "format": "date-time" from the schema.
@grishick I don't think we want to do number 1 (update the pattern in JsonSchemaValidator to allow a space as well as a T between date and time). Like @pedroslopez said, this is actually a validation error, and so I think the correct solution is to fix the connectors instead of changing how we are doing validation.
Good point. We don't need to change schema validation until we see another RFC3339 field with a space instead of T.
According to the discussion above, we decided to move forward with fixing the Source Amplitude date-time format. In that case, I will assign this ticket to the Python team. Thanks.
CC: @bazarnov, @grishick, @alexandr-shegeda, @pedroslopez, @alovew
Just to follow up and warn about other connectors impacted by this issue: the Hubspot connector has the same issue regarding RFC3339 validation of date-times (space between date and time).
Taking into account the fact @shrodingers mentioned, we can proceed in the following way:
I'm not sure about the second one, as according to https://datatracker.ietf.org/doc/html/rfc3339#section-5.6 a date-time should contain 'T' between date and time:
date-time = full-date "T" full-time
So it looks like the Java library's validation is correct. Please advise. Thanks.
It sounds like we are not allowed to use any date-time formats except ISO8601, but any other connector could produce any date-time format based on the specification of its source API. There are 3 ways here:
1. Remove date-time from the schema to avoid validation at all (bad way)
2. Allow validation via custom formats on the side of the JSON Validator (acceptable way, because it takes 1 shot to cover multiple connectors)
3. Convert date-time values from the original format to ISO8601 (that has T between date and time) (bad way, requires modification in every connector where the problem might be present)
As for Source Amplitude, the easiest way is to remove the date-time format from the schema specification, but this will not resolve the main problem, which remains in the JSON validator: it accepts only 1 format and fails on others. We should have a list of formats acceptable for the majority of connectors, such as ISO8601 + RFC3339 at least.
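As a rough sketch of the "list of acceptable formats" idea (the class, pattern set, and method names here are my assumptions, not code from this thread):

```java
import java.util.List;
import java.util.regex.Pattern;

public class MultiFormatDateTimeCheck {
    // Hypothetical: accept either a strict 'T'-separated RFC3339 date-time
    // or the space-separated variant with an optional offset.
    static final List<Pattern> ACCEPTED = List.of(
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2}[Tt]\\d{2}:\\d{2}:\\d{2}"
            + "(\\.\\d+)?([Zz]|[+-]\\d{2}:?\\d{2})$"),
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
            + "(\\.\\d+)?([Zz]|[+-]\\d{2}:?\\d{2})?$"));

    static boolean isAcceptedDateTime(String s) {
        // A value is valid if any registered format matches.
        return ACCEPTED.stream().anyMatch(p -> p.matcher(s).matches());
    }
}
```

The real change would register such formats through the library's meta-schema mechanism rather than a hand-rolled list, but this shows the shape of "one validator, many accepted formats".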
We should keep in mind: if we implement the approach Alex suggested:
Allow validation via custom formats on the side of the JSON Validator (acceptable way, because it takes 1 shot to cover multiple connectors)
we have no guarantee that other date-time formats won't produce such errors in the future. However, it sounds like the best option for now is to extend JsonSchemaValidator.
We are either getting closer to specification or moving away.
Meanwhile, I'll try to fix source-amplitude by removing the date-time format specifications from its schemas.
We are using JSONSchema to specify types, and JSONSchema specifically states that date-times should be represented in RFC3339. As such, I think the right solution is to have connectors output date-times in RFC3339, rather than simply remove "format": "date-time" from the stream schemas.
It does seem like there may be issues with the way that our version of json-schema-validator is validating date-times. Later versions have moved away from having an internal regex and instead rely on the itu library to do date-time validation. That library is dedicated to parsing datetimes via RFC3339 and does support spaces as separators. If possible, updating our json-schema-validator dependency from 1.0.42 to a later version might help with the issue. However, Amplitude specifically is in fact outputting a non-RFC3339 date-time anyway: even though using a space instead of "T" is valid, it still does not include a timezone offset, which is not valid.
Just to follow up and warn about other connectors impacted by this issue: the Hubspot connector has the same issue regarding RFC3339 validation of date-times (space between date and time).
@shrodingers Hubspot has been updated to use RFC3339 as of v0.1.60 https://github.com/airbytehq/airbyte/pull/13159
We are using JSONSchema to specify types, and JSONSchema specifically states that date-times should be represented in RFC3339. As such, I think the right solution is to have connectors output date-times in RFC3339, rather than simply remove "format": "date-time" from the stream schemas. However, with Amplitude specifically it is in fact outputting a non-RFC3339 date-time anyway, because even though using a space instead of "T" is valid, it still does not use a timezone offset, which is not valid.
The problem is the Amplitude API, not the Amplitude connector, because the Amplitude API does not include a time zone. However, I don't think we should modify the date and date-time data coming from the Amplitude API within our connector by adding a time zone to the values.
@pedroslopez we are using version 7 of the json schema validator, whose specs can be found here: https://json-schema.org/specification-links.html#draft-7
we hadn't planned a version update but I don't think the platform team wants to modify the internal json schema validation library, so if future versions seem better we can discuss that
For Airbyte our date-time format is ISO8601. Changing that is a big change. If sources are emitting timestamps in another format, then they should be updated to use ISO8601 instead.
If you want to explore using RFC3339 then let's make sure to loop in @sherifnada , @bleonard as this has big connotations.
I agree with @grishick: even if the API we are pulling from doesn't produce a timestamp in the correct format (or is missing the timezone), we should add it.
Here's a short PR validating that validator works as expected: https://github.com/airbytehq/airbyte/pull/13391/files
@cgardens I totally agree that we should not modify the connectors' output just because our validator expects only ISO8601; however, I also agree that this is a big change if we implement another pattern to validate against.
Question: if we just remove the date-time format from the schemas for source-amplitude, it should pass validation, since we declare date-times inside the data to be of type string. Would this be a normal way to handle this?
We could also try to mutate the connector's output like this:
Raw output: {"updated_date": "2022-06-01 01:01:01.12345"} - which is RFC3339
Modified output: {"updated_date": "2022-06-01T01:01:01.12345"} - which is ISO8601
In this way we keep the original data, simply adding the T between the date and time.
I'm trying to understand what is the best approach in this situation. Thanks.
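A minimal sketch of that mutation (the field name and helper are hypothetical; the real fix would live in the connector):

```java
public class DateTimeNormalizer {
    // Replace the space separator with 'T' so the value matches the
    // 'T'-separated layout while keeping the rest of the data intact.
    // Values that don't look like "yyyy-MM-dd HH:mm:ss..." pass through unchanged.
    static String toTSeparated(String raw) {
        return raw.replaceFirst(
            "^(\\d{4}-\\d{2}-\\d{2}) (\\d{2}:\\d{2}:\\d{2})", "$1T$2");
    }
}
```

For example, toTSeparated("2022-06-01 01:01:01.12345") yields "2022-06-01T01:01:01.12345".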
@cgardens I totally agree that we should not modify the connectors' output just because our validator expects only ISO8601; however, I also agree that this is a big change if we implement another pattern to validate against.
I disagree that connectors should not format the dates to conform to a standard. Whether that's ISO8601 or RFC3339, I think date-times should be output in one consistent way across connectors, instead of however the source API returns them, so that airbyte knows how to handle them properly in other steps in the process.
If airbyte expects datetimes to be in ISO8601, then we should probably stop using format: date-time from json schema, since it expects RFC3339, and instead use airbyte_type to say these strings are actually dates, as outlined in this doc.
Question: if we just remove the date-time format from the schemas for source-amplitude, it should pass validation, since we declare date-times inside the data to be of type string. Would this be a normal way to handle this?
This would bypass the current RFC3339 validation that the platform is doing, but I think we need to add airbyte_type in this case so that airbyte knows what type it's supposed to be. If the connector is not outputting ISO8601, which amplitude is not, then it should also be updated to format the date as ISO8601.
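For illustration, a stream schema property using airbyte_type might look like this (the field name is hypothetical; see the linked docs for the actual conventions):

```json
{
  "updated_date": {
    "type": "string",
    "airbyte_type": "timestamp_without_timezone"
  }
}
```

With airbyte_type present, the platform knows the intended semantic type without relying on the JSONSchema "format": "date-time" keyword and its RFC3339 validation.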
In any case, I think this is pointing to needing more validations on the SAT to ensure that date-times are being emitted in the correct format from connectors and making it clear what format that should be.
Taking a second look: JsonSchema supports the ISO8601-subset of RFC3339. I doubt any destination cared about the bit of ISO8601 not supported in RFC3339, and no source outputs that collection of formats. For reference here is a diagram visualizing the diff https://ijmacd.github.io/rfc3339-iso8601/#:~:text=RFC%203339%20allows%20for%20other,of%20the%20smallest%20time%20value.
So I think this settles the type question: destinations should already expect RFC3339 section 5.6. For the issue itself: Amplitude should just support RFC3339, imo.
For follow up work we should very clearly document the point above about RFC vs. ISO formats and make sure to encode that in DAT
Cool. I'm convinced. Historically we have said ISO8601 but seems like there is a strong argument for us to do RFC3339 instead. Sounds like the cost of switching is low so let's do it.
@sherifnada @pedroslopez @grishick @bazarnov thanks for sharing all of this!
So that I understand steps forward:
@cgardens - If an API does emit ISO8601, then the connector for that API is responsible for converting it into RFC3339, right?
Yes, going forward all connectors should convert dates into RFC3339 as necessary. We will be updating docs and asserting this as part of the SAT (#13399).
- Update to a newer version of our json schema validator that supports RFC3339? Do we know that newer versions support it?
The current version of the validator library in use (com.networknt:json-schema-validator:1.0.42, according to this) is already supposed to be validating RFC3339, but they were doing it internally with a regex that was actually too lenient and didn't quite match the standard exactly. As part of https://github.com/networknt/json-schema-validator/issues/508, in newer versions they moved to instead doing RFC3339 validation with another library, https://github.com/ethlo/itu, which is true RFC3339.
There is one interesting case which is valid in ISO8601 but is not valid RFC3339: date-times with a missing timezone. According to https://ijmacd.github.io/rfc3339-iso8601/:
These date-times with no timezone information are actually supported by airbyte via a specific airbyte_type, as explained in these docs. I believe this means that for airbyte_type=timestamp_without_timezone, requiring dates to be in RFC3339 wouldn't be appropriate.
@pedroslopez @grishick @cgardens Here is the fix for source-amplitude: https://github.com/airbytehq/airbyte/pull/13373
This should be sufficient to pass validation. Also, I checked for validation errors on the latest master - no validation errors were produced.
For me, syncing fails because of this error (postgres -> bigquery):
14:06:51.143227 [debug] [Thread-7]: BigQuery adapter: Retry attempt 3 of 3 after error: BadRequest("Invalid timestamp: '2021-03-12T00:00'")
2022-06-08 14:06:56 normalization > 14:06:52.921812 [debug] [Thread-7]: Database Error in model lever (models/generated/airbyte_tables/airbytetest/lever.sql)
2022-06-08 14:06:56 normalization > Invalid timestamp: '2021-04-20T00:00'
2022-06-08 14:06:56 normalization > compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/airbytetest/lever.sql
2022-06-08 14:06:56 normalization > 14:06:52.922293 [error] [Thread-7]: 3 of 3 ERROR creating table model airbytetest.lever..................................................................... [ERROR in 5.29s]
For me, syncing fails because of this error (postgres -> bigquery)
@ro8inmorgan Is this somehow related to source-amplitude?
This is also causing my data loads into Postgres to fail...
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:48:36 ERROR i.a.w.g.DefaultNormalizationWorker(run):89 - Normalization Failed.
2022-09-10 05:48:36 INFO i.a.w.g.DefaultNormalizationWorker(run):94 - Normalization summary: io.airbyte.config.NormalizationSummary@34ca1e5[startTime=1662788861534,endTime=1662788916264,failures=[io.airbyte.config.FailureReason@2d2eb85d[failureOrigin=normalization,failureType=system_error,internalMessage=invalid input syntax for type timestamp with time zone: "None",externalMessage=Normalization failed during the dbt run. This may indicate a problem with the data itself.,metadata=io.airbyte.config.Metadata@3e830bf8[additionalProperties={attemptNumber=1, jobId=9, from_trace_message=true}],stacktrace=AirbyteDbtError:
23 of 23 ERROR creating table model leaflink_.order_event_logs_changed_fields_ship_date................................. [ERROR in 0.24s]
Database Error in model order_event_logs_changed_fields_ship_date (models/generated/airbyte_tables/leaflink_/order_event_logs_changed_fields_ship_date.sql)
invalid input syntax for type timestamp with time zone: "None"
compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/leaflink_/order_event_logs_changed_fields_ship_date.sql
23 of 23 ERROR creating table model leaflink_.order_event_logs_changed_fields_ship_date................................. [ERROR in 0.24s]
@pjatx could you share the full log of the sync?
is this from amplitude to postgres?
@sherifnada no it's not, I found the issue I think. The field in question was sometimes returning "None" so decided to just make it a plain 'ole string - sorry for the noise!
With the given connection: Source Amplitude <> Destination S3 (CSV, no compression, root level flattening), the issue with date-time parsing comes up:
Failed to apply RFC3339 pattern on 2022-05-16 19:13:09.369000
The example date-time string which causes the issue: 2022-05-16 19:13:09.369000
This is the usual RFC3339 format; however, destination S3 handles it incorrectly. More info in this issue: https://github.com/airbytehq/airbyte/issues/13057