airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

:bug: Source Amplitude: Failed to apply RFC3339 pattern on date-time string #13075

Closed bazarnov closed 2 years ago

bazarnov commented 2 years ago

With the given connection, Source Amplitude <> Destination S3 (CSV, no compression, root-level flattening), the following date-time parsing issue comes up:

Failed to apply RFC3339 pattern on 2022-05-16 19:13:09.369000

The example date-time string which causes the issue: 2022-05-16 19:13:09.369000. This is the usual RFC3339 format; however, Destination S3 handles it incorrectly.

More info in this issue: https://github.com/airbytehq/airbyte/issues/13057

bazarnov commented 2 years ago

@alexandr-shegeda Please take a look; it sounds like a minor bug.

eliyarson commented 2 years ago

We are experiencing the same bug here with BQ destination. It's generating a massive amount of logs which is throttling the MinIO deployment.

agiletich commented 2 years ago

Same with Postgres (connectors/destination/postgres).

alexandertsukanov commented 2 years ago

During the investigation, I found that this bug is not related to either the Python or Java connectors. The code which triggers the issue is located in airbyte-workers:

https://github.com/airbytehq/airbyte/blob/3dcda7ae525129ed1dfb43e44083b121eeefc250/airbyte-workers/src/main/java/io/airbyte/workers/general/DefaultReplicationWorker.java#L309-L309

It looks like io/airbyte/validation/json/JsonSchemaValidator.java uses a library that has a regexp pattern for date-time nodes.

The dependency tree is as follows: airbyte-workers -> airbyte-json-validation -> com.networknt:json-schema-validator:1.0.42

The library is: https://github.com/airbytehq/airbyte/blob/9d1cd42ff9f3118e2312ea9c94ad647f1baaad73/airbyte-json-validation/build.gradle#L6-L6

And the line where the JsonSchemaFactory instance is created:

https://github.com/airbytehq/airbyte/blob/9d1cd42ff9f3118e2312ea9c94ad647f1baaad73/airbyte-json-validation/src/main/java/io/airbyte/validation/json/JsonSchemaValidator.java#L33

The regexp used by com.networknt:json-schema-validator:1.0.42 is:

private static final Pattern RFC3339_PATTERN = Pattern.compile(
            "^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
                    + "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T'HH:mm:ss.milliseconds
                    + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");

And the validation code looks as follows:

 // Validate the format
        if (!matcher.matches()) {
            logger.error("Failed to apply RFC3339 pattern on " + string);
            return false;
        }

This validation doesn't affect the sync. I was able to successfully sync the following sources and destinations:

The only annoying thing here is this ERROR message, which causes confusion from the client's point of view.
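For illustration, here is a minimal standalone check (using the pattern quoted above) showing that the space-separated Amplitude value fails to match, while the same value with a 'T' separator passes:

import java.util.regex.Pattern;

public class Rfc3339PatternCheck {

    // Same pattern as in com.networknt:json-schema-validator:1.0.42 (quoted above).
    private static final Pattern RFC3339_PATTERN = Pattern.compile(
            "^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
                    + "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T'HH:mm:ss.milliseconds
                    + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");

    public static void main(String[] args) {
        // Space separator, no offset: no full match, so the ERROR above gets logged.
        System.out.println(RFC3339_PATTERN.matcher("2022-05-16 19:13:09.369000").matches()); // false
        // Same value with a 'T' separator: matches, because the offset group is optional in this regex.
        System.out.println(RFC3339_PATTERN.matcher("2022-05-16T19:13:09.369000").matches()); // true
    }
}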

The library provides a way to modify the JSON meta schema for our needs: https://github.com/networknt/json-schema-validator/blob/master/doc/validators.md. I believe we can replace the constructor with something like:

public JsonSchemaValidator() {
    this.schemaValidatorsConfig = new SchemaValidatorsConfig();
    // The meta schema URI should match the JSON Schema draft used by Airbyte's schemas.
    final String uri = "https://json-schema.org/draft/2019-09/schema";
    // Override the built-in "date-time" format with a pattern that also accepts
    // a space (in addition to 'T'/'t') between the date and the time.
    final JsonMetaSchema overrideDateTimeMetaSchema = new JsonMetaSchema.Builder(uri)
        .idKeyword("$id")
        .addFormat(
            new PatternFormat("date-time", "^(\\d{4})-(\\d{2})-(\\d{2})([Tt ](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?([Zz]|([+-])(\\d{2}):?(\\d{2}))?"))
        .build();
    this.jsonSchemaFactory = new JsonSchemaFactory.Builder()
        .defaultMetaSchemaURI(uri)
        .addMetaSchema(overrideDateTimeMetaSchema)
        .build();
}

@grishick which team is responsible for the airbyte-workers module? Could you address the issue, please? Thanks in advance.

CC: @bazarnov, @alexandr-shegeda

bazarnov commented 2 years ago

In addition to what's already said above:

Could we try to adapt the existing pattern:

private static final Pattern RFC3339_PATTERN = Pattern.compile(
            "^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
                    + "([Tt](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T'HH:mm:ss.milliseconds
                    + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?");

to also validate date-time strings in the 2022-05-16 19:13:09.369000 format? Would this be a big change?
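A sketch of what that adaptation could look like (purely illustrative, not a committed change) is to accept a space alongside 'T'/'t' in the separator group:

// Hypothetical relaxed pattern: '[Tt ]' accepts 'T', 't' or a space as the separator,
// so "2022-05-16 19:13:09.369000" would also match.
private static final Pattern RELAXED_RFC3339_PATTERN = Pattern.compile(
            "^(\\d{4})-(\\d{2})-(\\d{2})" // yyyy-MM-dd
                    + "([Tt ](\\d{2}):(\\d{2}):(\\d{2})(\\.\\d+)?)?" // 'T', 't' or space, then HH:mm:ss.fraction
                    + "([Zz]|([+-])(\\d{2}):?(\\d{2}))?"); // optional offset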

pedroslopez commented 2 years ago

The example date-time string which causes the issue: 2022-05-16 19:13:09.369000. This is the usual RFC3339 format; however, Destination S3 handles it incorrectly.

@bazarnov It seems like date-times without a timezone offset are not valid in RFC3339, so I don't think updating the regex would be appropriate. See https://datatracker.ietf.org/doc/html/rfc3339#section-5.6 for the definitions, specifically:

date-time       = full-date "T" full-time
full-time       = partial-time time-offset
time-offset     = "Z" / time-numoffset

While there's a note that the "T" can be replaced by a space character, time-offset should still be specified. I believe this is the main reason why the Amplitude datetimes are coming back as invalid (no offset).

https://github.com/airbytehq/airbyte/pull/13351 will make it so that the logs aren't spammed with these messages, but Amplitude should still be addressed.

grishick commented 2 years ago

Looks like there are three issues here:

  1. we need to update the pattern in JsonSchemaValidator to allow a space as well as a T between date and time
  2. unnecessary error logging is already being taken care of in #13351
  3. potentially, there is a bug in the Amplitude source connector or the Amplitude API with respect to the missing time offset

@alexandertsukanov since you already found the place in the code where to update the regex - go ahead and make the change and add me as reviewer to the PR.

As far as I can see from briefly perusing Amplitude docs, we may need to fix how we represent Amplitude schemas:

@pedroslopez could you please open a bug for the Amplitude connector to represent date and ISO-8601 fields as dates rather than date-times? The solution may be to just remove "format": "date-time" from the schema.
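For a hypothetical field, that removal would look roughly like this in the stream schema (the field name is made up for illustration):

Before:
    "event_time": { "type": "string", "format": "date-time" }
After:
    "event_time": { "type": "string" }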

alovew commented 2 years ago

@grishick I don't think we want to do number 1: we need to update the pattern in JsonSchemaValidator to allow a space as well as a T between date and time

Like @pedroslopez said, this is actually a validation error and so I think the correct solution is to fix the connectors instead of changing how we are doing validation.

grishick commented 2 years ago

@grishick I don't think we want to do number 1: we need to update the pattern in JsonSchemaValidator to allow a space as well as a T between date and time

Like @pedroslopez said, this is actually a validation error and so I think the correct solution is to fix the connectors instead of changing how we are doing validation.

Good point. We don't need to change schema validation until we see another RFC3339 field with a space instead of T.

alexandertsukanov commented 2 years ago

Per the discussion above, we decided to move forward with fixing the Source Amplitude date-time format. In that case, I will assign this ticket to the Python team. Thanks.

CC: @bazarnov, @grishick, @alexandr-shegeda, @pedroslopez, @alovew

shrodingers commented 2 years ago

Just to follow up and warn about other connectors impacted by this issue: the Hubspot connector has the same issue regarding RFC3339 validation of date-times (space between date and time).

alexandertsukanov commented 2 years ago

Taking into account the fact @shrodingers mentioned, we can proceed in the following way:

  1. Create separate tickets for the Python team for the connectors where the format issue appears; for now, we know of at least the following connectors:
    • Amplitude
    • Hubspot
  2. Fix the JsonSchemaValidator to allow a space as well, on the Java side.

Not sure about the second one: according to https://datatracker.ietf.org/doc/html/rfc3339#section-5.6, per the RFC3339 specification a date-time should contain a 'T' between date and time: date-time = full-date "T" full-time

So, it looks like the Java library validation is correct.

Please advise. Thanks.

bazarnov commented 2 years ago

It sounds like we are not allowed to use any date-time format other than ISO8601, but any connector could produce any date-time format, depending on the specification of the source API. There are 3 ways here:

As for Source Amplitude, the easiest way is to remove the date-time format from the schema specification, but this will not resolve the main problem, which remains in the JSON validator: it accepts only one format and fails on others. We should have a list of formats acceptable for the majority of connectors, such as at least ISO8601 + RFC3339.

alexandertsukanov commented 2 years ago

We should keep in mind: if we implement the approach Alex suggested:

allow validation via custom formats on the side of the JSON validator (an acceptable way, because it takes one shot to cover multiple connectors)

We do not have guarantees that other date-time formats won't produce such errors in the future. However, it sounds like the best option for now is to extend JsonSchemaValidator.

We are either getting closer to the specification or moving away from it.

bazarnov commented 2 years ago

In the meantime, I'll try to fix Source Amplitude by removing the date-time format specifications from its schemas.

pedroslopez commented 2 years ago

We are using JSONSchema to specify types, and JSONSchema specifically states that date-times should be represented in RFC3339 format. As such, I think the right solution is to have connectors output date-times in RFC3339, rather than simply remove "format": "date-time" from the stream schemas.

It does seem like there may be issues with the way that our version of json-schema-validator is validating date-times. Later versions have moved away from having an internal regex and instead rely on the itu library to do date-time validation. That library is dedicated to parsing date-times per RFC3339 and does support spaces as separators. If possible, updating our json-schema-validator dependency from 1.0.42 to a later version might help with the issue. However, Amplitude specifically is in fact outputting a non-RFC3339 date-time anyway, because even though using a space instead of "T" is valid, it still does not include a timezone offset, which is not valid.
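As a rough sketch of what validation with that library looks like (assuming its ITU.parseDateTime entry point; exact behavior may differ between versions):

import com.ethlo.time.ITU;
import java.time.DateTimeException;

public class ItuCheck {
    public static void main(String[] args) {
        // A fully specified RFC3339 date-time parses fine.
        System.out.println(ITU.parseDateTime("2022-05-16T19:13:09.369Z"));

        // The Amplitude value has no offset (and a space separator), so strict parsing rejects it.
        try {
            ITU.parseDateTime("2022-05-16 19:13:09.369000");
        } catch (DateTimeException e) {
            System.out.println("Not a valid RFC3339 date-time: " + e.getMessage());
        }
    }
}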

Just to follow up and warn about other connectors impacted by this issue: the Hubspot connector has the same issue regarding RFC3339 validation of date-times (space between date and time).

@shrodingers Hubspot has been updated to use RFC3339 as of v0.1.60 https://github.com/airbytehq/airbyte/pull/13159

grishick commented 2 years ago

We are using JSONSchema to specify types, and JSONSchema specifically states that date-times should be represented in RFC3339 format. As such, I think the right solution is to have connectors output date-times in RFC3339, rather than simply remove "format": "date-time" from the stream schemas.

However, Amplitude specifically is in fact outputting a non-RFC3339 date-time anyway, because even though using a space instead of "T" is valid, it still does not include a timezone offset, which is not valid.

The problem is the Amplitude API, not the Amplitude connector, because the Amplitude API does not include a time zone. However, I don't think we should modify the date and date-time data coming from the Amplitude API within our connector by adding a time zone to the values.

alovew commented 2 years ago

@pedroslopez we are using version 7 of the json schema validator, whose specs can be found here: https://json-schema.org/specification-links.html#draft-7

We hadn't planned a version update, but I don't think the platform team wants to modify the internal JSON schema validation library, so if future versions seem better we can discuss that.

cgardens commented 2 years ago

For Airbyte our date-time format is ISO8601. Changing that is a big change. If sources are emitting timestamps in another format, then they should be updated to use ISO8601 instead.

If you want to explore using RFC3339 then let's make sure to loop in @sherifnada and @bleonard, as this has big implications.

I agree with @grishick; even if the API we are pulling from doesn't produce a timestamp in the correct format (or is missing the timezone), then we should add it.

cgardens commented 2 years ago

Here's a short PR validating that the validator works as expected: https://github.com/airbytehq/airbyte/pull/13391/files

bazarnov commented 2 years ago

@cgardens I totally agree that we should not modify the connectors' output just because our validator expects only ISO8601; however, I also agree that this is a big change if we implement another pattern to validate against.

Question: if we just remove the date-time format from the schemas for source-amplitude, it should pass validation, since we declare the date-time inside the data to be of type string. Would this be a normal way to handle this?

We could also try to mutate the connector's output like this:

Raw output: {"updated_date": "2022-06-01 01:01:01.12345"} - which is RFC3339
Modified output: {"updated_date": "2022-06-01T01:01:01.12345"} - which is ISO8601

In this way we keep the original data, simply adding the T between the date and the time.
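To illustrate, the mutation is just a single string substitution (the real change would live in the Python connector; this Java fragment is only a sketch):

// Hypothetical sketch: replace the first space with 'T', keeping the rest of the value intact.
String raw = "2022-06-01 01:01:01.12345";
String withSeparator = raw.replaceFirst(" ", "T"); // -> "2022-06-01T01:01:01.12345"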

I'm trying to understand what is the best approach in this situation. Thanks.

pedroslopez commented 2 years ago

@cgardens I totally agree that we should not modify the connectors' output just because our validator expects only ISO8601; however, I also agree that this is a big change if we implement another pattern to validate against.

I disagree with the idea that connectors should not format dates to conform to a standard. Whether that's ISO8601 or RFC3339, I think date-times should be output in one way across connectors, instead of however the source API returns them, so that Airbyte knows how to handle them properly in other steps of the process.

If Airbyte expects date-times to be in ISO8601, then we should probably stop using format: date-time from JSON schema (since it expects RFC3339) and instead use airbyte_type to say these strings are actually dates, as outlined in this doc.

Question: if we just remove the date-time format from the schemas for source-amplitude, it should pass validation, since we declare the date-time inside the data to be of type string. Would this be a normal way to handle this?

This would bypass the current RFC3339 validation that the platform is doing, but I think we need to add airbyte_type in this case so that airbyte knows what type it's supposed to be. If the connector is not outputting in ISO8601, which amplitude is not, then it should also be updated to format the date as ISO8601.

In any case, I think this is pointing to needing more validations on the SAT to ensure that date-times are being emitted in the correct format from connectors and making it clear what format that should be.

sherifnada commented 2 years ago

Taking a second look: JsonSchema supports the ISO8601 subset of RFC3339. I doubt any destination cared about the bit of ISO8601 not supported in RFC3339, and no source outputs that collection of formats. For reference, here is a diagram visualizing the diff: https://ijmacd.github.io/rfc3339-iso8601/#:~:text=RFC%203339%20allows%20for%20other,of%20the%20smallest%20time%20value.

So I think this settles the type question: destinations should already expect RFC3339 section 5.6. For the issue itself: Amplitude should just support RFC3339, IMO.

For follow-up work, we should very clearly document the point above about RFC vs. ISO formats and make sure to encode that in the DAT.

cgardens commented 2 years ago

Cool, I'm convinced. Historically we have said ISO8601, but it seems like there is a strong argument for us to do RFC3339 instead. It sounds like the cost of switching is low, so let's do it.

@sherifnada @pedroslopez @grishick @bazarnov thanks for sharing all of this!

So that I understand the steps forward:

pedroslopez commented 2 years ago

@cgardens

  • If an API does emit ISO8601, then the connector for that API is responsible for converting it into RFC3339, right?

Yes, going forward all connectors should convert dates into RFC3339 as necessary. We will be updating docs and asserting this as part of the SAT (#13399).

  • Update to a newer version of our json schema validator that supports RFC3339? Do we know that newer versions support it?

The current version of the validator library in use (com.networknt:json-schema-validator:1.0.42 according to this) is already supposed to be validating RFC3339, but they were doing it internally with a regex that was actually too lenient and didn't quite match the standard exactly. As part of https://github.com/networknt/json-schema-validator/issues/508, in newer versions they moved to instead doing RFC3339 validation with another library, https://github.com/ethlo/itu which is true RFC3339.


There is one interesting case which is valid in ISO8601 but not valid RFC3339: date-times with a missing timezone. According to https://ijmacd.github.io/rfc3339-iso8601/:

(screenshots from the comparison site showing date-time examples with and without a timezone offset)

These date-times with no timezone information are actually supported by Airbyte via a specific airbyte_type, as explained in these docs. I believe this means that if we require date-times to be RFC3339 (which always requires an offset), airbyte_type=timestamp_without_timezone wouldn't be appropriate.
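For reference, a hypothetical stream-schema property using that convention could look like this (the field name is made up; the airbyte_type value comes from the docs referenced above):

    "created_at": {
      "type": "string",
      "format": "date-time",
      "airbyte_type": "timestamp_without_timezone"
    }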

bazarnov commented 2 years ago

@pedroslopez @grishick @cgardens Here is the fix for source-amplitude: https://github.com/airbytehq/airbyte/pull/13373. This should be sufficient to pass validation.

Also, I checked for validation errors on the latest master: no validation errors were produced.

ro8inmorgan commented 2 years ago

For me, syncing fails because of this error (Postgres -> BigQuery):

14:06:51.143227 [debug] [Thread-7 ]: BigQuery adapter: Retry attempt 3 of 3 after error: BadRequest("Invalid timestamp: '2021-03-12T00:00'")
2022-06-08 14:06:56 normalization > 14:06:52.921812 [debug] [Thread-7 ]: Database Error in model lever (models/generated/airbyte_tables/airbytetest/lever.sql)
2022-06-08 14:06:56 normalization > Invalid timestamp: '2021-04-20T00:00'
2022-06-08 14:06:56 normalization > compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/airbytetest/lever.sql
2022-06-08 14:06:56 normalization > 14:06:52.922293 [error] [Thread-7 ]: 3 of 3 ERROR creating table model airbytetest.lever..................................................................... [ERROR in 5.29s]

bazarnov commented 2 years ago

For me, syncing fails because of this error (Postgres -> BigQuery):

@ro8inmorgan Is this somehow related to source-amplitude?

pjatx commented 1 year ago

This is also causing my data loads into Postgres to fail...

2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on 2022-09-22 06:00:00-06:00
2022-09-10 05:24:55 ERROR c.n.s.DateTimeValidator(isLegalDateTime):70 - Failed to apply RFC3339 pattern on None
2022-09-10 05:48:36 ERROR i.a.w.g.DefaultNormalizationWorker(run):89 - Normalization Failed.
2022-09-10 05:48:36 INFO i.a.w.g.DefaultNormalizationWorker(run):94 - Normalization summary: io.airbyte.config.NormalizationSummary@34ca1e5[startTime=1662788861534,endTime=1662788916264,failures=[io.airbyte.config.FailureReason@2d2eb85d[failureOrigin=normalization,failureType=system_error,internalMessage=invalid input syntax for type timestamp with time zone: "None",externalMessage=Normalization failed during the dbt run. This may indicate a problem with the data itself.,metadata=io.airbyte.config.Metadata@3e830bf8[additionalProperties={attemptNumber=1, jobId=9, from_trace_message=true}],stacktrace=AirbyteDbtError: 
23 of 23 ERROR creating table model leaflink_.order_event_logs_changed_fields_ship_date................................. [ERROR in 0.24s]
Database Error in model order_event_logs_changed_fields_ship_date (models/generated/airbyte_tables/leaflink_/order_event_logs_changed_fields_ship_date.sql)
  invalid input syntax for type timestamp with time zone: "None"
  compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/leaflink_/order_event_logs_changed_fields_ship_date.sql
23 of 23 ERROR creating table model leaflink_.order_event_logs_changed_fields_ship_date................................. [ERROR in 0.24s]
sherifnada commented 1 year ago

@pjatx could you share the full log of the sync?

sherifnada commented 1 year ago

is this from amplitude to postgres?

pjatx commented 1 year ago

@sherifnada No, it's not. I think I found the issue: the field in question was sometimes returning "None", so I decided to just make it a plain ol' string. Sorry for the noise!