datamill-co / target-redshift

A Singer.io Target for Redshift
MIT License

invalid_records_detect in config not being respected #25

Closed dorcieg closed 5 years ago

dorcieg commented 5 years ago

According to the README, adding invalid_records_detect set to false to the config means the pipeline won't crash when there are invalid records. I've added it to my config file, but each time Redshift encounters a malformed row I get the following error and my pipe crashes:

ERROR Exception writing records
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.

CRITICAL ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/bin/target-redshift", line 11, in <module>
    sys.exit(cli())
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 54, in cli
    main(args.config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 25, in main
    stream_to_target(input_stream, target, config=config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 69, in stream_to_target
    raise e
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 60, in stream_to_target
    _flush_streams(streams, target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 88, in _flush_streams
    _flush_stream(target, stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 81, in _flush_stream
    target.write_batch(stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 58, in write_batch
    return PostgresTarget.write_batch(self, nullable_stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 195, in write_batch
    raise PostgresError(message, ex)
target_postgres.postgres.PostgresError: ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))

From stl_load_errors: Missing newline: Unexpected character 0xffffffbd found at location 34

AlexanderMann commented 5 years ago

Hey @dorcieg, sorry bout that, just now getting around to this.

invalid_records_detect Isn't Broken

So, invalid_records_detect is working correctly here. The detection has to do with the JSON Schema and the input stream: since we are handed a JSON Schema for every stream, we can validate the data we receive against it before doing any work with the remote db.

For example, if you give us a schema saying the field a is always an integer and we then get a boolean, we can inform you of this specific issue in the data itself before we ever hit your database, making the end debugging (hopefully) simpler.
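To make the distinction concrete, here's a rough sketch of the kind of pre-load check invalid_records_detect performs; the schema, field names, and checker are illustrative only, not target-redshift's actual code:

```python
# Illustrative only: catch values that violate the stream's JSON Schema
# *before* any COPY into Redshift is attempted.

def detect_invalid(record, schema):
    """Return (field, problem) pairs for values that violate the schema."""
    checks = {
        # bool is a subclass of int in Python, so exclude it explicitly
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "string": lambda v: isinstance(v, str),
    }
    errors = []
    for field, spec in schema.get("properties", {}).items():
        check = checks.get(spec.get("type"))
        if field in record and check and not check(record[field]):
            errors.append((field, "expected " + spec["type"]))
    return errors

schema = {"properties": {"a": {"type": "integer"}}}
print(detect_invalid({"a": True}, schema))  # [('a', 'expected integer')]
print(detect_invalid({"a": 7}, schema))     # []
```

A record failing this check is a *schema* problem, which invalid_records_detect governs; the error in this issue happens later, inside Redshift's COPY, which this flag never sees.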

The 🐛

In this particular case, you have an actual issue with what we're trying to place into Redshift. So all of our detection is happy, and then Redshift tells us that it can't understand the CSV we've asked it to upload.

That hex character is UTF8 Unicode no? Something like an upside down pi symbol?

The De- 🐛 (Debugging)

If any of the above has tripped something in your brain 💡 which gives you any insight on what's going on, please share!

Is the data and/or schema for this particular stream sensitive, i.e., needing an NDA?

If no, can you upload the JSON stream so I can whip up a failing test we can use to figure out the issue?

If yes, can you whittle the schema and everything down to a small subset which is failing, and then upload that?

My Guess

My best guess without any additional information is that you're using an encoding which we can't/don't support. This Pull Request https://github.com/datamill-co/target-redshift/pull/19 has a bunch of information on this and might be of use to you!

If this guess is correct I think we should consider adding some logging/documentation about expecting UTF-8 input. Also, if this is a tap-hubspot-related issue, @mirelagrigoras might be able to offer some help/insight, as I know they've brought up a number of problems related there.
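If the encoding guess is right, the stl_load_errors symptom can be reproduced locally: Redshift's COPY expects UTF-8, and a lone Latin-1 byte like 0xbd (½, which stl_load_errors reports sign-extended as 0xffffffbd) is not valid UTF-8 on its own. A minimal sketch, assuming the problem is a stray non-UTF-8 byte:

```python
# Sketch: find the first byte in a raw value that is not valid UTF-8,
# the same class of byte Redshift's COPY rejects.

def check_utf8(raw: bytes):
    """Return None if raw is valid UTF-8, else (offset, hex_byte) of the first bad byte."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as e:
        return (e.start, hex(raw[e.start]))

print(check_utf8(b"plain ascii"))     # None
print(check_utf8(b"O=\xbd garbage"))  # (2, '0xbd')
```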

dorcieg commented 5 years ago

Hi @AlexanderMann... thanks for the reply. So I was able to locate the exact value that is giving me this issue. In the HubSpot form submission data there is an array of custom field names and their values. One of the form submissions has the value "[\"O=\ufffd\ufffd\ufffd\u0000\ufffd\ufffd@\ufffd", which is causing the error. @mirelagrigoras were you able to solve a similar issue?

mirelagrigoras commented 5 years ago

Hello! @dorcieg I did not have a problem with the forms stream, but I encountered something similar for email_events. What I did was make my own version of the tap in which I replaced any hexadecimal-encoded characters with "?". You can see the code I used here: https://github.com/mirelagrigoras/rdb-comp-tap-hubspot/blob/master/rdb_comp_tap_hubspot/rdb_comp_tap_hubspot.py#L196 I cannot guarantee that this would work in your case, but that is what I did and it solved my problem. I also made a pull request to tap-hubspot, but it seems to me that they are not as responsive about adjusting and improving the tap as @AlexanderMann and @awm33 are with the target. If I can be of any help, please let me know.
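The linked tap code is the authoritative version; as an illustrative sketch of the same idea, here is one way to walk a decoded record and swap U+FFFD replacement characters (what "\ufffd" escapes in the value above denote) for "?":

```python
# Sketch only: recursively scrub a decoded Singer record so no
# U+FFFD replacement characters reach the target's CSV.

def scrub(value):
    """Replace U+FFFD with '?' in strings, recursing through lists and dicts."""
    if isinstance(value, str):
        return value.replace("\ufffd", "?")
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    return value

print(scrub({"form": ["O=\ufffd\ufffd", "ok"]}))  # {'form': ['O=??', 'ok']}
```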

AlexanderMann commented 5 years ago

I wonder if you could write a simple pipe command (maybe even in bash?) which would just search for hex characters and sed them with ? (or whatever else)...
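Along those lines, a minimal sketch of such a filter in Python (the script name and the idea of sitting between tap and target in the pipe are assumptions, not an existing tool): each line is re-decoded leniently, so any byte that isn't valid UTF-8 comes out as "?".

```python
# Hypothetical pipe filter, e.g.: tap-hubspot | python scrub_stream.py | target-redshift
# Note: a *legitimate* U+FFFD already in the input would also become "?".
import sys

def scrub_line(raw: bytes) -> str:
    """Decode one raw line, mapping any undecodable byte to '?'."""
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")

def main(stdin=None, stdout=None):
    stdin = stdin if stdin is not None else sys.stdin.buffer
    stdout = stdout if stdout is not None else sys.stdout
    for raw in stdin:
        stdout.write(scrub_line(raw))
```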

dorcieg commented 5 years ago

Well, in my case I'm creating the form submission function in the HubSpot tap, so I will totally incorporate that into my code @mirelagrigoras... thanks so much for the suggestion! And thanks @AlexanderMann for your quick responses :-)