Closed dorcieg closed 5 years ago
Hey @dorcieg, sorry bout that, just now getting around to this.
invalid_records_detect Isn't Broken

So, invalid_records_detect is working correctly here. The detection has to do with JSON Schema and the input stream. Since we are handed a JSON Schema for every stream, we can validate the data we receive against it before doing any work with the remote db.
For example, you give us a schema saying the field a is always an integer, and then we get a boolean... before we hit your database, we can inform you of this specific issue in the data itself, making the end debugging (hopefully) simpler.
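The pre-validation described above can be sketched roughly like this. This is an illustrative hand-rolled snippet, not target-redshift's actual code (the real target runs a full JSON Schema validator); the `detect_invalid` helper and the tiny schema are made up for the example:

```python
# Illustrative sketch only: a tiny type check in the spirit of the JSON
# Schema pre-validation a target can run before touching the database.
# (target-redshift itself uses a real JSON Schema validator.)

# Note: bool is a subclass of int in Python, so "integer" must exclude it.
TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}

def detect_invalid(record, schema):
    """Return readable errors for fields whose values violate the schema."""
    errors = []
    for field, spec in schema.get("properties", {}).items():
        if field in record and not TYPE_CHECKS[spec["type"]](record[field]):
            errors.append(
                "{}: expected {}, got {}".format(
                    field, spec["type"], type(record[field]).__name__))
    return errors

schema = {"properties": {"a": {"type": "integer"}}}
print(detect_invalid({"a": 42}, schema))    # []
print(detect_invalid({"a": True}, schema))  # ['a: expected integer, got bool']
```

This is exactly the integer-vs-boolean case from the example: the bad value is reported before any CSV ever reaches Redshift.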
In this particular case, you have an actual issue with what we're trying to place into Redshift. So all of our detection stuff is happy, and then Redshift is telling us that it can't understand the CSV we've asked it to upload.
That hex character is UTF-8 Unicode, no? Something like an upside-down pi symbol?
If any of the above has tripped something in your brain 💡 which gives you any insight on what's going on, please share!
Is the data and/or schema for this particular stream sensitive? i.e., needing an NDA?
If no, can you upload the JSON stream so I can whip up a failing test we can use to figure out the issue?
If yes, can you whittle the schema and everything down to a small subset which is failing, and then upload that?
My best guess without any additional information is that you're using an encoding which we can't/don't support. This Pull Request https://github.com/datamill-co/target-redshift/pull/19 has a bunch of information on this and might be of use to you!
If this guess is correct I think we should consider adding some logging/documentation about expecting UTF-8, not hex, etc. Also, if this is a tap-hubspot related issue, @mirelagrigoras might be able to offer some help/insight, as I know they've brought up a number of related problems there.
Hi @AlexanderMann... thanks for the reply. So I was able to locate the exact value that is giving me this issue. In the Hubspot form submission data there is an array of custom field names and their values. One of the form submissions has the value "[\"O=\ufffd\ufffd\ufffd\u0000\ufffd\ufffd@\ufffd", which is causing the error. @mirelagrigoras, were you able to solve a similar issue?
Hello! @dorcieg I did not have a problem with the forms stream, but I encountered something similar for email_events. What I did was make my own version of the tap in which I replaced any hexadecimal-encoded characters with "?". You can see the code that I used here: https://github.com/mirelagrigoras/rdb-comp-tap-hubspot/blob/master/rdb_comp_tap_hubspot/rdb_comp_tap_hubspot.py#L196 I cannot guarantee that this would work in your case, but that is what I did and it solved my problem. I also made a pull request to tap-hubspot, but it seems to me that they are not as efficient at adjusting and improving the tap as @AlexanderMann and @awm33 are with the target. If I can be of any help, please let me know.
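The replacement trick described above can be sketched like this. This is a hypothetical stand-alone version, not the code from the linked tap; the `sanitize` helper and its exact rules (replace U+FFFD and non-printable control characters, keep newline/tab) are assumptions for illustration:

```python
import unicodedata

def sanitize(value):
    """Replace characters that won't survive a clean UTF-8 round trip.

    Swaps the Unicode replacement character (U+FFFD) and non-printable
    control characters (except newline/tab) for "?".
    """
    cleaned = []
    for ch in value:
        # "Cc" is the Unicode category for control characters.
        if ch == "\ufffd" or (unicodedata.category(ch) == "Cc" and ch not in "\n\t"):
            cleaned.append("?")
        else:
            cleaned.append(ch)
    return "".join(cleaned)

# The replacement chars and NUL from the failing form-submission value:
print(sanitize("O=\ufffd\ufffd\x00ok"))  # O=???ok
```

Run over each string field before the record is emitted to the target, this prevents the bad bytes from ever reaching the Redshift COPY step.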
I wonder if you could write a simple pipe command (maybe even in bash?) which would just search for hex characters and sed them with ? (or whatever else)...
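Something along those lines, as a sketch: `tr -c` replaces every byte outside the listed set, which catches stray control bytes like the 0xBD one Redshift complained about. The demo input here is made up; in practice you'd pipe the real record stream through the filter:

```shell
# Replace any byte outside printable ASCII (plus newline/tab) with "?".
# \001 is a control byte and \275 is octal for 0xBD, the offending byte.
printf 'a\001b=\275' | tr -c '[:print:]\n\t' '?'
```

Caveat: this is a blunt instrument, since it also mangles legitimate multi-byte UTF-8 characters; replacing only invalid sequences would need something smarter than tr.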
Well, in my case I'm creating the form submission function in the Hubspot tap, so I will totally incorporate that into my code @mirelagrigoras... thanks so much for the suggestion! And thanks @AlexanderMann for your quick responses :-)
According to the readme, adding invalid_records_detect set to false in the config means Redshift won't crash when there is an invalid record. I've added it to my config file, but each time Redshift encounters a malformed row I get the following error and my pipe crashes. From stl_load_errors:

Missing newline: Unexpected character 0xffffffbd found at location 34
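For reference, the flag in question sits alongside the connection settings in the target's JSON config, something like the fragment below (other keys elided; the host value is a placeholder). As @AlexanderMann explains, though, this flag only governs the JSON Schema pre-validation, not Redshift COPY errors like this one:

```json
{
  "redshift_host": "...",
  "invalid_records_detect": false
}
```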