datamill-co / target-redshift

A Singer.io Target for Redshift
MIT License
23 stars 17 forks source link

Handle hex characters #19

Closed mirelagrigoras closed 5 years ago

mirelagrigoras commented 5 years ago

While using target-redshift with tap-hubspot, I have encountered a problem: the tap sent hex characters to the target, causing the import to fail. Yes, the tap should also be modified so that it would not pass invalid characters or hex characters, but the target should handle these cases anyway. For example, the content generated in the hubspot tap contained \0x00 in a field, which in Redshift is treated as an end of record (EOR) marker, causing the remainder of the record to be truncated: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape ) Instead of failing the import because the input from the tap contains some hex characters, they should be removed from the string before uploading the content into S3.

AlexanderMann commented 5 years ago

@mirelagrigoras this is great, thanks!

Few points/questions:

Target Postgres

I'm curious as to whether this is also an issue for target-postgres...

Commit History

To be clear, the changes you made in here are some formatting changes in the README.md and then the very last commit herein, titled Handle hex characters.?

If this is the case, can you possibly rebase your fork on master? The commit history looks to have gotten borked at some point.

Behaviour of these changes

To understand correctly, your changes herein specifically strip out any hex characters yes? ie, if a string comes through which is hex encoded, it will be turned into an empty string?

I (or @awm33) might have followup questions on this last point, but I'm trying to make sure we're in lock step here before proceeding.

mirelagrigoras commented 5 years ago

@AlexanderMann Only the last commit is related to this pull request, the other ones appear because my local branch was behind origin/master and when I merged the origin/master branch into the local master, those commits appeared. I will try to rebase the fork on master, as you suggested, but I already fetched and pulled locally the changes and this is how it ended up like this.

As for the last point, yes, you understand correctly.I suggest that replacing hex encoded characters with an empty string.

I just wanted to mention that I do not see leaving the code as it is this as an issue of target-redshift, from my point of view, hadling them would be an enhacement.

awm33 commented 5 years ago

This seems like a potential issue with hubspot and not the target. The target expects UTF-8 encoded text, which the tap should be producing. Have you thought of making an issue and perhaps a PR with tap-hubspot?

mirelagrigoras commented 5 years ago

Yes, I agree that this is an issue of the tap, but I thought that the target should be able to handle hex encoded text, if it is received from the tap. I was not aware of the target expecting UTF-8 encoded text, so thank you. I have created an issue and I will make a pull request as well. Thank you!