Closed nickdegiacomo closed 2 years ago
@nickdegiacomo - Can you provide a sample dataset or (even better) a recorded singer jsonl stream that reproduces this issue?
For instance, if tap-mysomething | target-athena
creates the issue, can you run tap-mysomething > outfile.jsonl
and post the result with any sensitive data masked or redacted?
We can then test the repro (and eventually confirm the fix) with: cat outfile.jsonl | target-athena
Not familiar with raw singer. Will take some time. Tried following their setup and it didn't work on the first pass.
There's alot of PII. Since I'm not sure where the error is happening, I'll need to censor some example rows by hand.
That's okay, @nickdegiacomo. Can you check a few rows and see if there's any specific symptom that predicts the issue?
For instance, does it occur after blank cells, after cells with newlines, after cells containing commas, double quotes, or similar special characters?
@nickdegiacomo
On second reading, this popped out to me:
I have been trying to run a tap-shopify to target-athena pipeline... The data is heavily nested and includes arrays of JSON. e.g., If a customer purchases 1 item vs. five items, then
line_items
(and other columns) will grow to fit information on each.
Two follow-ups:
line_item
entries affected (prior and subsequent rows fine), or are all the records in that batch also shifted?The example I'm looking at now comes from a different app id source and so some information is filled in/omitted. Though I'm not sure what order it's processing columns.
In the example I'm looking at right now, I'm not seeing line items shifted 3x for 4 items.
Also after checking more closely I'm realizing that the column names are not at all accurate to original json fields.
I'll get an example file after I scrub it, but generally the issue is https://github.com/MeltanoLabs/target-athena/blob/main/target_athena/athena.py#L170
The Serilizer chosen here is OpenCSVSerde. This is fine as long as your data complies with the spec that OpenCSVSerde expects. https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html expects all of your data to be quote enclosed. I took a look at the CSV's being uploaded to s3. They are not quote enclosed which means that we need to get target-athena to output valid CSV data in regards to the.
Another option is to use the LazySimpleSerDe. I'm sure there's tradeoffs being made with that option but it's much better than what is happening here.
Attached is a scrubbed example of the problem happening. Run this by doing
cat scrubbed_athena_example.txt | meltano invoke target-athena
Just about every column is off in Athena
My gut feeling is the json objects are causing the issue
Here's a workaround for this
There's an undocumented feature for object_format
in this target
If you set the config object_format: jsonl
it'll push the data up in a JSON format and bypass all of this CSV stuff.
This seems like a better default than csv.
(just repeating what I said in the other issue): Yeah @visch I agree. jsonl is the better option at the moment. I haven't had time or need to work on this target much recently, but I want to eventually add parquet support (#9 )
Thanks @nickdegiacomo :D
Note the related bug in target-csv has been resolved, and is available for replicating here if/when someone needs the imporoved CSV support prioritized. (Prior resolution was to make jsonl the default.)
I have been trying to run a tap-shopify to target-athena pipeline. However, when checking created tables in Athena, some cells are misaligned.
The same issue happens in target-csv. The data is heavily nested and includes arrays of JSON. e.g., If a customer purchases 1 item vs. five items, then line_items (and other columns) will grow to fit information on each.