MeltanoLabs / target-athena

Singer.io Target for AWS Athena.
Other
5 stars 16 forks source link

Cell / Column misalignment issue when tables created in Athena #27

Closed nickdegiacomo closed 2 years ago

nickdegiacomo commented 3 years ago

I have been trying to run a tap-shopify to target-athena pipeline. However, when checking created tables in Athena, some cells are misaligned.

The same issue happens in target-csv. The data is heavily nested and includes arrays of JSON. e.g., If a customer purchases 1 item vs. five items, then line_items (and other columns) will grow to fit information on each.

aaronsteers commented 3 years ago

@nickdegiacomo - Can you provide a sample dataset or (even better) a recorded singer jsonl stream that reproduces this issue?

For instance, if tap-mysomething | target-athena creates the issue, can you run tap-mysomething > outfile.jsonl and post the result with any sensitive data masked or redacted?

We can then test the repro (and eventually confirm the fix) with: cat outfile.jsonl | target-athena

nickdegiacomo commented 3 years ago

Not familiar with raw singer. Will take some time. Tried following their setup and it didn't work on the first pass.

There's alot of PII. Since I'm not sure where the error is happening, I'll need to censor some example rows by hand.

aaronsteers commented 3 years ago

That's okay, @nickdegiacomo. Can you check a few rows and see if there's any specific symptom that predicts the issue?

For instance, does it occur after blank cells, after cells with newlines, after cells containing commas, double quotes, or similar special characters?

aaronsteers commented 3 years ago

@nickdegiacomo

On second reading, this popped out to me:

I have been trying to run a tap-shopify to target-athena pipeline... The data is heavily nested and includes arrays of JSON. e.g., If a customer purchases 1 item vs. five items, then line_items (and other columns) will grow to fit information on each.

Two follow-ups:

  1. Do you mean that the line items spill into the subsequent columns? So for 4 items, your data is shifted 3 cells?
  2. Can you tell if this is a record-level or batch-level issue? Meaning: is it just the row that has the multiple line_item entries affected (prior and subsequent rows fine), or are all the records in that batch also shifted?
nickdegiacomo commented 3 years ago

The example I'm looking at now comes from a different app id source and so some information is filled in/omitted. Though I'm not sure what order it's processing columns.

In the example I'm looking at right now, I'm not seeing line items shifted 3x for 4 items.

Also after checking more closely I'm realizing that the column names are not at all accurate to original json fields.

visch commented 3 years ago

I'll get an example file after I scrub it, but generally the issue is https://github.com/MeltanoLabs/target-athena/blob/main/target_athena/athena.py#L170

The Serilizer chosen here is OpenCSVSerde. This is fine as long as your data complies with the spec that OpenCSVSerde expects. https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html expects all of your data to be quote enclosed. I took a look at the CSV's being uploaded to s3. They are not quote enclosed which means that we need to get target-athena to output valid CSV data in regards to the.

Another option is to use the LazySimpleSerDe. I'm sure there's tradeoffs being made with that option but it's much better than what is happening here.

visch commented 3 years ago

scrubbed_athena_example.txt

Attached is a scrubbed example of the problem happening. Run this by doing cat scrubbed_athena_example.txt | meltano invoke target-athena Just about every column is off in Athena

My gut feeling is the json objects are causing the issue

visch commented 3 years ago

Here's a workaround for this

There's an undocumented feature for object_format in this target

If you set the config object_format: jsonl it'll push the data up in a JSON format and bypass all of this CSV stuff.

This seems like a better default than csv.

andrewcstewart commented 3 years ago

(just repeating what I said in the other issue): Yeah @visch I agree. jsonl is the better option at the moment. I haven't had time or need to work on this target much recently, but I want to eventually add parquet support (#9 )

visch commented 2 years ago

Thanks @nickdegiacomo :D

aaronsteers commented 2 years ago

Note the related bug in target-csv has been resolved, and is available for replicating here if/when someone needs the imporoved CSV support prioritized. (Prior resolution was to make jsonl the default.)

https://github.com/MeltanoLabs/target-csv/pull/2