aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.93k stars 698 forks source link

write NULL instead of str(None) to timestream #894

Closed sutiv closed 2 years ago

sutiv commented 3 years ago
"Dimensions": [x for x in [
                        {"Name": name, "DimensionValueType": "VARCHAR", "Value": str(value)} if value else None
                        for name, value in zip(cols_names[2:], rec[2:])
                    ] if x]

This code here -> https://github.com/awslabs/aws-data-wrangler/blob/6f7249faa8402188d5edf72465abd7ea04c19a28/awswrangler/timestream.py#L54 should do this

jaidisido commented 3 years ago

I might be mistaken but I don't think the timestream write_records call accepts None, only string. See this PR with your suggested change and the associated failed logs

E               botocore.exceptions.ParamValidationError: Parameter validation failed:
E               Invalid type for parameter Records[0].Dimensions[0].Value, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
sutiv commented 3 years ago

sorry for the late reply. that's right. write_records only accepts strings, but there is a casting to str() for the "Value" and if this value is None the string will contain "None" and this value is written to timestream. so as a result in timestream is "None" instead of NULL (not-filled-value)

jaidisido commented 3 years ago

The part I do not understand is that I have implemented your suggested code and that is leading to the error in the PR/logs.

Is your suggestion to simply catch any None and write a "NULL" string instead?

sutiv commented 3 years ago

I get an "AccessDenied" for the link to logs.

It is NOT a "NULL"-string instead, it does remove only a dimension, if x is None - so the result is that this dimension is not added to the dimension-list and timestream fills the value in the DB with a real NULL value. Isn't that the expected behaviour if you add a None to a dimension :)

kukushking commented 3 years ago

According to boto3 timestream docs the value must be a string so I think that leaves us with "NULL" string as the only option

sutiv commented 3 years ago

That is correct, if the dimension is mentioned. But if (like in my code) the dimension and its value aren't present (i remove them), there is no problem. I do this workaround since some weeks and it works without any issues

kukushking commented 3 years ago

Ah I see. I think for the library better experience is to explicitly fail if there is an empty dimension than to silently ignore it though

sutiv commented 3 years ago

But if I set the Dimension_Value explicit to None, I would expect to have NULL in the DB and not "None" or "NULL" or an Exception. The user explicitly specifies the NULL value, not a string or anything else. Shouldn't that be possible?

kukushking commented 3 years ago

It's not possible by the current API, however it's worth to check with timestream team if it's considered a valid use case. @jaidisido can you summon timestream folks here, please?

github-actions[bot] commented 2 years ago

This issue requires triage and should be assigned.