
SnappyData 1.2.0 always creates column names in lowercase even though create command has column names in uppercase. #1559

Closed parimalaadinarayan closed 3 years ago

parimalaadinarayan commented 3 years ago

Because of some stability issues in SnappyData 1.1.0, we decided to upgrade to the recently released SnappyData 1.2.0. We have been using SnappyData in production for more than a year.

Post upgrade, things seem to be fine with respect to performance etc., except for the problem below, which puts us in a make-or-break situation.

  1. We create tables in SnappyData with column names in CAPITALS.
  2. After creation, the table description in the SnappyData console shows all column names in lowercase.
  3. We thought this could just be a console display issue; however, the behavior goes deeper.
  4. Clients that consume SnappyData tables address column names in CAPITALS, because that is how the tables were originally created.
  5. All of those clients started getting a "Column not found" exception the moment they accessed a column with a different case.
  6. These errors do not occur while the query executes inside SnappyData; they occur when clients read the data using SnappyData's JDBC driver (snappydata-jdbc_2.11-1.1.0.jar).
  7. Accessing columns by their lowercase names works fine.
  8. We tried SnappyData's latest JDBC driver (snappydata-jdbc_2.11-1.2.0.jar); the same behavior is observed.
  9. We tried setting the property spark.sql.caseSensitive to true in spark-defaults.conf, so that the data would be stored with the case given in the CREATE TABLE command, but the column names were still stored in lowercase.
  10. The SnappySession is created using a SparkContext, after which DataFrames are pushed directly to SnappyData.
  11. We also tried spark.conf.set("spark.sql.caseSensitive", "true"), with the same result: column names are stored in lowercase.

The goal is that the upgrade process should not require any changes at the client level. Hence, we are looking for any configuration/property/setting that makes SnappyData create tables and columns in the user-given case. A minimal sketch of how the mismatch surfaces to clients is shown below.
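For illustration, here is a minimal sketch of the client-side symptom (the locator host/port and table name are hypothetical, not our production code) that inspects the column names the JDBC driver actually reports:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ColumnCaseCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical locator host/port; adjust for the actual cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/");
             Statement stmt = conn.createStatement()) {
            // Column declared in CAPITALS, as in our production DDL.
            stmt.execute("CREATE TABLE IF NOT EXISTS CASETEST (NAME VARCHAR(50)) USING column");
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM CASETEST")) {
                ResultSetMetaData md = rs.getMetaData();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    // On 1.2.0 this prints the lower-case "name", and
                    // rs.getString("NAME") throws "Column not found".
                    System.out.println(md.getColumnName(i));
                }
            }
        }
    }
}
```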

sumwale commented 3 years ago

@parimalaadinarayan Two points. First, when you say "given case": SQL is case-insensitive, so identifiers effectively have no case unless you quote the column names. Second, even if the column/table names are quoted (backquotes or double quotes, preferably the former), the hive metastore is case-insensitive, unless one relies completely on custom strings stored in the hive metastore (which would make it unreadable otherwise). Since SnappyData is built on Spark, which uses the hive metastore, this limitation remains: stored names are always in lower-case. For queries this means that if "spark.sql.caseSensitive" is true, you have to pass the names in lower-case; otherwise the case is ignored.
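For example (a sketch with a hypothetical table t and column name; stmt is an open java.sql.Statement), all of the following resolve to the same stored column as long as spark.sql.caseSensitive is false:

```java
// All three queries hit the same stored (lower-case) column "name",
// whether the identifier is quoted or not, since names in queries are
// matched case-insensitively when spark.sql.caseSensitive is false.
stmt.executeQuery("SELECT NAME FROM t");
stmt.executeQuery("SELECT name FROM t");
stmt.executeQuery("SELECT `Name` FROM t");
```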

In a nutshell, regardless of how they are stored, the names specified in queries/DMLs, whether quoted or unquoted, will always be case-insensitive. The only problem would be if you have names in the same table that differ only in case; I can only say that this is a very bad idea and should have caused trouble even in version 1.1. Barring that, as long as you do not set "spark.sql.caseSensitive" to true, there will be no issue with whatever case you pass in quoted or unquoted column/table names. The only issue will be in the JDBC metadata APIs (like DatabaseMetaData), where the names specified, or returned in results, will have the specific case used by the storage. Is that the problem you have been having, or is it something else?
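For instance, a minimal sketch of where the storage case does surface (conn is an open JDBC connection; the table name t is hypothetical):

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;

// Lists the columns of a table via the JDBC metadata API. Per the
// point above, the pattern arguments match against the names as
// stored, so the lower-case "t" is what finds the table here.
static void printColumns(Connection conn) throws Exception {
    DatabaseMetaData dmd = conn.getMetaData();
    try (ResultSet cols = dmd.getColumns(null, null, "t", "%")) {
        while (cols.next()) {
            // Column names come back in the storage (lower) case.
            System.out.println(cols.getString("COLUMN_NAME"));
        }
    }
}
```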

If I have misunderstood the issue, can you give a small sequence of steps using the JDBC driver (e.g. on snappy shell) to demonstrate the issue?

parimalaadinarayan commented 3 years ago

@sumwale Attaching the screenshots and the steps to reproduce the issue.

  1. Created a table called columnbug with one column named "NAME" and inserted a dummy value into it.

  2. When I run select * from columnbug, the column name is shown in lowercase even though the CREATE TABLE statement has it in uppercase (see the attached screenshot).

  3. When the code tries to read the column by name from the ResultSet, the following error occurs:

     ```
     Exception in thread "main" java.sql.SQLException: (SQLState=S0022 Severity=0) Column 'NAME' not found.
         at com.pivotal.gemfirexd.internal.shared.common.error.DefaultExceptionFactory30.getSQLException(DefaultExceptionFactory30.java:44)
         at com.pivotal.gemfirexd.internal.shared.common.error.DefaultExceptionFactory30.getSQLException(DefaultExceptionFactory30.java:63)
         at com.pivotal.gemfirexd.internal.shared.common.error.ExceptionUtil.newSQLException(ExceptionUtil.java:158)
         at io.snappydata.thrift.internal.ClientResultSet.getColumnIndex(ClientResultSet.java:259)
         at io.snappydata.thrift.internal.ClientResultSet.getString(ClientResultSet.java:723)
         at com.pw.smp.servicesengine.helpers.SnappyDataJDBCClient.areRowsPresent1(SnappyDataJDBCClient.java:86)
         at com.pw.smp.servicesengine.helpers.SnappyDataJDBCClient.main(SnappyDataJDBCClient.java:99)
     ```

  4. Code snippet for your reference:

     ```java
     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.ResultSet;
     import java.sql.Statement;

     String JDBC_URL = "jdbc:snappydata://%1$s:%2$s/";
     Connection connection = DriverManager.getConnection(
             String.format(JDBC_URL, snappyDataLocator, jdbcport));
     if (connection != null) {
         System.out.println("Successfully connected to snappydata");
     }
     Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT * from columnbug");

     while (rs.next()) {
         // Throws "Column 'NAME' not found" on 1.2.0; see the stack trace above.
         String a = rs.getString("NAME");
         System.out.println(a);
     }

     connection.close();
     ```
  5. Also attaching a screenshot showing that spark.sql.caseSensitive is set to false.

sumwale commented 3 years ago

@parimalaadinarayan As mentioned, both storage and SQL itself are normally case-insensitive, so points 1 and 2 above are expected behaviour. However, points 3 and 4 highlight an issue: the JDBC API should also be case-insensitive, so it should not have failed.

As regards point 5, the property is not really recommended because the hive metastore is case-insensitive; it can work as expected only with the in-memory catalog, which cannot be used in an embedded SnappyData cluster. Even from an external Spark cluster or a standalone one, the property should be avoided, because the SQL language itself is case-insensitive and identifiers have to be back-quoted for the property to have any real effect. From the storage point of view, if a table or query uses two columns that differ only in case, that will cause trouble even in bare Spark despite this property. Overall, it's best to avoid it, as it serves no useful purpose.

A fix for points 3 and 4 is being tested. As of now only lower-case names work in that JDBC API.
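As a stopgap, something like the following should work (a sketch reusing the rs from your snippet above): read the column by position, or by the stored lower-case name, which you already confirmed works:

```java
while (rs.next()) {
    // Positional access side-steps the name lookup entirely.
    String byIndex = rs.getString(1);
    // The stored lower-case name also resolves with the current driver.
    String byName = rs.getString("name");
    System.out.println(byIndex + " / " + byName);
}
```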

parimalaadinarayan commented 3 years ago

@sumwale Thanks for the reply. We are on the verge of upgrading; please let us know when the release with the fix is planned.

sumwale commented 3 years ago

Fixed with the merge of https://github.com/TIBCOSoftware/snappy-store/pull/565 and https://github.com/TIBCOSoftware/snappydata/pull/1560

@parimalaadinarayan The official release will be with SnappyData 1.3.0 in a month or so, which will also open-source most of the currently closed-source portions (security, offheap, AQP, connectors).

sumwale commented 3 years ago

@parimalaadinarayan You could also clone the current master and build only the JDBC jar using `./gradlew jdbcJar`, which will create the jar in `build-artifacts/scala-2.11/distributions/`.

sumwale commented 2 years ago

@parimalaadinarayan FYI the latest 1.3.0 release includes this fix and many others: https://github.com/TIBCOSoftware/snappydata/releases/