astronomy-commons / axs

Astronomy eXtensions for Spark: Fast, Scalable Analytics of Billion+ Row Catalogs
https://axs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Metadata table misnamed #1

Closed enriqueutrilla closed 5 years ago

enriqueutrilla commented 5 years ago

I think there is a bug in the method getSparkTableId() of CatalogUtils.java (commit f97d002228b288e12da0fa3e34b41bb066711c22). This method is trying to get the TBL_ID from the table HIVE_SCHEMA+"TBLS", but this table does not exist. The table created in the method createDbTableIfNotExists() and used in all other methods of that class is "AXSTABLES".

Thus, when trying to save a table with AxsCatalog.save_axstable(), an exception is thrown:

```
Py4JJavaError: An error occurred while calling o46.saveNewTable.
: java.sql.SQLSyntaxErrorException: Table/View 'APP.TBLS' does not exist.
	at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(Unknown Source)
	at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
	at org.apache.derby.client.am.ClientStatement.executeQuery(Unknown Source)
	at org.dirac.axs.util.CatalogUtils.getSparkTableId(CatalogUtils.java:134)
	at org.dirac.axs.util.CatalogUtils.saveNewTable(CatalogUtils.java:203)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: ERROR 42X05: Table/View 'APP.TBLS' does not exist.
	at org.apache.derby.client.am.ClientStatement.completeSqlca(Unknown Source)
	at org.apache.derby.client.net.NetStatementReply.parsePrepareError(Unknown Source)
	at org.apache.derby.client.net.NetStatementReply.parsePRPSQLSTTreply(Unknown Source)
	at org.apache.derby.client.net.NetStatementReply.readPrepareDescribeOutput(Unknown Source)
	at org.apache.derby.client.net.StatementReply.readPrepareDescribeOutput(Unknown Source)
	at org.apache.derby.client.net.NetStatement.readPrepareDescribeOutput(Unknown Source)
	at org.apache.derby.client.am.ClientStatement.readPrepareDescribeOutput(Unknown Source)
	at org.apache.derby.client.am.ClientStatement.flowExecute(Unknown Source)
	at org.apache.derby.client.am.ClientStatement.executeQueryX(Unknown Source)
	... 14 more
```

(A Derby database had been set up as the Hive metastore.)

zecevicp commented 5 years ago

Hi Enrique, the table TBLS is a Spark-generated (or rather Hive-generated) metastore table that contains information about the tables Spark manages. AXS reads this information and augments it with information in its own metastore table, called AXSTABLES. So this is not an error and the problem is somewhere else.
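
For illustration, the failing lookup in getSparkTableId() is conceptually something like the sketch below; the exact query text and column names are assumptions, not copied from CatalogUtils.java:

```scala
// Hypothetical sketch of what getSparkTableId() does: look up the
// Hive metastore's TBLS table by name to obtain the table id.
// The query text, column names, and "my_table" are assumptions.
val conn = java.sql.DriverManager.getConnection("jdbc:derby:metastore_db", "", "")
val stmt = conn.prepareStatement(
  "select TBL_ID from APP.TBLS where TBL_NAME = ?") // fails with ERROR 42X05 if APP.TBLS is missing
stmt.setString(1, "my_table")
val rs = stmt.executeQuery()
if (rs.next()) println(rs.getLong(1))
```

If the Derby database AXS connects to was never initialized by Hive, APP.TBLS does not exist and this query raises exactly the SQLSyntaxErrorException in the report above.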

Can you please open a spark-shell (not pyspark), inside the directory where the metastore_db Derby folder is located, and execute the following code in the shell:

```scala
Class.forName("org.apache.derby.jdbc.EmbeddedDriver")
val conn = java.sql.DriverManager.getConnection("jdbc:derby:metastore_db;create=true", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select s.SCHEMANAME, t.TABLENAME from SYS.SYSTABLES t, SYS.SYSSCHEMAS s where t.schemaid = s.schemaid")
while (rs.next()) {
  println(rs.getString(1) + "." + rs.getString(2))
}
```

Then report back the results.

zecevicp commented 5 years ago

Oh, maybe you are using Derby in server mode? Is that the case?

enriqueutrilla commented 5 years ago

Yes, indeed. I did set up a separate Derby process as the metastore.

zecevicp commented 5 years ago

Do you have the connection details in hive-site.xml?

enriqueutrilla commented 5 years ago

Yes, indeed. I initially forgot about that and got a different error just by importing AxsCatalog, so I replicated the service I have been using in my Spark cluster (set up as a separate, non-embedded service so that we can have several Jupyter notebooks open at once).

zecevicp commented 5 years ago

What are the contents of your hive-site.xml?

enriqueutrilla commented 5 years ago

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://*MY_SPARK_MASTER_NODE*:30009/memory:hive_metastore;create=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>*REDACTED*</value>
  </property>
</configuration>
```

I also added the following two properties this morning so that the schema is generated automatically, but they don't seem to work:

```xml
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>false</value>
</property>
```

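
(Side note, depending on the Hive version in use: in Hive 2.x the DataNucleus auto-create switch was renamed, so the equivalent setting there would be the following; whether this applies to your setup is an assumption.)

```xml
<property>
  <name>datanucleus.schema.autoCreateAll</name>
  <value>true</value>
</property>
```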
zecevicp commented 5 years ago

When you open spark-shell and execute this line: spark.sparkContext.hadoopConfiguration.get("javax.jdo.option.ConnectionDriverName") what do you get?

zecevicp commented 5 years ago

I'm guessing that your configuration is not getting picked up as it ordinarily should, so AXS falls back to the local embedded Derby database. Naturally, the table doesn't exist there, hence the exception. If that's the case, the Hadoop config above should be empty and AXS should be changed to obtain the configuration in some other way. But first we need to diagnose the problem, so please send the output of the above command. Thanks
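
A minimal sketch of that "other way", assuming hive-site.xml is on the classpath (e.g. in $SPARK_HOME/conf) and using Hive's HiveConf class directly; whether AXS should adopt this is an open question:

```scala
// Sketch: read the metastore JDBC settings directly from hive-site.xml
// via HiveConf, bypassing the Hadoop configuration that came back null.
// Assumes hive-site.xml is visible on the classpath.
import org.apache.hadoop.hive.conf.HiveConf

val hiveConf = new HiveConf()
val url    = hiveConf.get("javax.jdo.option.ConnectionURL")
val driver = hiveConf.get("javax.jdo.option.ConnectionDriverName")
println(s"metastore url: $url, driver: $driver")
```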

enriqueutrilla commented 5 years ago

Hi Petar,

I've done that, and I'm getting null. But opening a Jupyter notebook and creating an AxsCatalog resulted in the AXSTABLES table being created (I was able to check that using a SQL client), so the properties are being read.

I was able to fix it using Hive's schematool script to populate the schema. The trick is that in that case no username/password must be defined. Otherwise the tables end up in that user's schema, which is what Hive's ObjectStore needs to initialize properly, but then AxsCatalog (which assumes the schema is going to be "APP" for all Derby databases) cannot find them.
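
For reference, the schematool invocation was along these lines; the path and the -dbType value are assumptions for a networked Derby metastore:

```
# Populate the Hive metastore schema in the networked Derby database.
# Connection settings are taken from hive-site.xml; with no
# ConnectionUserName/ConnectionPassword set there, the tables land in
# Derby's default APP schema, which is where AxsCatalog looks for them.
$HIVE_HOME/bin/schematool -dbType derby -initSchema
```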

As an alternative, I also verified that setting the property datanucleus.autoCreateSchema=true does create the tables as they are accessed. The catch is that it does so lazily, i.e. only after running dataframe.saveAsTable(), so AxsCatalog, which uses direct JDBC rather than DataNucleus, will fail if it is run before any such Spark SQL command.
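
A sketch of that workaround, assuming datanucleus.autoCreateSchema is enabled: issue one Spark SQL write first so DataNucleus creates the metastore tables, and only then go through AxsCatalog (the table name below is purely illustrative):

```scala
// Force DataNucleus to create the metastore tables lazily by issuing
// one Spark SQL write before touching AxsCatalog's direct-JDBC path.
// "warmup_table" is just an illustrative name.
spark.range(1).write.mode("overwrite").saveAsTable("warmup_table")

// Only now is it safe to use AxsCatalog, which queries the metastore
// tables directly over JDBC.
```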

Thank you very much for your assistance,

zecevicp commented 5 years ago

Glad to hear you have it working now. I'll think about how to support that use case in the future.