TileDB-Inc / TileDB-Spark

Spark interface to the TileDB storage manager

no duplicates allowed? #154

Closed: bruno-ariano closed this issue 6 days ago

bruno-ariano commented 3 weeks ago

When writing a dataframe in TileDB format I got the following error:

Error: Duplicate coordinates (17437310, T2, CD1) are not allowed

I thought tiledb-spark allowed for sparse arrays?

leipzig commented 3 weeks ago

Hi Bruno, can you check that you allowed duplicates when creating the schema? https://docs.tiledb.com/main/how-to/arrays/creating-arrays/creating-the-array-schema#allowing-duplicates
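
For reference, in standalone TileDB-Py this flag lives on the sparse array schema itself; a minimal sketch (the dimension, attribute, and URI names here are illustrative, not taken from your array):

```python
import numpy as np
import tiledb

# Sketch: a 1-D sparse array whose schema tolerates duplicate coordinates
dom = tiledb.Domain(
    tiledb.Dim(name="position", domain=(0, 10**8), tile=10000, dtype=np.int64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    allows_duplicates=True,  # without this, writes with repeated coordinates fail
    attrs=[tiledb.Attr(name="beta", dtype=np.float64)],
)
tiledb.Array.create("dups_example", schema)  # "dups_example" is a placeholder URI
```

In tiledb-spark the same setting is exposed through the `schema.set_allows_dups` write option.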

DimitrisStaratzis commented 3 weeks ago

Hi @bruno-ariano, thank you for pointing this out! To add to @leipzig's response, it looks like our README was missing an explanation of the `schema.set_allows_dups` write option. For more information, please refer to #155

Update: our main branch has been updated to include the option description.

bruno-ariano commented 3 weeks ago

Thank you @leipzig and @DimitrisStaratzis for the quick answers. Using `schema.set_allows_dups` as shown below solved the issue.

```python
test_write.write.format("io.tiledb.spark") \
    .option("schema.dim.0.name", "position") \
    .option("schema.dim.1.name", "g1") \
    .option("schema.dim.2.name", "c2") \
    .option("schema.set_allows_dups", True) \
    .save("test_spark_tiledb")
```

However, it now seems I have another problem: when I load the saved data, queries on the position column do not work properly, as you can see below.

```python
df = spark.read.format("io.tiledb.spark").load("test_spark_tiledb")
df.createOrReplaceTempView("tiledbArray")
sql_df = spark.sql("SELECT * FROM tiledbArray WHERE position < 16849573")
sql_df.show()
```

```
+--------+---+---+-------+----+-----------------+
|position| g1| c2|      S|beta|          p-value|
+--------+---+---+-------+----+-----------------+
|23538587|E30| C1|C_23538|0.11|0.186792280441147|
|23538587|G20| C1|C_23538|0.21|0.644820366777995|
```

At first I thought that, by saving the data in sparse mode with duplicates allowed, the positions were somehow remapped to other indexes, or perhaps treated as strings. However, as you can see below, the schema shows the position column loaded as an integer, so I am not sure why this is happening.

```python
>>> df.schema
StructType([StructField('position', IntegerType(), False), StructField('g1', StringType(), False), StructField('c2', StringType(), False), StructField('S', StringType(), True), StructField('beta', DoubleType(), True), StructField('p-value', DoubleType(), True)])
```
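
For what it's worth, the stored dimension type can also be checked directly on the array with TileDB-Py, bypassing Spark entirely (a minimal sketch, assuming the standalone `tiledb` package is installed):

```python
import tiledb

# Load the schema of the array that Spark wrote, straight from its URI
schema = tiledb.ArraySchema.load("test_spark_tiledb")
print(schema)  # shows each dimension's dtype and domain, plus allows_duplicates
```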

Thank you

DimitrisStaratzis commented 3 weeks ago

If possible, could you please provide the code you used to create the dataframe, along with any data files (e.g. CSV)? I could then try to reproduce your issue.

Thank you!

bruno-ariano commented 2 weeks ago

Hi @DimitrisStaratzis, unfortunately I cannot share any part of my data, but I was able to recreate the issue by generating a random dataframe as follows.

First, create a random dataframe and write it somewhere:

```python
import pandas as pd
import numpy as np

# Number of rows
num_rows = 10000

# Create the 'position' column as an integer ranging from 0 to 9999
positions = np.arange(num_rows)

# Create the 'g1' column with a different string every 10 lines
g1_values = ['Group_' + str(i // 10) for i in range(num_rows)]

# Create the 'type' column with a different string every 100 lines
type_values = ['Type_' + str(i // 100) for i in range(num_rows)]

# Create the 'beta' and 't-stat' columns as random floats
beta_values = np.random.randn(num_rows)
t_stat_values = np.random.randn(num_rows)

# Combine all columns into a DataFrame
df = pd.DataFrame({
    'position': positions,
    'g1': g1_values,
    'type': type_values,
    'beta': beta_values,
    't-stat': t_stat_values
})

df.to_csv('dataframe_output_test_tiledb_spark_error.csv', index=False)
```

Then read it back using Spark, convert it to TileDB, read it back again, and query it in SQL:

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType
)

schema = StructType([
    StructField("position", IntegerType(), True),
    StructField("g1", StringType(), True),
    StructField("type", StringType(), True),
    StructField("beta", DoubleType(), True),
    StructField("t-stat", DoubleType(), True)
])

# Plain CSV read; the io.tiledb.spark format applies to the array write/read below
test_spark = spark.read.option("delimiter", ",") \
    .csv("dataframe_output_test_tiledb_spark_error.csv", header=True, schema=schema)

test_spark.write.format("io.tiledb.spark") \
    .mode("overwrite") \
    .option("schema.dim.0.name", "position") \
    .option("schema.dim.1.name", "g1") \
    .option("schema.dim.2.name", "type") \
    .option("schema.set_allows_dups", True) \
    .save("test_spark_tiledb_error")

df = spark.read.format("io.tiledb.spark").load("test_spark_tiledb_error")
df.createOrReplaceTempView("tiledbArray")
sql_df = spark.sql("SELECT * FROM tiledbArray WHERE position < 1345")
sql_df.show()
```

In this case, instead of getting wrong results, I get an error:

io.tiledb.java.api.TileDBError: TileDB internal: Lower range bound 10864 cannot be larger than the higher bound 1344

However, when I filter for positions greater than 1345, I do get results:

```python
sql_df = spark.sql("SELECT * FROM tiledbArray WHERE position > 1345")
sql_df.show()
```

(screenshot of the query results omitted)

Thank you

DimitrisStaratzis commented 1 week ago

Thank you for providing the example; it will be very helpful. We'll investigate and get back to you with updates as soon as possible.

DimitrisStaratzis commented 1 week ago

Hi @bruno-ariano ,

Here is how to use the csv to create the TileDB array correctly.

1. Create the csv file, no changes here:

```python
import pandas as pd
import numpy as np

# Number of rows
num_rows = 10000

# Create the 'position' column as an integer ranging from 0 to 9999
positions = np.arange(num_rows)

# Create the 'g1' column with a different string every 10 lines
g1_values = ['Group_' + str(i // 10) for i in range(num_rows)]

# Create the 'type' column with a different string every 100 lines
type_values = ['Type_' + str(i // 100) for i in range(num_rows)]

# Create the 'beta' and 't-stat' columns as random floats
beta_values = np.random.randn(num_rows)
t_stat_values = np.random.randn(num_rows)

# Combine all columns into a DataFrame
df = pd.DataFrame({
    'position': positions,
    'g1': g1_values,
    'type': type_values,
    'beta': beta_values,
    't-stat': t_stat_values
})

df.to_csv('dataframe_output_test_tiledb_spark_error.csv', index=False)
```


2. Read the csv file as a dataframe, write the dataframe to a TileDB array, and read it back with filtering:

```java
// read the csv to a spark dataframe
Dataset<Row> ds = session().read().option("header", true).csv(pathFinder("test.csv"));
String URI = "test_spark_tiledb_error";

// Write the dataframe to a TileDB array
ds.write()
    .format("io.tiledb.spark")
    .option("schema.dim.0.name", "position")
    .option("schema.dim.1.name", "g1")
    .option("schema.dim.2.name", "type")
    .option("schema.set_allows_dups", true)
    .mode("overwrite")
    .save(URI);

// query the TileDB array
Dataset<Row> dfRead = session().read().format("io.tiledb.spark").load(URI);
dfRead.createOrReplaceTempView("array");
List<Row> rows = session().sql("SELECT * FROM array WHERE position < 200").collectAsList();

for (Row row : rows) {
  System.out.println(row);
}
```

This is what I got. Filtering works correctly:

```
[1,Group_0,Type_0,0.7140633613863734,-1.5607897006032276]
[2,Group_0,Type_0,-0.43556315490071923,0.11598187618451235]
[20,Group_2,Type_0,0.33859746164880045,0.5415691069765249]
[21,Group_2,Type_0,-2.6373976798948218,0.6626679724836054]
[22,Group_2,Type_0,-0.1578434562049294,1.525280443033808]
[23,Group_2,Type_0,-0.8054924900371718,0.5219545805367286]
[24,Group_2,Type_0,-0.5230043662208246,-2.4821286416718777]
[25,Group_2,Type_0,0.7893672114777848,0.3872085673829485]
[26,Group_2,Type_0,0.028572200440293936,0.5275090721884008]
[27,Group_2,Type_0,0.6361561139369373,0.547836282627302]
[28,Group_2,Type_0,-0.6270080357212171,-0.7351686044047386]
[29,Group_2,Type_0,-1.5849820730485475,0.4779947126990583]
[3,Group_0,Type_0,0.7789784730071696,-1.28372834494457]
[4,Group_0,Type_0,-0.03983498011242886,-0.9151778110281286]
[40,Group_4,Type_0,0.1906891384641098,0.6633156188181292]
[41,Group_4,Type_0,-0.2030681698098284,-0.6252782783799097]
[42,Group_4,Type_0,-0.15976985729555104,0.04974440083862088]
[43,Group_4,Type_0,-0.9936483951229517,0.23009331642449832]
[44,Group_4,Type_0,-0.11716288209360497,1.134481637646418]
[45,Group_4,Type_0,0.02885004417749569,0.6095860093275269]
[46,Group_4,Type_0,-1.0777671073687984,-0.11901481637395012]
[47,Group_4,Type_0,0.823872653006456,-0.3403317670263348]
[48,Group_4,Type_0,0.16487323447038116,-0.27453979999337563]
[49,Group_4,Type_0,-0.3574031550506953,1.0827697656937363]
[5,Group_0,Type_0,-1.0298800617927897,1.1753829714698938]
[6,Group_0,Type_0,0.27788387520122976,-0.008070715859147738]
[60,Group_6,Type_0,-0.28217364191316163,0.3431937937585599]
[61,Group_6,Type_0,-0.3760760505614729,-0.47693919576911475]
[62,Group_6,Type_0,-0.6936077098298356,0.6341519704108778]
[63,Group_6,Type_0,0.2535993137909195,-0.9102450575537355]
[64,Group_6,Type_0,0.2829082825076695,0.025227076444501888]
[65,Group_6,Type_0,-0.302752629460327,0.0399990425261961]
[66,Group_6,Type_0,-2.2906126323906717,-1.1567696915998409]
[67,Group_6,Type_0,-0.0888442493881703,1.1634481764542677]
[68,Group_6,Type_0,1.0191940163546103,1.1767929469996714]
[69,Group_6,Type_0,-0.4965430698363608,1.318304662962034]
[7,Group_0,Type_0,1.069131299873406,1.199642471446047]
[8,Group_0,Type_0,-1.041059156459765,-0.828077499794811]
[80,Group_8,Type_0,-0.45808353389895196,-0.3504367313323167]
[81,Group_8,Type_0,0.9758261671164112,0.3406078590005372]
[82,Group_8,Type_0,1.544385680082129,0.7085669113602985]
[83,Group_8,Type_0,0.016957781578680088,0.9463766139949195]
[84,Group_8,Type_0,-1.4694177455611175,0.1469995200983268]
[85,Group_8,Type_0,0.2847485198612063,0.5535947026414727]
[86,Group_8,Type_0,-0.36304407096135893,0.18250614055637476]
[87,Group_8,Type_0,-0.8813219532613537,1.570899119745016]
[88,Group_8,Type_0,-0.03339645223136862,-2.155072608193449]
[89,Group_8,Type_0,0.5710856150934347,-0.7448251587182706]
[9,Group_0,Type_0,-0.8245724150968157,-0.1270826234139261]
```
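
For completeness, a rough PySpark equivalent of the Java snippet above (an untested sketch; it assumes an existing `SparkSession` named `spark` and the same CSV file, here called `test.csv`):

```python
# Read the csv into a Spark dataframe
ds = spark.read.option("header", True).csv("test.csv")
uri = "test_spark_tiledb_error"

# Write the dataframe to a TileDB array
ds.write.format("io.tiledb.spark") \
    .option("schema.dim.0.name", "position") \
    .option("schema.dim.1.name", "g1") \
    .option("schema.dim.2.name", "type") \
    .option("schema.set_allows_dups", True) \
    .mode("overwrite") \
    .save(uri)

# Query the TileDB array
df_read = spark.read.format("io.tiledb.spark").load(uri)
df_read.createOrReplaceTempView("array")
spark.sql("SELECT * FROM array WHERE position < 200").show()
```
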
bruno-ariano commented 6 days ago

Thanks @DimitrisStaratzis