Open nitinmnsn opened 2 years ago
Yeah, you described all correctly. The order isn't guaranteed in general. Can you try to turn on the `compute.ordered_head` configuration?
I did that now with `ps.set_option('compute.ordered_head', True)`. The results are exactly the same.
Order is not the only source of my confusion though. There is a lot I do not understand in the example I described above:
- `distributed` and `distributed-sequence` `default_index_type` and 1 partition for `sequence` `default_index_type`. I have checked my `maxPartitionBytes` and it is set to 128 MB: `spark.conf.get('spark.sql.files.maxPartitionBytes')` outputs `'134217728b'`.
- `distributed-sequence`: Why is the order of indexes maintained?
- `distributed`: Why are there no random, monotonically increasing indexes? The indexes are the same 1, 2, 3, 4, ... as they are in the other two `default_index_type` configurations.

Many thanks for the time and effort you are putting in to help me. :)
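As a quick check of the arithmetic in the first point, the sketch below converts the reported string `'134217728b'` into MiB. The helper name `parse_spark_bytes` is hypothetical and not part of Spark. As far as the Spark configuration documentation describes it, `spark.sql.files.maxPartitionBytes` takes effect when reading file-based sources, so it may not explain the partition count of a DataFrame built from in-memory data.

```python
def parse_spark_bytes(value: str) -> int:
    """Parse a byte-count string such as '134217728b' into an integer.

    Hypothetical helper for illustration; it only handles a plain
    number with an optional trailing 'b'.
    """
    text = value.strip().lower()
    if text.endswith("b"):
        text = text[:-1]
    return int(text)


max_partition_bytes = parse_spark_bytes("134217728b")
size_in_mib = max_partition_bytes / (1024 * 1024)
print(size_in_mib)  # 128.0
```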
I do understand the number of partitions issue. It was a lapse in my understanding.
But, I do think that the default indexes are not generated correctly. Are there any updates on this?
This is done on pandas on pyspark, but the same is true for koalas as well (at least when I last tested). There are 3 different kinds of default indexes in pandas on pyspark. I am not able to replicate their stated behavior:
Setting up to test:
tests:
Question: Why is the number of partitions not 1? When the default index is set to `sequence`, all the data must be collected on a single node.
tests:
Questions: The dataframe being distributed to all 8 cores is the expected behaviour, but the indexes should not be ordered, which they are. It seems this behaviour is also like the `sequence` type default index only.
tests:
Questions: This is also `sequence` type behaviour only. The index generated is an ordered sequence from 1 to wherever. It should be monotonically increasing numbers with an indeterministic gap.

Can somebody please help me clarify what I am not understanding correctly and what the exact expected behaviour is for all three types of the default index?
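To make the three expectations concrete, here is a toy model in plain Python. It is a sketch of how the documentation describes the index types, not the library's implementation, and all function names are hypothetical. It assumes `sequence` numbers the rows globally on one partition, `distributed-sequence` produces the same global numbering by adding a per-partition offset, and `distributed` follows Spark's `monotonically_increasing_id` layout (partition ID in the upper 31 bits, row number within the partition in the lower 33 bits).

```python
from itertools import accumulate


def sequence_index(partitions):
    """All rows are numbered 0..n-1 as if gathered on one partition."""
    total = sum(len(p) for p in partitions)
    return [list(range(total))]


def distributed_sequence_index(partitions):
    """Same global numbering, but each partition keeps its own rows.

    Each partition starts at the number of rows in all earlier partitions.
    """
    sizes = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [[off + i for i in range(n)] for off, n in zip(offsets, sizes)]


def distributed_index(partitions):
    """Model of monotonically_increasing_id: (partition_id << 33) + row."""
    return [
        [(pid << 33) + i for i in range(len(p))]
        for pid, p in enumerate(partitions)
    ]


partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
print(sequence_index(partitions))              # [[0, 1, 2, 3, 4, 5]]
print(distributed_sequence_index(partitions))  # [[0, 1, 2], [3, 4], [5]]
print(distributed_index(partitions))
# [[0, 1, 2], [8589934592, 8589934593], [17179869184]]

# With a single partition, the "distributed" model is indistinguishable
# from a plain sequence, because partition 0 contributes no offset.
single = [["a", "b", "c", "d", "e", "f"]]
print(distributed_index(single))               # [[0, 1, 2, 3, 4, 5]]
```

Under this model, an ordered index for `distributed-sequence` is the expected result, and gaps in `distributed` only appear once rows sit in more than one partition at the moment the index is attached. Whether that is what happened in the tests above is an assumption and is not confirmed in this thread.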