Use Datasets to gain the advantages of the optimizers Spark uses for Datasets
This does not seem like a good idea: the Scala compiler requires an implicit Encoder for the result type of every map/flatMap/etc. operation, and the implicit resolution does not work correctly for our types. The type information is also lost when transforming a Dataset back to an RDD.
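The encoder problem above can be illustrated with a minimal sketch. The `Record` class, app name, and sample data are placeholders, not taken from the project; the point is only where the compiler demands an `Encoder` and what survives the conversion back to an RDD:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical record type standing in for the project's data model.
case class Record(id: Long, name: String)

object DatasetEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._ // provides Encoders for case classes, tuples, primitives

    val ds: Dataset[Record] = Seq(Record(1L, "a")).toDS()

    // Every map/flatMap on a Dataset needs an implicit Encoder for the result
    // type. For types not covered by spark.implicits, compilation fails unless
    // an encoder is passed explicitly:
    val names: Dataset[String] = ds.map(_.name)(Encoders.STRING)

    // Converting back to an RDD drops the Dataset's schema/encoder information;
    // only the plain JVM element type remains.
    val rdd = names.rdd
    println(rdd.count())

    spark.stop()
  }
}
```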
Investigate if/how the data is distributed unequally, since a few tasks take much longer than the rest
The max block size was set to 500,000 in every config. This caused the blocking to create only a few small blocks, which were processed on only a few nodes. After several attempts I found that a maximum block size of 20,000 works well with a partition size of 512. I also removed 2 of the 4 blocking schemes that were in use (the sector and last-letter schemes).
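For reference, the tuned settings could be captured in a config fragment like the following; the key names are assumed for illustration and are not taken from the project's actual config format:

```
# Hypothetical blocking configuration (key names assumed)
blocking.maxBlockSize = 20000   # was 500000; the large cap yielded few blocks on few nodes
blocking.partitions   = 512     # partition size the block size was tuned against
# Of the four blocking schemes, the sector and last-letter schemes were removed.
```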