-
The solution the team provided for the issue of not being able to execute other cells after cancelling another cell (which leaves that cell "frozen" for a long time) is not a good enough solution,…
-
I am using Databricks with a Delta Lake in the background. The Databricks runtime is 10.2, with sparklyr 1.7.2.
```r
sc %>% filter(yearMonth >= 201801) %>% filter(yearMonth < 202202)
```
is much faste…
-
Spark 2.3 introduced a `repartitionByRange` option on dataframes. This could be used to improve the efficiency of `SortFullGroup` in the Parquet store (possibly avoiding the need to use RDDs, which co…
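The idea behind `repartitionByRange` is to sample the partitioning key, derive sorted range boundaries, and route each row to the partition whose range contains its key, so partitions come out globally ordered. A minimal pure-Python sketch of that idea (not Spark's implementation; the function name and sampling scheme are illustrative):

```python
from bisect import bisect_left
import random

def repartition_by_range(rows, key, num_partitions, sample_size=100):
    """Sketch of range partitioning: sample keys, pick boundary values,
    then bucket each row by which key range it falls into."""
    sample = sorted(key(r) for r in random.sample(rows, min(sample_size, len(rows))))
    # Take num_partitions - 1 evenly spaced boundary values from the sample.
    step = max(1, len(sample) // num_partitions)
    boundaries = sample[step::step][: num_partitions - 1]
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:
        # bisect_left finds the first boundary >= key, i.e. the partition index.
        partitions[bisect_left(boundaries, key(r))].append(r)
    return partitions
```

Because rows are bucketed by key range rather than hash, a subsequent per-partition sort yields a total order, which is what makes it attractive for a sort-then-group pass like `SortFullGroup`.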
-
**What is your question?**
This is a question for the Spark team of RAPIDS. As part of the cuIO refactor, we (the RAPIDS cuDF team) are currently working on adding fuzz-testing coverage for our Avro reader…
-
### Topic Suggestion
Creating a PySpark DataFrame: A Beginner's Guide
#### Proposed article introduction
We can distribute data and conduct calculations on several nodes of a cluster using Spark, a…
-
Using Spark version 3.2.1
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.14.1)
I load the XML files below. The first establishes the schema and the second contains the actual insta…
-
**Is your feature request related to a problem? Please describe.**
Problem: I need to calculate the similarity between texts stored in two columns of the same or different dataframes.
For example, the…
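In spirit, such a feature wraps a row-wise text-similarity metric applied across two columns. A minimal pure-Python sketch using token-level Jaccard similarity (outside Spark, with illustrative names; in Spark this logic would typically live in a UDF):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two texts (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)

def column_similarity(col_a, col_b):
    """Row-wise similarity between two columns (plain lists here)."""
    return [jaccard_similarity(x, y) for x, y in zip(col_a, col_b)]
```

Any row-wise metric (Levenshtein, cosine over embeddings, etc.) could be swapped in for the Jaccard function; the column-pairing shape stays the same.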
-
For test case:
```scala
test("test dataFrameComparer") {
  val df1 = spark.createDataFrame(
    spark.sparkContext.emptyRDD[Row],
    StructType(
      List(
        StructField("neste…
-
Sparklyr fails to parse `dplyr` syntax that uses the [`across`](https://dplyr.tidyverse.org/reference/across.html) function.
# Example
```r
# Settings
library("sparklyr", quietly = FALSE)
library("…
-
Problem:
With real-world Spark dataframes (e.g. 50 vector-assembled columns with real values, 130000 rows), I get this "An active CatBoost worker is already present in the current process" error when…