asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

segfault error using dtwclust #71

Closed hichew22 closed 1 month ago

hichew22 commented 1 month ago

I am trying to use dtwclust as follows. I have a large dataframe consisting of longitudinal Y values (1 value per day, 31 values per individual, and ~700 individuals). There are no missing values.

df_ts <- tsmatrix(df, response = "Y")
dtwclust::tsclust(
  df_ts,
  k = 4L
)

This code chunk works when I am directly calling it, but when I try to knit, the following segfault error occurs:

[Image: screenshot of the segfault traceback]

Could you help me with this?

asardaes commented 1 month ago

Hm, the top-most call seems to come from RcppParallel, so it's likely related to that. Did you customize its installation in any way?

hichew22 commented 1 month ago

I don’t think so, since I don’t recall installing that specific package on its own (may have installed it as a dependency?). Should I try removing and then re-installing?

asardaes commented 1 month ago

Not sure, I actually don't know if RcppParallel still dynamically compiles its backend or not. What CPU do you have?

asardaes commented 1 month ago

Ah, I missed the part about knitting, so I don't know how accurate the traceback is. A quick search returned this; it's a bit old, but maybe your setup suffers from something similar?

hichew22 commented 1 month ago

I think it’s closer to this situation: https://discourse.mc-stan.org/t/segfault/3240/6

Have you encountered this for any of your users?

I also wonder if there is a way to set the RCPP_PARALLEL_NUM_THREADS environment variable or disable parallel computation, as suggested here: https://github.com/philips-software/latrend/issues/159 This is how I originally discovered the problem. Interestingly, when I knit my .Rmd file with the latrendData, I don’t get this error. But when I use my own dataset (which is larger), I do.

asardaes commented 1 month ago

I imagine this would work (note the comment from Jaleks).

However, if reading the env var was the problem, I think the call to defaultNumThreads wouldn't appear in the traceback, so I still think it would be interesting to know which CPU you have.
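For reference, a minimal sketch of limiting RcppParallel to a single thread, assuming the crash is tied to multi-threading (which is not confirmed in this thread). It would go in the first chunk of the .Rmd, before any dtwclust or latrend call. `Sys.setenv` is base R and `setThreadOptions` is an RcppParallel function; whether this actually avoids the segfault on this setup is an assumption.

```r
# Assumption: capping RcppParallel at one thread sidesteps the crash.
# Set the environment variable before any parallel code runs.
Sys.setenv(RCPP_PARALLEL_NUM_THREADS = "1")

# If RcppParallel is installed, apply the same limit through its own API.
if (requireNamespace("RcppParallel", quietly = TRUE)) {
  RcppParallel::setThreadOptions(numThreads = 1L)
}

# Confirm the value seen by this R session.
n_threads <- Sys.getenv("RCPP_PARALLEL_NUM_THREADS")
```

If the document still crashes with a single thread, the thread count is probably not the cause.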

hichew22 commented 1 month ago

I have the M3 MacBook Pro with 11‑core CPU, 14‑core GPU, 16‑core Neural Engine. Is that the information you’re looking for? Let me know if this would change what I should try. Should I try to set something with the defaultNumThreads?

asardaes commented 1 month ago

Could you post the basic config of the document you're trying to knit? And you're knitting from rstudio?

There seem to be a lot of examples of segfaults when I search for R and Apple silicon, so it might be tricky, but I can at least try to reproduce. Maybe check this comment.

hichew22 commented 1 month ago

I figured out what is causing the issue, but I'm not sure why.

The basic configuration of my Rmd document is:

1) Import necessary packages:

library(openxlsx)
library(here)
library(janitor)
library(tidyverse)
library(labelled)
library(lubridate)
library(zoo)
library(latrend)
library(kmlShape)
library(plotly)

2) Wrangle my dataset, which is a long dataframe consisting of 31 daily values for 700 individuals.

3) Cluster the longitudinal data using {kmlShape}.

4) Cluster the longitudinal data using the KML method from {latrend}.

5) Cluster the longitudinal data using the DTW method from {latrend} (which uses dtwclust), like so:

# Specify trajectory identifier and time variables
options(latrend.id = "id", latrend.time = "time")

# Fit DTW model with 4 clusters
dtw_method <-
  lcMethodDtwclust(response = "value",
                   nClusters = 4,
                   nbRedrawing = 1)
dtw_method

dtw_model <- latrend(dtw_method, data = df)

At this last line above, I get the segfault error. This segfault error occurs even when I call

df_ts <- tsmatrix(df, response = "value")
dtwclust::tsclust(
  df_ts,
  k = 4L
)

directly.

(Step 6 onwards: try other clustering methods on my dataset using {latrend}.)

When I remove library(kmlShape) and the chunks for steps 3 and 4, the document knits correctly. When I add back any one of these three components, I get the segfault error. I'm not sure why step 5 conflicts with them. Do you know why this might be?
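The same add-back test can also be run outside of knitting. A minimal sketch, assuming the crash reproduces in a fresh R process: `run_fresh` is a hypothetical helper (not part of any package) that runs an expression through Rscript and returns the exit status. A non-zero status means the child process failed or crashed.

```r
# Hypothetical helper: evaluate R code in a fresh Rscript process and
# return its exit status (0 on success, non-zero on error or crash).
run_fresh <- function(expr_text) {
  rscript <- file.path(R.home("bin"), "Rscript")
  system2(rscript, c("-e", shQuote(expr_text)), stdout = FALSE, stderr = FALSE)
}

# Sanity check with a trivial expression that should always succeed.
ok_status <- run_fresh("invisible(1 + 1)")

# Example checks (commented out because they need the packages installed):
# run_fresh("library(dtwclust)")
# run_fresh("library(kmlShape); library(dtwclust)")
```

Comparing the statuses with and without a given package loaded narrows down which component triggers the failure.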

asardaes commented 1 month ago

I rather meant the metadata from the document, things like these:

---
title: "blah"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using table.express}
  %\VignetteEngine{knitr::rmarkdown_notangle}
  %\VignetteEncoding{UTF-8}
---

Also, from what I can tell, kmlShape was archived on CRAN, and its last status shows the compiled code had some significant warnings, so that could be an issue outside of my control. The kml package from the same authors seems to be maintained; does that one provide the same functionality you're currently using?

hichew22 commented 1 month ago

Indeed it seems to be an issue with kmlShape. When I remove all other code chunks from the Rmd and try to knit, it gives me a segfault C stack overflow error.

When I remove all kmlShape-related chunks, the document knits completely (including any code chunks using the latrend package).

I'm not sure what the issue with kmlShape is, but I will just remove that from my Rmd.

Thank you so much for your help looking into this issue!