databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License

Error when running project level pipeline: Error in data.table::rbindlist(sapply(project_samples, FUN = yamlToDT, : Item 1 of input is not a data.frame, data.table or list #264

Closed by donaldcampbelljr 5 months ago

donaldcampbelljr commented 5 months ago

Currently, the sample-level pipeline runs fine, but the project-level pipeline fails in PEPATAC_summarizer.R while creating the summary plots.

Error in data.table::rbindlist(sapply(project_samples, FUN = yamlToDT,  :
  Item 1 of input is not a data.frame, data.table or list

The samples from the installation tutorial work fine. I confirmed this again using the newest build (PEPATAC v0.11.0).

More error detail from summary/PEPATAC_log.md:

> `Rscript /home/zzz3fh/pepatac_tutorial//tools/pepatac/tools/PEPATAC_summarizer.R /home/zzz3fh/t1d_pep_rivanna/t1d_project_config.yaml /project/shefflab/processed/pepatac_t1d/processed/results_pipeline /project/shefflab/processed/pepatac_t1d/processed/results_pipeline 2 5 1` (885189)
<pre>
Loading config file: /home/zzz3fh/t1d_pep_rivanna/t1d_project_config.yaml
Creating assets summary...
Summary (n=2): /project/shefflab/processed/pepatac_t1d/processed/results_pipeline/t1d_atac_assets_summary.tsv
Creating summary plots...
Error in data.table::rbindlist(sapply(project_samples, FUN = yamlToDT,  : 
  Item 1 of input is not a data.frame, data.table or list
Calls: <Anonymous> -> <Anonymous>
Execution halted
</pre>
Command completed. Elapsed time: 0:00:04. Running peak memory: 0.373GB.  
  PID: 885189;  Command: Rscript;   Return code: 1; Memory used: 0.373GB

Confirmed: assets_summary.tsv and stats_summary.yaml are generated fine (results do exist for each sample). R debugging also confirmed that the PEP is being loaded.

donaldcampbelljr commented 5 months ago

I believe I have tracked down the cause of this issue.

The tutorial example has two samples, but they do not report the same number of results (39 vs 38). Tutorial 1 has one extra reported result (Frac_exp_unique_at_10M).

When calling sapply(project_samples, FUN=yamlToDT, yaml_file=summary_file), the result returned for the tutorial examples is a list of lists.

However, for my real-world samples, where each sample has the exact same number of results, sapply returns a different data structure (RStudio reports it as a list, but it appears to be a data table).

This difference produces an error when passed to data.table::rbindlist():

stats <- data.table::rbindlist(sapply(project_samples, FUN=yamlToDT,
                                      yaml_file=summary_file), fill=TRUE)

producing the error:

Error in data.table::rbindlist(test_sapply, fill = TRUE) : 
  Item 1 of input is not a data.frame, data.table or list
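
A minimal sketch of what seems to be happening, using base R only. `toy_record`, the sample names, and the result counts are hypothetical stand-ins (the real `yamlToDT` is not shown here). It relies on documented `sapply` behavior: with the default `simplify = TRUE`, results of differing lengths are returned as a plain list, while results of equal length are collapsed into a matrix.

```r
# Hypothetical stand-in for yamlToDT: returns one named list per sample.
toy_record <- function(sample, n_results) {
  n <- n_results[[sample]]
  values <- as.list(seq_len(n))
  names(values) <- paste0("result_", seq_len(n))
  c(list(sample_name = sample), values)
}

# Hypothetical result counts: unequal (like the tutorial) vs. equal.
unequal <- c(tutorial1 = 3, tutorial2 = 2)
equal   <- c(sampleA = 3, sampleB = 3)

# Lengths differ, so sapply cannot simplify: a plain list of lists.
ragged <- sapply(names(unequal), FUN = toy_record, n_results = unequal)

# Lengths match, so sapply simplifies to a list-matrix (typeof is still "list").
square <- sapply(names(equal), FUN = toy_record, n_results = equal)

str(ragged, max.level = 1)
dim(square)          # rows = fields per sample, columns = samples
class(square[[1]])   # a single cell value, not a list or data.frame
```

In the second case each item of the object is an individual cell rather than a per-sample record, which would be consistent with rbindlist complaining that item 1 is not a data.frame, data.table or list.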

nsheff commented 5 months ago

side note: can you change from sapply to vapply? it's considered bad practice to use sapply (I think for exactly this reason)
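
For illustration, a small sketch of how `vapply` differs: it takes a `FUN.VALUE` template and raises an error when a result does not match it, instead of silently changing the return shape. `result_counts` and `get_values` are hypothetical and only mimic the 39 vs 38 mismatch on a small scale; this is not the summarizer's code.

```r
# Hypothetical per-sample result counts (stand-in for 39 vs 38).
result_counts <- c(tutorial1 = 3, tutorial2 = 2)

# Hypothetical extractor returning a numeric vector per sample.
get_values <- function(sample) as.numeric(seq_len(result_counts[[sample]]))

# sapply: lengths differ, so it quietly returns a list instead of a matrix.
loose <- sapply(names(result_counts), get_values)

# vapply: every result must match the template, otherwise it errors.
strict <- tryCatch(
  vapply(names(result_counts), get_values, FUN.VALUE = numeric(3)),
  error = function(e) conditionMessage(e)
)

is.list(loose)
strict   # the error message captured from vapply
```

Note that `vapply` still returns a vector or matrix rather than a list of per-sample tables, so on its own it may not be a drop-in replacement ahead of rbindlist.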

donaldcampbelljr commented 5 months ago

It appears that lapply is the solution (thanks @jpsmith5!): https://github.com/databio/pepatac/commit/0df70b647f5889b7218aefbf66679f726804f5bc

This clears the error and allows me to run the project-level pipeline on this set of samples.
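
A rough sketch of why `lapply` helps, assuming the fix amounts to swapping `sapply` for `lapply` in the rbindlist call quoted earlier (the commit itself is not reproduced here). `toy_row`, the sample names, and the column names are hypothetical. `lapply` never simplifies, so it returns one list element per sample regardless of how many results each sample reports.

```r
# Hypothetical stand-in for yamlToDT: one single-row table per sample.
toy_row <- function(sample) {
  data.frame(sample_name = sample, Aligned_reads = 100L, Peak_count = 10L)
}

project_samples <- c("sampleA", "sampleB")

# lapply always returns a plain list, even when every result has the same shape.
rows <- lapply(project_samples, FUN = toy_row)

# Bind the per-sample tables; fill = TRUE pads columns missing from a sample.
if (requireNamespace("data.table", quietly = TRUE)) {
  stats <- data.table::rbindlist(rows, fill = TRUE)
  print(stats)
}
```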

I will do a point release later today that incorporates this fix.