UCSF-DSCOLAB / data_processing_pipelines

A repository to store the existing pipelines to process the various CoLabs datasets

Pre-qc V1 Refactor #11

Closed amadeovezz closed 1 year ago

amadeovezz commented 1 year ago

What

The original intention of this PR was to refactor the param parsing at the beginning of the workflow (pre-qc). To do that, however, the scope had to be extended to a general refactor of the entire pipeline. The ultimate goal is to improve the readability and extensibility of the pipeline. Before we can tackle extensibility, though, it helps to abstract away and group together complicated functionality, and that is the central focus of this PR.

TODO

CODE PATHS

Note: demuxlet punted to future PRs

REGRESSION testing

DEMUXLET ERRORS

Before the refactor, I get the following error when doublet_finder = true.

ERROR ~ Error executing process > 'FIND_DOUBLETS (1)'

Caused by:
  Process `FIND_DOUBLETS (1)` terminated with an error exit status (1)

Command executed:

  Rscript /c4/home/amazzara/data_processing_pipelines/single_cell_RNAseq/bin/find_doublets.R raw_feature_bc_matrix.h5 TEST-POOL-DM2-SCG1.clust1.samples.reduced.tsv TEST-POOL-DM2-SCG1 100 100 3 21212 /c4/home/amazzara/data_processing_pipelines/single_cell_RNAseq

Command exit status:
  1

Command output:
  [2023-09-19 12:32:07] Dimension of the GEX count data:11456 x 99
  [1] "orig.ident"   "nCount_RNA"   "nFeature_RNA" "DROPLET.TYPE" "BEST.GUESS"  

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  The legacy packages maptools, rgdal, and rgeos, underpinning this package
  will retire shortly. Please refer to R-spatial evolution reports on
  https://r-spatial.org/r/2023/05/15/evolution4.html for details.
  This package is now running under evolution status 0 
  Attaching SeuratObject
  -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
  v dplyr     1.1.2     v readr     2.1.4
  v forcats   1.0.0     v stringr   1.5.0
  v ggplot2   3.4.2     v tibble    3.2.1
  v lubridate 1.9.2     v tidyr     1.3.0
  v purrr     1.0.1     
  -- Conflicts ------------------------------------------ tidyverse_conflicts() --
  x dplyr::filter() masks stats::filter()
  x dplyr::lag()    masks stats::lag()
  i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  Genome matrix has multiple modalities, returning a list of matrices for this genome
  Error in sngObj@misc$scStat$fmlDropletTypeProp["DBL", ] : 
    subscript out of bounds
  Calls: runDoubletFinder
  Execution halted

Work dir:
  /c4/scratch/amazzara/nextflow/e8/eb21b3e1d752e13184359e518da9c0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

To keep the scope of this PR down, we will punt this fix to a future PR.
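
For context on the failure mode: in R, `subscript out of bounds` on `fmlDropletTypeProp["DBL", ]` means the table has no row named "DBL". One plausible cause, not confirmed here, is that no droplets were labelled as doublets in the small test pool (the log reports 11456 x 99). Below is a minimal Python sketch of that situation and of a defensive lookup. The function names and the "SNG" label are hypothetical and are not taken from find_doublets.R.

```python
from collections import Counter


def droplet_type_proportions(labels):
    """Return the fraction of droplets per type, e.g. {"SNG": 0.75, "DBL": 0.25}.

    Types that never occur get no entry at all, which is what makes a
    direct lookup of "DBL" fail on a pool without doublets.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {droplet_type: n / total for droplet_type, n in counts.items()}


def doublet_fraction(labels):
    """Look up the doublet fraction, treating a missing "DBL" entry as 0.0."""
    return droplet_type_proportions(labels).get("DBL", 0.0)


# Hypothetical pool of 99 droplets with no doublets: a direct ["DBL"] lookup
# would raise KeyError, while the guarded lookup returns 0.0.
print(doublet_fraction(["SNG"] * 99))
```

Whether a missing "DBL" row should count as zero doublets or as a hard error is a decision for the future PR.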

FUTURE TODOs

BONUS

What testing looks like

[Screenshot 2023-10-03 at 4:05:57 PM]

erflynn commented 1 year ago

this is looking great - thank you!

amadeovezz commented 1 year ago

most of the work is actually in the pre_qc file which I'm going to push later today :)

amadeovezz commented 1 year ago

@erflynn the refactor in terms of code is complete - all other improvements will be punted to future PRs (this PR has already crept in scope). Feel free to add any comments if you like :)

Tomorrow I'll be doing some regression testing as a final check (this will not change the structure of the code).
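
As a rough idea of what that check could look like (a sketch only, not the actual test setup): run the pipeline before and after the refactor on the same input, then compare the two output directories file by file. The helper names and directory layout here are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_outputs(baseline_dir, refactor_dir):
    """Report files that are missing, extra, or changed after the refactor."""
    before = checksum_tree(baseline_dir)
    after = checksum_tree(refactor_dir)
    return {
        "missing": sorted(set(before) - set(after)),
        "extra": sorted(set(after) - set(before)),
        "changed": sorted(
            name for name in before.keys() & after.keys()
            if before[name] != after[name]
        ),
    }
```

A byte-level comparison like this will also flag outputs that embed timestamps or depend on unseeded randomness, so those files would need a looser comparison.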

erflynn commented 1 year ago

This looks fantastic! Awesome work.

Really excited about the refactor and addition of tests, also the README updates are very helpful. I'll be trying this when I'm back, and can also help set up some test cases then.

I think this is good to merge, but it's worth flagging to folks in the lab who might be using it that this version has a major update to the "pre_qc" step.