COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
156 stars 15 forks source link

alevin-fry not generating all required output files #121

Closed AnnaAMonaco closed 8 months ago

AnnaAMonaco commented 8 months ago

Hi, I'm trying to use alevin-fry to quantify 10X scRNA-seq data on a non-conventional model organism -- with an in-house reference. I mention this in case it could be the issue, as I have used alevin-fry before on D.melanogaster with no issue.

While running all steps (I put them below and the messages in a txt), I did not encounter any error messages or other warning that were obvious to me. However, it does not generate all needed files:

> checkAlevinInputFiles(baseDir = baseDir)
Error in checkAlevinInputFiles(baseDir = baseDir) : 
  Input directory not compatible with Salmon v0.14 or newer (without external whitelist), the following required file(s) are missing or malformed:
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/featureDump.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/aux_info/alevin_meta_info.json
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/whitelist.txt

Input directory not compatible with Salmon v0.14 or newer (with external whitelist), the following required file(s) are missing or malformed:
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/featureDump.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/aux_info/alevin_meta_info.json

Input directory not compatible with Salmon v0.14 or newer (without final whitelist), the following required file(s) are missing or malformed:
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt
../outs/mpimg_L2917

It seems like a version issue maybe (this in in R from alevinQC_1.14.0), but I am using the up to date versions for the quantification (salmon 1.10.2 and alevin-fry 0.8.2). This is what is in the output and alevin directories:

> ls -al outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg
total 4477344
drwxrws---  6 amonaco mmonagrp       4096 Oct  9 09:32 .
drwxrws--- 12 amonaco mmonagrp       4096 Oct  4 17:34 ..
drwxrws---  2 amonaco mmonagrp        120 Oct  9 09:32 alevin
drwxrws---  2 amonaco mmonagrp         36 Oct  9 08:56 aux_info
-rw-rw----  1 amonaco mmonagrp        725 Oct  5 14:57 cmd_info.json
-rw-rw----  1 amonaco mmonagrp        368 Oct  9 09:31 collate.json
-rw-rw----  1 amonaco mmonagrp    5495361 Oct  9 09:32 featureDump.txt
-rw-rw----  1 amonaco mmonagrp       1512 Oct  9 09:31 generate_permit_list.json
drwxrws---  2 amonaco mmonagrp         10 Oct  5 14:57 libParams
drwxrws---  2 amonaco mmonagrp         38 Oct  5 14:57 logs
-rw-rw----  1 amonaco mmonagrp 1589027263 Oct  9 09:31 map.collated.rad
-rw-rw----  1 amonaco mmonagrp 2699377127 Oct  9 08:56 map.rad
-rw-rw----  1 amonaco mmonagrp    1323496 Oct  9 09:31 permit_freq.bin
-rw-rw----  1 amonaco mmonagrp   10009096 Oct  9 09:31 permit_map.bin
-rw-rw----  1 amonaco mmonagrp       5678 Oct  9 09:32 quant.json
-rw-rw----  1 amonaco mmonagrp  278547192 Oct  9 08:56 unmapped_bc_count.bin
-rw-rw----  1 amonaco mmonagrp     971312 Oct  9 09:31 unmapped_bc_count_collated.bin

> ls -al outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/
total 232164
drwxrws--- 2 amonaco mmonagrp       120 Oct  9 09:32 .
drwxrws--- 6 amonaco mmonagrp      4096 Oct  9 09:32 ..
-rw-rw---- 1 amonaco mmonagrp       560 Oct  9 08:56 alevin.log
-rw-rw---- 1 amonaco mmonagrp 236002738 Oct  9 09:32 quants_mat.mtx
-rw-rw---- 1 amonaco mmonagrp    312478 Oct  9 09:32 quants_mat_cols.txt
-rw-rw---- 1 amonaco mmonagrp   1406189 Oct  9 09:32 quants_mat_rows.txt

These the commands ran:

/home/amonaco/miniconda3/bin/salmon alevin -lISR -1 $input/mpimg_L29179-1_AAL-104-1_S7_Lmrg_R1_001.fastq.gz -2 $input/mpimg_L29179-1_AAL-104-1_S7_Lmrg_R2_001.fastq.gz --chromiumV3 -i $index -p 12 -o $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg --tgMap $tsv --sketch --dumpFeatures
/home/amonaco/miniconda3/bin/alevin-fry generate-permit-list --input $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg --expected-ori fw --output-dir $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg --unfiltered-pl $whitelist
/home/amonaco/miniconda3/bin/alevin-fry collate -i $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg -r $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg -t 30
/home/amonaco/miniconda3/bin/alevin-fry quant -i $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg --tg-map $tsv -t 30 -r cr-like -o $outDir/mpimg_L29179-1_AAL-104-1_S7_Lmrg

Note: $whitelist is just the list of 10X barcodes 3M-february-2018.txt; I checked and the transcript names in the txp2gene.tsv correspond to the header lines of the transcriptome fasta used for indexing.

Maybe with the outputs you could help me troubleshoot. If you need any more information that I did not provide just let me know.

Anna

ateAlb_alevin-fry_logs.txt

rob-p commented 8 months ago

Hi @AnnaAMonaco,

I'm pinging @DongzeHE on this as well so he's in the loop. From a quick look, it was not immediately clear to me what file(s) were missing. Is there a specific file that e.g. AlevinQC is complaining about?

Thanks, Rob

AnnaAMonaco commented 8 months ago

Hi @rob-p,

Yes, AlevinQC is lamenting the lack of the following files (mostly in the alevin/ output directory):

../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/featureDump.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/aux_info/alevin_meta_info.json
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/whitelist.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/featureDump.txt
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/aux_info/alevin_meta_info.json
../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/raw_cb_frequency.txt

Most importantly, I'm missing the 'alevin/quants_mat.gz' file, so even if I try to use the .mtx file I cannot load it for the generation of a SCE or Seurat object.

> files <- file.path("../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg/alevin/quants_mat.mtx")
> file.exists(files)
[1] TRUE
> # Reading in the alevin quants quants
> txi <- tximport(files, type="alevin")
Error in readAlevin(files, dropInfReps, filterBarcodes, tierImport, forceSlow,  : 
  expecting 'files' to point to 'quants_mat.gz' file in a directory 'alevin'
  also containing 'quants_mat_rows.txt' and 'quant_mat_cols.txt'.
  please re-run alevin preserving output structure

Anna

AnnaAMonaco commented 8 months ago

UPDATE

I tried a different load approach and apparently works:

> fryDir <- "../outs/mpimg_L29179-1_AAL-104-1_S7_Lmrg"
> se <- loadFry(fryDir, outputFormat = "scRNA")
> assayNames(se)
> srt <- CreateSeuratObject(
    counts = assay(se, "counts"), 
    project = "ateAlb_test", 
    meta.data=as.data.frame(colData(se)))
> srt
An object of class Seurat 
29282 features across 82717 samples within 1 assay 
Active assay: RNA (29282 features, 0 variable features)
> head(srt@meta.data)
                  orig.ident nCount_RNA nFeature_RNA         barcodes
CTTGATTTCAAACTGC ateAlb_test      10105         3572 CTTGATTTCAAACTGC
CTACATTAGTTGGCTT ateAlb_test       9679         3730 CTACATTAGTTGGCTT
GTTGTCCGTGTCCATA ateAlb_test      10558         2994 GTTGTCCGTGTCCATA
TTCGCTGCAGTCAACT ateAlb_test      10498         3367 TTCGCTGCAGTCAACT
TGAGCGCCAGACGCTC ateAlb_test       9775         3442 TGAGCGCCAGACGCTC
ACTATTCTCTGCCTCA ateAlb_test       9937         3670 ACTATTCTCTGCCTCA

So the issue seems to be with just alevinQC.

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: MarIuX64 2.0 GNU/Linux

Matrix products: default
BLAS:   /pkg/R-4.2.2-0/lib/R/lib/libRblas.so
LAPACK: /pkg/R-4.2.2-0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] SeuratObject_4.1.4          Seurat_4.3.0.1
 [3] fishpond_2.4.1              scater_1.26.1
 [5] scran_1.26.2                scuttle_1.8.4
 [7] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
 [9] Biobase_2.58.0              GenomicRanges_1.50.2
[11] GenomeInfoDb_1.34.7         IRanges_2.32.0
[13] S4Vectors_0.36.2            BiocGenerics_0.44.0
[15] MatrixGenerics_1.10.0       matrixStats_0.63.0
[17] tximeta_1.16.0              tximport_1.26.1
[19] vioplot_0.4.0               zoo_1.8-11
[21] sm_2.2-5.7.1                ggplot2_3.4.0
[23] tibble_3.2.1                dplyr_1.1.3
[25] alevinQC_1.14.0
rob-p commented 8 months ago

Thanks for investigating further and reporting back here! Tagging @csoneson to comment on the AlevinQC issue.

DongzeHE commented 8 months ago

Hello @AnnaAMonaco,

As alevin-fry uses a different output layout than the original alevin, we made another phase in alevinQC, corresponding to those functions named "blablaalevinfryblabla". More details can be found in the Manual of the alevinQC package (https://bioconductor.org/packages/release/bioc/manuals/alevinQC/man/alevinQC.pdf). For example, in your case, you can try the alevinFryQCReport function discussed on page 10 of the Manual, where the mapDir, permitDir, and quantDir corresponds to the output of the salmon alevin, alevin-fry generate-permit-list and alevin-fry quant programs.

Please let us know if you have any problem running alevinfryQC!

Best, Dongze

AnnaAMonaco commented 8 months ago

Thanks @DongzeHE, I tried alevinfryQCreport and it worked out :) But I has to run it without permitDir as the "all_freq.tsv" is one of the files that was not generated.