COMBINE-lab / minnow

Less number of reads are produced that the truth matrix error #24

Closed baraaorabi closed 2 years ago

baraaorabi commented 2 years ago

What causes this error?

extern/minnow/build/src/minnow simulate --splatter-mode \
  --inputdir ../../analysis/simulation/minnow_splatter/C1/8000-5000-42 \
  -r ../../analysis/simulation/minnow_index/ref_k101_fixed.fa \
  --g2t ../../analysis/simulation/salmon_tx2gene.tsv \
  --PCR 4 -e 0.01 -p 64 \
  -o ../../analysis/simulation/minnow_simulate/C1/8000-5000-42 \
  --dbg --gfa ../../analysis/simulation/minnow_index/dbg.gfa \
  -w extern/3M-february-2018.txt \
  --countProb ../../analysis/simulation/minnow_estimate/C1/countProb.txt \
  --custom --gencode
[2021-11-15 16:45:02.496] [minnow-Log] [info] Reading reference sequences ...
[2021-11-15 16:45:03.670] [minnow-Log] [info] Reference sequence is loaded ...
[2021-11-15 16:45:03.972] [minnow-Log] [info] Skipped 131293 transcriptsbecause either short or not present in reference
[2021-11-15 16:45:03.972] [minnow-Log] [info] Number of genes in the txp2gene file: 20174
[2021-11-15 16:45:03.972] [minnow-Log] [info] Parsing ../../analysis/simulation/minnow_splatter/C1/8000-5000-42/quants_mat_cols.txt
=======================Reading Splatter Matrix=====================
[2021-11-15 16:45:03.974] [minnow-Log] [info] 5000 cells are present
[2021-11-15 16:45:03.974] [minnow-Log] [info] Start parsing Splatter output
[2021-11-15 16:45:03.974] [minnow-Log] [info] Parsing ../../analysis/simulation/minnow_splatter/C1/8000-5000-42/quants_mat_rows.txt
[2021-11-15 16:45:04.140] [minnow-Log] [info] Debug:: resizing 5000
[2021-11-15 16:45:04.199] [minnow-Log] [info] Debug:: reading splatter
Debug::matrix size 5000 x 8000
cell count 5000
In Splatter: Number of genes processed : 8000==================Done Parsing Splatter Matrix==================
[2021-11-15 16:45:07.124] [minnow-Log] [info] Splatter matrix is read, with dimension 5000 x 8000

 !!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
[2021-11-15 16:45:07.124] [minnow-Log] [info] Parsing GFA file ../../analysis/simulation/minnow_index/dbg.gfa
[2021-11-15 16:45:07.124] [minnow-Log] [info] Start loading segments...
[2021-11-15 16:45:07.124] [minnow-Log] [info] Predicted overlap size: 101
[2021-11-15 16:45:08.133] [minnow-Log] [info] Saw 459039 segment lines, number of unitigs 353320
[2021-11-15 16:45:08.152] [minnow-Log] [info] Calculated overlap size 101
[2021-11-15 16:45:08.152] [minnow-Log] [info] Start loading paths...
[2021-11-15 16:45:10.817] [minnow-Log] [info] Done with GFA Equivalence class size 257719 Segment map size after filtering 105719 number of transcripts 105719
[2021-11-15 16:45:10.857] [minnow-Log] [info] The size of the gene id pool 20174
In Splatter: Number of genes processed : 8000[2021-11-15 16:45:11.441] [minnow-Log] [warning] No RSPD file provided

[2021-11-15 16:45:11.441] [minnow-Log] [info] The size of probability Vector 129
SPLATTER MODE: After loading bfh prob size 8000
In Splatter: Number of cells processed : 5000[2021-11-15 16:50:14.772] [minnow-Log] [info] Done parsing matrix
[2021-11-15 16:50:14.772] [minnow-Log] [info] Checking for 0 expressed cells

[2021-11-15 16:50:14.772] [minnow-Log] [info]
0 emptyCell size
[2021-11-15 16:50:14.783] [minnow-Log] [info] 10X whitelist file extern/3M-february-2018.txt
[2021-11-15 16:50:14.810] [minnow-Log] [info] CBList.size(): 5000
[2021-11-15 16:50:14.810] [minnow-Log] [info] Number of cells 5000
[2021-11-15 16:50:14.810] [minnow-Log] [info] Number of whiteListedCells 5000
[2021-11-15 16:50:14.810] [minnow-Log] [info] Number of noisyCells 0
[2021-11-15 16:50:14.810] [minnow-Log] [info] Number of Doublets 0
[2021-11-15 16:50:14.812] [minnow-Log] [info] Dumping the truth related files to the disk, this can take considerable time with respect to size, you can avoid it by --noDump
[2021-11-15 16:50:17.360] [minnow-Log] [info] Truth files dumped
PRINTING DEBUG: POOL_SIZE 1048576

[2021-11-15 16:50:22.781] [minnow-Log] [info] Starting Minnow....
[2021-11-15 16:50:22.781] [minnow-Log] [info] Number of cells to be processed 5000
[2021-11-15 16:50:22.781] [minnow-Log] [info] Generating from the De-Bruijn graph provided
Less number of reads are produced that the truth matrix

hiraksarkar commented 2 years ago

Hi Baraa,

I should have added this check earlier, but the error appears to come from duplicated gene names in the input file. Could you please recheck how you are generating these files? If some part of the minnow README is causing this, I would be happy to correct it.

That said, I could not reproduce the error when there are no duplicated genes, so please let me know. For reference, the rows file below has 8000 entries but only 4316 unique gene names:

(r-packs) ➜  minnow_example git:(minnow-velocity) ✗ cut -f1 out/splatter/C1/8000-5000-42/quants_mat_rows.txt | wc -l
8000
(r-packs) ➜  minnow_example git:(minnow-velocity) ✗ cut -f1 out/splatter/C1/8000-5000-42/quants_mat_rows.txt | sort | uniq | wc -l
4316
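
The counts above show that duplicates exist but not which names repeat. A small shell helper along the following lines could list them. This is only a sketch: the function name `list_duplicate_genes` is hypothetical, and it assumes gene names sit in the first tab-separated column and contain no whitespace.

```shell
# Hypothetical helper (not part of minnow): print each gene name that
# occurs more than once in column 1 of a tab-separated rows file,
# followed by a tab and its number of occurrences.
list_duplicate_genes() {
    cut -f1 "$1" | sort | uniq -c | awk '$1 > 1 { print $2 "\t" $1 }'
}

# Example call, using the path from the commands above:
# list_duplicate_genes out/splatter/C1/8000-5000-42/quants_mat_rows.txt | head
```

If the function prints nothing, every gene name in the file is unique.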

Regards, Hirak

hiraksarkar commented 2 years ago

@baraaorabi I am closing the issue for now; please reopen it if the problem persists.