Closed yhg926 closed 4 years ago
Hi Huiguang,
The command option you are referring to is deprecated. Please use the following command to give splatter matrix, I will fix the read me to update the current commands.
src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir splatter_dir --PCR 6 -r fasta_file.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa dbg.gfa -w 737K-august-2016.txt --bfh bfh_gencode_v32.txt --custom --gencode
You can generate the gfa file and modified fasta file by running the following command
src/minnow index -r human_txome.transcripts.fa -k 101 -f 20 --tmpdir tmp -p 10 -o ../minnow_ind
Also don't use the conda version, as it is severely outdated, and use the refactor
branch.
Thanks
Referencing https://github.com/COMBINE-lab/minnow/issues/15 here, we don't support normal mode explicitly in the present version, and advice you to use splatter mode.
Here is one example folder on how the data should be provided https://github.com/COMBINE-lab/minnow/tree/minnow-velocity/data/test_data/splatter_data_two_cells
It runs! but how can i obtain this file: bfh_gencode_v32.txt ?
hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --bfh bfh_gencode_v32.txt --custom --gencode
Input directory /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 12:11:51.910] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 12:11:52.564] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 0 transcripts because either short or not present in reference
[2020-09-25 12:11:52.564] [minnow-Log] [info] Number of genes in the txp2gene file: 0
=======================Reading Splatter Matrix=====================
[2020-09-25 12:11:52.564] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 12:11:52.565] [minnow-Log] [info] 1000 cells are present
[2020-09-25 12:11:52.565] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 12:11:52.565] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 12:11:52.565] [minnow-Log] [info] 1000 genes are present
[2020-09-25 12:11:52.567] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 1000==================Done Parsing Splatter Matrix==================
[2020-09-25 12:11:52.598] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 1000
!!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453 trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221 trSegmentMap size 121764 transcript map size 123247
[2020-09-25 12:11:54.510] [minnow-Log] [info] The size of the gene id pool 1
In Splatter: Number of genes processed : 1000[2020-09-25 12:11:54.512] [minnow-Log] [warning] Skipping 1000 genes, gene pool size of de-Bruijn graph 1
[2020-09-25 12:11:54.519] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 0
RSPD :::
bfh_gencode_v32.txt does not exists
It is generally obtained by running alevin, but you can use the statically stored file which is learned from pbmc data, in that case you need to run it like following
src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb
countProb_pbmc_4k.txt --custom --gencode
where countProb_pbmc_4k.txt
can be obtained from https://github.com/COMBINE-lab/minnow/blob/refactor/data/hg/countProb_pbmc_4k.txt
I must mention generally, splatter does not give the gene names, so if you want to give your gene names (--custom
flag is used for that reason), you have to modify /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/quants_mat_rows.txt
to give genenames that tally with quants_mat_rows.txt
It runs but aborted.
hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt --custom --gencode
Input directory /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 12:39:32.792] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 12:39:33.452] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 12:39:33.589] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 12:39:33.589] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 12:39:33.589] [minnow-Log] [info] 1000 cells are present
[2020-09-25 12:39:33.589] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 12:39:33.589] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 12:39:33.589] [minnow-Log] [info] 1000 genes are present
[2020-09-25 12:39:33.591] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 1000==================Done Parsing Splatter Matrix==================
[2020-09-25 12:39:33.622] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 1000
!!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453 trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221 trSegmentMap size 121764 transcript map size 123247
[2020-09-25 12:39:35.554] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 1000[2020-09-25 12:39:35.568] [minnow-Log] [warning] Skipping 1000 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 12:39:35.574] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 0
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
what(): stod
Aborted (core dumped)
The log says that the Truncated the matrix to dimension 1000 x 0
which means all of the genes are dropped, it can be caused by many reasons, e.g if there is a name mismatch between your transcript to gene file and the gene names in quants_mat_rows.txt
etc. If you can provide me with the input data, I can look into it tomorrow or during the weekend.
I just modified the R code to generated splatter_out, change num_genes from 1000 to 125665(same number with the transcript numer in mouse_transcriptome.fa ) , but still aborted, the Truncated the matrix dimension now is 1000 x 47320. Here is my R code:
BiocManager::install("splatter")
library(splatter)
library(scater)
set.seed(1)
num_genes<- 125665
num_cells <- 1000
sim <- splatSimulate(
nGenes=num_genes,
batchCells=num_cells,
verbose = FALSE
)
out_dir<-"/data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out"
write.table(rownames(sim), file= file.path(out_dir, "quants_mat_rows.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")
=================error information======
hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt
Input directory /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 13:01:07.951] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 13:01:08.611] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 13:01:08.747] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 13:01:08.747] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 13:01:08.747] [minnow-Log] [info] 1000 cells are present
[2020-09-25 13:01:08.747] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 13:01:08.747] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 13:01:08.754] [minnow-Log] [info] 125665 genes are present
[2020-09-25 13:01:09.092] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 125665==================Done Parsing Splatter Matrix==================
[2020-09-25 13:01:12.399] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 125665
!!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453 trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221 trSegmentMap size 121764 transcript map size 123247
[2020-09-25 13:01:14.317] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 125665[2020-09-25 13:01:15.056] [minnow-Log] [warning] Skipping 78345 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 13:01:15.909] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 47320
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
what(): stod
Aborted (core dumped)
I just modified the R code to generated splatter_out, change num_genes from 1000 to 125665(same number with the transcript numer in mouse_transcriptome.fa ) , but still aborted, the Truncated the matrix dimension now is 1000 x 47320.
I think there is a confusion here, the matrix that you should provide to minnow from splatter should contain gene-to-cell count and not transcript-to-cell counts.
Can you create a relatively small matrix for 1000 cells and 1000 genes, and share the files? I can try debugging then, I think there is a problem with the convention here. Not sure what's the exact bug.
Here are my input files: quants_mat_rows.txt quants_mat_cols.txt quants_mat.csv.zip
As I mentioned to you https://github.com/COMBINE-lab/minnow/issues/14#issuecomment-698713331 and https://github.com/COMBINE-lab/minnow/issues/14#issuecomment-698717650, quants_mat_rows.txt
contains no real gene name, you have to provide your gene names that are real.
$ head -4 quants_mat_rows.txt
Gene1
Gene2
Gene3
Gene4
Gene5
I just did this:
cut -f2 mouse_t2g.tsv > splatter_out/quants_mat_rows.txt
so all gene name match the mouse_t2g.tsv: and run minnow again, but still aborted with same information:
hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt
Input directory /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 13:24:28.310] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 13:24:28.975] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 13:24:29.110] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 13:24:29.110] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 13:24:29.111] [minnow-Log] [info] 1000 cells are present
[2020-09-25 13:24:29.111] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 13:24:29.111] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 13:24:29.121] [minnow-Log] [info] 125665 genes are present
[2020-09-25 13:24:29.467] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 125665==================Done Parsing Splatter Matrix==================
[2020-09-25 13:24:33.053] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 125665
!!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453 trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221 trSegmentMap size 121764 transcript map size 123247
[2020-09-25 13:24:34.977] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 125665[2020-09-25 13:24:35.740] [minnow-Log] [warning] Skipping 78345 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 13:24:36.588] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 47320
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
what(): stod
Aborted (core dumped)
Interesting, thanks for the log, I will try to reproduce this tomorrow.
Can you attach your mouse_t2g.tsv
file.
OK, it actually downloaded from your zenodo link: mouse_t2g.tsv.zip
Hi @yhg926,
I tried to debug the problem, the problems seem to be the fact that the gene names are not unique, as many transcripts can map to the same genes, the t2g will have more lines than genes. So below are the steps that you might run to get a fully functional example.
git clone --single-branch --branch minnow-velocity https://github.com/COMBINE-lab/minnow.git
cd minnow
mkdir build
cd build
cmake ..
make
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.transcripts.fa.gz
awk -F "\t" '$3 == "transcript" { print $9 }' <( zcat gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz ) | tr -d ";\"" | awk '{print $4"\t"$2}' > tx2gene.tsv
src/minnow index -r ../mouse/GRCm38.p6/gencode.vM25.transcripts.fa.gz -k 101 -f 20 --tmpdir tmp -p 10 -o minnow_ind
BiocManager::install("splatter")
library(splatter)
library(scater)
set.seed(1)
num_genes<- 1000
num_cells <- 1000
sim <- splatSimulate(
nGenes=num_genes,
batchCells=num_cells,
verbose = FALSE
)
out_dir<-"splatter_out"
write.table(rownames(sim), file= file.path(out_dir, "quants_mat_rows.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")
Assigning unique gene names (same number of unique gene names as your rows in the count matrix generated in splatter)
cut -f 2 ../mouse/GRCm38.p6/tx2gene.tsv | sort | uniq > ../mouse/GRCm38.p6/gene.list
shuf -n 1000 ../mouse/GRCm38.p6/gene.list > splatter_out/quants_mat_rows.txt
Now you can run minnow with the files generated above
src/minnow simulate --splatter-mode \
--g2t tx2gene.tsv \
--inputdir splatter_out \
--PCR 6 \
-r minnow_ind/ref_k101_fixed.fa \
-e 0.01 -p 10 -o minnow_splatter_out \
--dbg --gfa minnow_ind/dbg.gfa \
-w ../data/737K-august-2016.txt --countProb ../data/hg/countProb_pbmc_4k.txt --custom
Thank you very much, i have already generated the simulated reads following your steps. I am now wondering what kind of protocol the simulated reads use ? especially, what is the cell barcodes, UMI and the biological sequences of the generated simulated fastq files?
here are head 20 lines of the R1 and R2 fastq file, what is the cell barcodes, UMI and the biological sequences ? :
hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out$ head -20
hg_100_S1_L001_R1_001.fastq
@AAACCTGGTCTAGCGC:ENSMUST00000181274.1:1116:0:0
AAACCTGGTCTAGCGCGTTATCCTCA
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1357:1:1
AAACCTGGTCTAGCGCAAGGGACCTT
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1350:2:2
AAACCTGGTCTAGCGCGCTATGCACA
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1356:3:3
AAACCTGGTCTAGCGCTAACGTTTGG
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1341:4:4
AAACCTGGTCTAGCGCAAGGCCTTCT
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out$ head -20 hg_100_S1_L001_R2_001.fastq
@AAACCTGGTCTAGCGC:ENSMUST00000181274.1:1116:0:0
GGACAAGGTTAAAGGGTTTTGTTTTGTTTTTCTTTTTTTCTTTCAGTAGTTAACCACCAGATTTGAACCAAAAAACATTTAGTGTTTAAAAAAAAAAAGA
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1357:1:1
TGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCCCCTGACTCCTGCT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1350:2:2
AGTCACATGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCNCCTGAC
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1356:3:3
ATGTGGGGGGCTCACCTATGGCACAACCCCAGGGCGCCANATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCCCCTGACTCCTGC
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1341:4:4
CCGTGGAGGAGTCACATGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATG
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
another question: I noticed you are using this reference: minnow_ind/ref_k101_fixed.fa, so what is its differences with the original reference gencode.vM25.transcripts.fa.gz ?, the simulated reads were generated from ref_k101_fixed.fa not gencode.vM25.transcripts.fa.gz , right ?
Hi,
The left end hg_100_S1_L001_R1_001.fastq
contains the cell barcode and the UMI (16 + 10 bp), and the right end hg_100_S1_L001_R2_001.fastq
contains the sequences.
Regarding the second question ref_k101_fixed.fa
is the filtered version of gencode.vM25.transcripts.fa.gz
, removing transcripts having a sequence length lesser than 101bp, (which is read length + 1).
what do the last three number means in R2 fastq head line? for example, in below fastq, the three number 1239:4147885:99998. It seems 1239 is offset from the transcript start, and 99998 is the read number? but i can not figure out what is the meaning of 4147885.
hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out100$ tail -8 hg_100_S1_L001_R2_001.fastq
@TGCACCTTCGGCGCTA:ENSMUST00000195555.1:1239:4147885:99998
ACCCAGCGCACGGCCCCGGGCAACCTTCGCCCCCTCCCGAGGCTCTGCCCTGCCGGGATGGCACGGAATCCAACCAGCCCACTGAGCTCCTAGGGGAGGT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@TGCACCTTCGGCGCTA:ENSMUST00000171265.7:2690:1823702:99999
TGGGTTGGTGTGAATGTGTGTCACCTACACATTCTAACAGAAGGTAACAATACGTTAGCAGTGACATATTCAGTATTACTCATTTTGTAGTGCCAGGGAT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Hi,
That's a great question, 4147885:99998
this is generally used for other debugging, and is not directly usable by end user.
4147885
is the index of read in the order they are dumped by the thread (not necessarily in the same order they appear in the file, since it's a multi-threaded code). 99998
is the id number of read before doing PCR.
However, these two numbers are of no use for other analyses.
PS: If you can please format the input before pasting to the issue that would be really helpful for me to inspect the log from phone or other devices
Thanks
Yes, for details please check https://academic.oup.com/bioinformatics/article/35/14/i136/5529127.
As the query is solved, I would close this issue, feel free to open issue for other questions.
Ok I think your last query seems like another question, so I am reopening.
Can you please elaborate what you mean by not mapping to 3' end of the reference. These reads are sampled from end of the transcripts.
I just validated an example file, all the reads seems to come from the 3' end of the transcripts. Let me know what is expected in your case, that you are not seeing.
I just validated an example file, all the reads seems to come from the 3' end of the transcripts. Let me know what is expected in your case, that you are not seeing.
Sorry, I thought i did not post the right question. The information i want to know is what is the mean distance and its variation between R2 reads and the 3' end of the reference transcripts. For example, i noticed this read : @AAACCTGAGGGTATCG:ENSMUST00000192913.1:897:24:24 it start from position 897 of ENSMUST00000192913.1, and ENSMUST00000192913.1 has length of 1506 so the interval between this read and end of its reference is 509
and this read: @AAACCTGAGGGTATCG:ENSMUST00000192847.5:673:249:249 start from position 673 of ENSMUST00000192847.5 , and ENSMUST00000192847.5 has length of 1662. so the interval between this read and end of its reference is 889.
i can compute the interval variations by myself, i am not quite familiar with single cell RNA-seq technique, but i think the variation is quite big, did these reflect the situation in the real sequencing data?
Hi @yhg926,
The maximum fragment length is set to 1000
according to Illumina fragmentation protocol, so the read sequences are sampled in order to imitate the multi-mapping rate rather than to imitate any kind of positional distribution. However, if you can elaborate on what you exactly want, we can try to modify minnow in order to fit your need.
If you want to see positional distribution please follow these steps,
To check the edit distance distribution and the distribution of distance from end
please run the following
src/validate validate -f <( gunzip -c hg_100_S1_L001_R2_001.fastq.gz ) -t minnow_ind/ref_k101_fixed.fa -o splatter_out/edit.dist -m 10
-m
option is for edit distance, if you use a PCR cycle of 6, it's unlikely to see a sequence that is more than 6 edit distance apart. If you use a higher PCR cycle then use higher values to find the edit-distance distribution with arbitrary high edit distances,
The edit distance distribution will be followed by a distribution of the distance from end
. A snapshot of the output from the above command is as follows,
With the following format
edit distance <number of reads with that edit distance>
.
.
--
distance from end <number of reads with that distance from end>
.
.
Example,
➜ build git:(minnow-velocity) ✗ head -20 splatter_out/edit.dist
0 606398
1 304145
2 75481
3 12264
4 1537
5 157
6 18
--
100 6522
101 6490
102 6637
103 6604
104 6161
105 6131
106 5891
107 5887
108 5213
109 5448
110 5405
111 5295
...
The lines below --
show the positional distribution of distance from end
.
Hi @yhg926 Do you have any more questions/doubts, otherwise, I would close this issue.
no further questions. thank you!
Sent from my iPhone
On Oct 4, 2020, at 11:37 AM, Hirak Sarkar notifications@github.com wrote:
Hi @yhg926 Do you have any more questions/doubts, otherwise, I would close this issue.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Hi, I want to what the option "-m /example_data/splatter_data_100_Cells_50K_Genes_pbmc/" means? I cannot find the explanation on this option. And how can i generate myself splatter_data_100_Cells_50K_Genes_pbmc folder? Many thanks