COMBINE-lab / minnow

10 stars 2 forks source link

what is -m option ? #14

Closed yhg926 closed 4 years ago

yhg926 commented 4 years ago

Hi, I want to what the option "-m /example_data/splatter_data_100_Cells_50K_Genes_pbmc/" means? I cannot find the explanation on this option. And how can i generate myself splatter_data_100_Cells_50K_Genes_pbmc folder? Many thanks

src/minnow simulate --splatter-mode --g2t /mnt/scratch6/avi/data/cgat/references/metadata/hg_t2g.tsv -m ../example_data/splatter_data_100_Cells_50K_Genes_pbmc/ --PCR 7 -r /mnt/scratch6/avi/data/cgat/references/txome/hg_transcriptome.fasta -e 0.01 -p 25 -o <out_dir> --useWeibull --testUniqness --reverseUniqness
hiraksarkar commented 4 years ago

Hi Huiguang,

The command option you are referring to is deprecated. Please use the following command to give splatter matrix, I will fix the read me to update the current commands.

src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir splatter_dir --PCR 6 -r fasta_file.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa dbg.gfa  -w 737K-august-2016.txt --bfh bfh_gencode_v32.txt --custom --gencode

You can generate the gfa file and modified fasta file by running the following command

src/minnow index -r human_txome.transcripts.fa -k 101 -f 20 --tmpdir tmp -p 10 -o ../minnow_ind

Also don't use the conda version, as it is severely outdated, and use the refactor branch.

Thanks

hiraksarkar commented 4 years ago

Referencing https://github.com/COMBINE-lab/minnow/issues/15 here, we don't support normal mode explicitly in the present version, and advice you to use splatter mode.

hiraksarkar commented 4 years ago

Here is one example folder on how the data should be provided https://github.com/COMBINE-lab/minnow/tree/minnow-velocity/data/test_data/splatter_data_two_cells

yhg926 commented 4 years ago

It runs! but how can i obtain this file: bfh_gencode_v32.txt ?

hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --bfh bfh_gencode_v32.txt --custom --gencode
Input directory  /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 12:11:51.910] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 12:11:52.564] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 0 transcripts because either short or not present in reference
[2020-09-25 12:11:52.564] [minnow-Log] [info] Number of genes in the txp2gene file: 0
=======================Reading Splatter Matrix=====================
[2020-09-25 12:11:52.564] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 12:11:52.565] [minnow-Log] [info] 1000 cells are present
[2020-09-25 12:11:52.565] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 12:11:52.565] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 12:11:52.565] [minnow-Log] [info] 1000 genes are present
[2020-09-25 12:11:52.567] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 1000==================Done Parsing Splatter Matrix==================
[2020-09-25 12:11:52.598] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 1000

 !!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453    trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221    trSegmentMap size 121764    transcript map size 123247
[2020-09-25 12:11:54.510] [minnow-Log] [info] The size of the gene id pool 1
In Splatter: Number of genes processed : 1000[2020-09-25 12:11:54.512] [minnow-Log] [warning] Skipping 1000 genes, gene pool size of de-Bruijn graph 1
[2020-09-25 12:11:54.519] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 0
RSPD :::
bfh_gencode_v32.txt does not exists
hiraksarkar commented 4 years ago

It is generally obtained by running alevin, but you can use the statically stored file which is learned from pbmc data, in that case you need to run it like following

 src/minnow simulate --splatter-mode --g2t tx2gene.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb 
 countProb_pbmc_4k.txt --custom --gencode

where countProb_pbmc_4k.txt can be obtained from https://github.com/COMBINE-lab/minnow/blob/refactor/data/hg/countProb_pbmc_4k.txt

hiraksarkar commented 4 years ago

I must mention generally, splatter does not give the gene names, so if you want to give your gene names (--custom flag is used for that reason), you have to modify /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/quants_mat_rows.txt to give genenames that tally with quants_mat_rows.txt

yhg926 commented 4 years ago

It runs but aborted.

hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt --custom --gencode
Input directory  /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 12:39:32.792] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 12:39:33.452] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 12:39:33.589] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 12:39:33.589] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 12:39:33.589] [minnow-Log] [info] 1000 cells are present
[2020-09-25 12:39:33.589] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 12:39:33.589] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 12:39:33.589] [minnow-Log] [info] 1000 genes are present
[2020-09-25 12:39:33.591] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 1000==================Done Parsing Splatter Matrix==================
[2020-09-25 12:39:33.622] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 1000

 !!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453    trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221    trSegmentMap size 121764    transcript map size 123247
[2020-09-25 12:39:35.554] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 1000[2020-09-25 12:39:35.568] [minnow-Log] [warning] Skipping 1000 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 12:39:35.574] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 0
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stod
Aborted (core dumped)
hiraksarkar commented 4 years ago

The log says that the Truncated the matrix to dimension 1000 x 0 which means all of the genes are dropped, it can be caused by many reasons, e.g if there is a name mismatch between your transcript to gene file and the gene names in quants_mat_rows.txt etc. If you can provide me with the input data, I can look into it tomorrow or during the weekend.

yhg926 commented 4 years ago

I just modified the R code to generated splatter_out, change num_genes from 1000 to 125665(same number with the transcript numer in mouse_transcriptome.fa ) , but still aborted, the Truncated the matrix dimension now is 1000 x 47320. Here is my R code:

BiocManager::install("splatter")
library(splatter)
library(scater)
set.seed(1)
num_genes<- 125665
num_cells <- 1000
sim <- splatSimulate( 
  nGenes=num_genes, 
  batchCells=num_cells, 
  verbose = FALSE
)
out_dir<-"/data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out"
write.table(rownames(sim), file= file.path(out_dir, "quants_mat_rows.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")
=================error information======
hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt
Input directory  /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 13:01:07.951] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 13:01:08.611] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 13:01:08.747] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 13:01:08.747] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 13:01:08.747] [minnow-Log] [info] 1000 cells are present
[2020-09-25 13:01:08.747] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 13:01:08.747] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 13:01:08.754] [minnow-Log] [info] 125665 genes are present
[2020-09-25 13:01:09.092] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 125665==================Done Parsing Splatter Matrix==================
[2020-09-25 13:01:12.399] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 125665

 !!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453    trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221    trSegmentMap size 121764    transcript map size 123247
[2020-09-25 13:01:14.317] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 125665[2020-09-25 13:01:15.056] [minnow-Log] [warning] Skipping 78345 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 13:01:15.909] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 47320
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stod
Aborted (core dumped)
hiraksarkar commented 4 years ago

I just modified the R code to generated splatter_out, change num_genes from 1000 to 125665(same number with the transcript numer in mouse_transcriptome.fa ) , but still aborted, the Truncated the matrix dimension now is 1000 x 47320.

I think there is a confusion here, the matrix that you should provide to minnow from splatter should contain gene-to-cell count and not transcript-to-cell counts.

Can you create a relatively small matrix for 1000 cells and 1000 genes, and share the files? I can try debugging then, I think there is a problem with the convention here. Not sure what's the exact bug.

yhg926 commented 4 years ago

Here are my input files: quants_mat_rows.txt quants_mat_cols.txt quants_mat.csv.zip

hiraksarkar commented 4 years ago

As I mentioned to you https://github.com/COMBINE-lab/minnow/issues/14#issuecomment-698713331 and https://github.com/COMBINE-lab/minnow/issues/14#issuecomment-698717650, quants_mat_rows.txt contains no real gene name, you have to provide your gene names that are real.

$ head -4 quants_mat_rows.txt

Gene1
Gene2
Gene3
Gene4
Gene5
yhg926 commented 4 years ago

I just did this:

 cut -f2 mouse_t2g.tsv > splatter_out/quants_mat_rows.txt

so all gene name match the mouse_t2g.tsv: and run minnow again, but still aborted with same information:

hgyi@sustc-HG:~/tools/minnow/build$ src/minnow simulate --splatter-mode --g2t /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_t2g.tsv --inputdir /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/ --PCR 6 -r /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa -e 0.01 -p 2 -o minnow_out --dbg --gfa /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome_debruijn.gfa -w 737K-august-2016.txt --countProb ../data/hg/geneLebelProb_pbmc_4k.txt
Input directory  /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out/
Reference Fasta /data/hgyi/work/RNAseq_quantTool/scsimulation/mouse_transcriptome.fa
Number of PCR cycles 6
Erorr rate 0.01
Numeber of threads 2
[2020-09-25 13:24:28.310] [minnow-Log] [info] Reading reference sequences ...
replaced 3 non-ACGT nucleotides with random nucleotides
Transcript file is read
[2020-09-25 13:24:28.975] [minnow-Log] [info] Reference sequence is loaded ...
Skipped 2418 transcripts because either short or not present in reference
[2020-09-25 13:24:29.110] [minnow-Log] [info] Number of genes in the txp2gene file: 48303
=======================Reading Splatter Matrix=====================
[2020-09-25 13:24:29.110] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_cols.txt
[2020-09-25 13:24:29.111] [minnow-Log] [info] 1000 cells are present
[2020-09-25 13:24:29.111] [minnow-Log] [info] Start parsing Splatter output
[2020-09-25 13:24:29.111] [minnow-Log] [info] Parsing /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat_rows.txt
[2020-09-25 13:24:29.121] [minnow-Log] [info] 125665 genes are present
[2020-09-25 13:24:29.467] [minnow-Log] [info] Reading count matrix /data/hgyi/work/RNAseq_quantTool/scsimulation/splatter_out//quants_mat.csv
In Splatter: Number of genes processed : 125665==================Done Parsing Splatter Matrix==================
[2020-09-25 13:24:33.053] [minnow-Log] [info] Splatter matrix is read, with dimension 1000 x 125665

 !!!!!!!!!!!!!!!!!! IN DBG MODE !!!!!!!!!!!!!!!!!!!!!!!
Start loading segments...
Saw 494217 contigs in total, unitigMap.size(): 372453
Max contig id 2914051
Starting to load paths
Overlap size 101
Done with GFA
Equivalece class size 372453    trSegmentMap size 0 transcript map size 123247
[DEBUG]-----0
Done Filtering
Equivalece class size 288221    trSegmentMap size 121764    transcript map size 123247
[2020-09-25 13:24:34.977] [minnow-Log] [info] The size of the gene id pool 47320
In Splatter: Number of genes processed : 125665[2020-09-25 13:24:35.740] [minnow-Log] [warning] Skipping 78345 genes, gene pool size of de-Bruijn graph 47320
[2020-09-25 13:24:36.588] [minnow-Log] [info] Truncated the matrix to dimension 1000 x 47320
RSPD :::
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stod
Aborted (core dumped)
hiraksarkar commented 4 years ago

Interesting, thanks for the log, I will try to reproduce this tomorrow.

hiraksarkar commented 4 years ago

Can you attach your mouse_t2g.tsv file.

yhg926 commented 4 years ago

OK, it actually downloaded from your zenodo link: mouse_t2g.tsv.zip

hiraksarkar commented 4 years ago

Hi @yhg926,

I tried to debug the problem, the problems seem to be the fact that the gene names are not unique, as many transcripts can map to the same genes, the t2g will have more lines than genes. So below are the steps that you might run to get a fully functional example.

Getting the right branch of minnow

git clone --single-branch --branch minnow-velocity https://github.com/COMBINE-lab/minnow.git
cd minnow
mkdir build
cd build
cmake ..
make

Making consistent transcript to gene name and compatible De-Bruijn graphs

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.transcripts.fa.gz

extracting the t2g mapping

awk -F "\t" '$3 == "transcript" { print $9 }' <( zcat gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz ) | tr -d ";\"" | awk '{print $4"\t"$2}' > tx2gene.tsv

making right de-Bruijn graph

src/minnow index -r ../mouse/GRCm38.p6/gencode.vM25.transcripts.fa.gz -k 101 -f 20 --tmpdir tmp -p 10 -o minnow_ind

Generating the splatter matrix (I followed your example just making a realistic gene number)

BiocManager::install("splatter")
library(splatter)
library(scater)
set.seed(1)
num_genes<- 1000
num_cells <- 1000
sim <- splatSimulate( 
  nGenes=num_genes, 
  batchCells=num_cells, 
  verbose = FALSE
)
out_dir<-"splatter_out"
write.table(rownames(sim), file= file.path(out_dir, "quants_mat_rows.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(colnames(sim), file= file.path(out_dir, "quants_mat_cols.txt"), quote=FALSE, col.names=FALSE, row.names=FALSE)
write.table(counts(sim), file= file.path(out_dir, "quants_mat.csv"), quote=FALSE, col.names=FALSE, row.names=FALSE, sep=",")

Assigning unique gene names (same number of unique gene names as your rows in the count matrix generated in splatter)

cut -f 2 ../mouse/GRCm38.p6/tx2gene.tsv | sort | uniq > ../mouse/GRCm38.p6/gene.list
shuf -n 1000 ../mouse/GRCm38.p6/gene.list > splatter_out/quants_mat_rows.txt

Now you can run minnow with the files generated above

Run minnow (assuming inside build)

src/minnow simulate --splatter-mode \
--g2t tx2gene.tsv \
--inputdir splatter_out \
--PCR 6 \
-r minnow_ind/ref_k101_fixed.fa \
-e 0.01 -p 10 -o minnow_splatter_out \
--dbg --gfa minnow_ind/dbg.gfa \
-w ../data/737K-august-2016.txt --countProb ../data/hg/countProb_pbmc_4k.txt --custom 
yhg926 commented 4 years ago

Thank you very much, i have already generated the simulated reads following your steps. I am now wondering what kind of protocol the simulated reads use ? especially, what is the cell barcodes, UMI and the biological sequences of the generated simulated fastq files?

here are head 20 lines of the R1 and R2 fastq file, what is the cell barcodes, UMI and the biological sequences ? :

hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out$ head -20 
hg_100_S1_L001_R1_001.fastq
@AAACCTGGTCTAGCGC:ENSMUST00000181274.1:1116:0:0
AAACCTGGTCTAGCGCGTTATCCTCA
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1357:1:1
AAACCTGGTCTAGCGCAAGGGACCTT
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1350:2:2
AAACCTGGTCTAGCGCGCTATGCACA
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1356:3:3
AAACCTGGTCTAGCGCTAACGTTTGG
+
NNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1341:4:4
AAACCTGGTCTAGCGCAAGGCCTTCT
+
NNNNNNNNNNNNNNNNNNNNNNNNNN

hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out$ head -20 hg_100_S1_L001_R2_001.fastq
@AAACCTGGTCTAGCGC:ENSMUST00000181274.1:1116:0:0
GGACAAGGTTAAAGGGTTTTGTTTTGTTTTTCTTTTTTTCTTTCAGTAGTTAACCACCAGATTTGAACCAAAAAACATTTAGTGTTTAAAAAAAAAAAGA
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1357:1:1
TGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCCCCTGACTCCTGCT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1350:2:2
AGTCACATGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCNCCTGAC
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1356:3:3
ATGTGGGGGGCTCACCTATGGCACAACCCCAGGGCGCCANATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATGGCCCCTGACTCCTGC
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@AAACCTGGTCTAGCGC:ENSMUST00000081945.4:1341:4:4
CCGTGGAGGAGTCACATGTGGGGGCCTCACCTATGGCACAACCCCAGGGCGCCAGATTGCCTCTGGACCCTCTGTCACTGGGGGCAGCATCACAGTGATG
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 
yhg926 commented 4 years ago

another question: I noticed you are using this reference: minnow_ind/ref_k101_fixed.fa, so what is its differences with the original reference gencode.vM25.transcripts.fa.gz ?, the simulated reads were generated from ref_k101_fixed.fa not gencode.vM25.transcripts.fa.gz , right ?

hiraksarkar commented 4 years ago

Hi,

The left end hg_100_S1_L001_R1_001.fastq contains the cell barcode and the UMI (16 + 10 bp), and the right end hg_100_S1_L001_R2_001.fastq contains the sequences.

Regarding the second question ref_k101_fixed.fa is the filtered version of gencode.vM25.transcripts.fa.gz, removing transcripts having a sequence length lesser than 101bp, (which is read length + 1).

yhg926 commented 4 years ago

what do the last three number means in R2 fastq head line? for example, in below fastq, the three number 1239:4147885:99998. It seems 1239 is offset from the transcript start, and 99998 is the read number? but i can not figure out what is the meaning of 4147885.

hgyi@sustc-HG:/data/hgyi/work/RNAseq_quantTool/scsimulation/minnow_splatter_out100$ tail -8 hg_100_S1_L001_R2_001.fastq
@TGCACCTTCGGCGCTA:ENSMUST00000195555.1:1239:4147885:99998
ACCCAGCGCACGGCCCCGGGCAACCTTCGCCCCCTCCCGAGGCTCTGCCCTGCCGGGATGGCACGGAATCCAACCAGCCCACTGAGCTCCTAGGGGAGGT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
@TGCACCTTCGGCGCTA:ENSMUST00000171265.7:2690:1823702:99999
TGGGTTGGTGTGAATGTGTGTCACCTACACATTCTAACAGAAGGTAACAATACGTTAGCAGTGACATATTCAGTATTACTCATTTTGTAGTGCCAGGGAT
+
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
hiraksarkar commented 4 years ago

Hi,

That's a great question, 4147885:99998 this is generally used for other debugging, and is not directly usable by end user. 4147885 is the index of read in the order they are dumped by the thread (not necessarily in the same order they appear in the file, since it's a multi-threaded code). 99998 is the id number of read before doing PCR.

However, these two numbers are of no use for other analyses.

PS: If you can please format the input before pasting to the issue that would be really helpful for me to inspect the log from phone or other devices

Thanks

hiraksarkar commented 4 years ago

Yes, for details please check https://academic.oup.com/bioinformatics/article/35/14/i136/5529127.

As the query is solved, I would close this issue, feel free to open issue for other questions.

hiraksarkar commented 4 years ago

Ok I think your last query seems like another question, so I am reopening.

hiraksarkar commented 4 years ago

Can you please elaborate what you mean by not mapping to 3' end of the reference. These reads are sampled from end of the transcripts.

hiraksarkar commented 4 years ago

I just validated an example file, all the reads seems to come from the 3' end of the transcripts. Let me know what is expected in your case, that you are not seeing.

yhg926 commented 4 years ago

I just validated an example file, all the reads seems to come from the 3' end of the transcripts. Let me know what is expected in your case, that you are not seeing.

Sorry, I thought i did not post the right question. The information i want to know is what is the mean distance and its variation between R2 reads and the 3' end of the reference transcripts. For example, i noticed this read : @AAACCTGAGGGTATCG:ENSMUST00000192913.1:897:24:24 it start from position 897 of ENSMUST00000192913.1, and ENSMUST00000192913.1 has length of 1506 so the interval between this read and end of its reference is 509

and this read: @AAACCTGAGGGTATCG:ENSMUST00000192847.5:673:249:249 start from position 673 of ENSMUST00000192847.5 , and ENSMUST00000192847.5 has length of 1662. so the interval between this read and end of its reference is 889.

i can compute the interval variations by myself, i am not quite familiar with single cell RNA-seq technique, but i think the variation is quite big, did these reflect the situation in the real sequencing data?

hiraksarkar commented 4 years ago

Hi @yhg926,

The maximum fragment length is set to 1000 according to Illumina fragmentation protocol, so the read sequences are sampled in order to imitate the multi-mapping rate rather than to imitate any kind of positional distribution. However, if you can elaborate on what you exactly want, we can try to modify minnow in order to fit your need.

If you want to see positional distribution please follow these steps, To check the edit distance distribution and the distribution of distance from end please run the following

src/validate validate -f <( gunzip -c hg_100_S1_L001_R2_001.fastq.gz ) -t minnow_ind/ref_k101_fixed.fa -o splatter_out/edit.dist -m 10

-m option is for edit distance, if you use a PCR cycle of 6, it's unlikely to see a sequence that is more than 6 edit distance apart. If you use a higher PCR cycle then use higher values to find the edit-distance distribution with arbitrary high edit distances,

The edit distance distribution will be followed by a distribution of the distance from end. A snapshot of the output from the above command is as follows,

With the following format

edit distance <number of reads with that edit distance>
.
.
--
distance from end <number of reads with that distance from end>
.
.

Example,

➜  build git:(minnow-velocity) ✗ head -20 splatter_out/edit.dist
0       606398
1       304145
2       75481
3       12264
4       1537
5       157
6       18
--
100     6522
101     6490
102     6637
103     6604
104     6161
105     6131
106     5891
107     5887
108     5213
109     5448
110     5405
111     5295
...

The lines below -- show the positional distribution of distance from end.

hiraksarkar commented 4 years ago

Hi @yhg926 Do you have any more questions/doubts, otherwise, I would close this issue.

yhg926 commented 4 years ago

no further questions. thank you!

Sent from my iPhone

On Oct 4, 2020, at 11:37 AM, Hirak Sarkar notifications@github.com wrote:

 Hi @yhg926 Do you have any more questions/doubts, otherwise, I would close this issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.