SpatialTranscriptomicsResearch / st_pipeline

ST Pipeline contains the tools and scripts needed to process and analyze the raw files generated with the Spatial Transcriptomics method in FASTQ format.

I've met some Error in `Start filtering raw reads` #114

Closed BioAIEvolu closed 4 years ago

BioAIEvolu commented 4 years ago

Hi~

When I try to use ST_analysis to process the FASTQ files SRR3382371_R1.fastq and SRR3382371_R2.fastq, the classic dataset from the paper "Visualization and analysis of gene expression in tissue sections by spatial transcriptomics", I get an error that I can't figure out.

real 0m7.029s
user 0m0.701s
sys 0m0.177s

[1]+ Exit 1 time st_pipeline_run.py --expName test1 --ids ../ids/1000L2_barcodes.txt --ref-map ../index --log-file log.txt --output-folder ./results --ref-annotation ../reference/Homo_sapiens.GRCh38.98.gtf --mapping-threads 16 --temp-folder ./tmp ../FASTQ/fastq_dump_result/SRR3382371_R1.fastq ../FASTQ/fastq_dump_result/SRR3382371_R2.fastq

log.txt shows the step at which the error is raised:

INFO:STPipeline:Allowing 0 mismatches when removing homopolymers
INFO:STPipeline:Starting the pipeline: 2019-12-02 04:10:21.549805
INFO:STPipeline:Start filtering raw reads 2019-12-02 04:10:21.551349

I guessed that the read ID format in the FASTQ files might be the problem, but however I change the IDs, or even the file names, the same error is raised.

Before the change:

(base) [root@host-192-168-1-8 test]# head -n4 ../FASTQ/fastq_dump_result/SRR3382371.1_1.fastq.bak
@SRR3382371.1.1 1 length=31
NCATGTGTCGTTTCAAGATGGGCCTTATTTT
+SRR3382371.1.1 1 length=31
AAAAFFFFFFFFFFFFFFFFFFFFFFFAFF

(base) [root@host-192-168-1-8 test]# head -n4 ../FASTQ/fastq_dump_result/SRR3382371.1_2.fastq.bak
@SRR3382371.1.1 1 length=120
GGTGATAGCTGGTTACCCAAAAAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGG
+SRR3382371.1.1 1 length=120
AAAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFF7FFFFFFFFAAFFFFFF<FFFFFFFFFFFFFFFFFFF.FFFF7FFFFFFFFFFFFAFFAF<F<F

After the change:

(base) [root@host-192-168-1-8 test]# head -n4 ../FASTQ/fastq_dump_result/SRR3382371_R1.fastq
@SRR3382371.1.1 1:N
NCATGTGTCGTTTCAAGATGGGCCTTATTTT
+SRR3382371.1.1 1:N
AAAAFFFFFFFFFFFFFFFFFFFFFFFAFF

(base) [root@host-192-168-1-8 test]# head -n4 ../FASTQ/fastq_dump_result/SRR3382371_R2.fastq
@SRR3382371.1.1 2:N
GGTGATAGCTGGTTACCCAAAAAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGG
+SRR3382371.1.1 2:N
AAAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFF7FFFFFFFFAAFFFFFF<FFFFFFFFFFFFFFFFFFF.FFFF7FFFFFFFFFFFFAFFAF<F<F


I hope to receive your reply as soon as possible. thx :)

jfnavarro commented 4 years ago

Is ./tmp a valid path?

BioAIEvolu commented 4 years ago

Yes. By default the program puts the temp folder in /tmp, but my root partition is too small to run st_pipeline_run.py with the default setting, so I pointed --temp-folder at a disk with more space.

BioAIEvolu commented 4 years ago

And I had already created the tmp folder in the working directory, because I knew it would raise an error if the path did not exist.

JSheng2023 commented 4 years ago

Have you tried using the absolute path of the tmp folder instead of a relative path?

jfnavarro commented 4 years ago

As @sjtlzh123 suggested, I would try absolute paths.
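
For anyone trying this suggestion, here is a minimal sketch of turning the relative folder arguments into absolute ones before building the command line. The helper name `to_absolute` and the dictionary are made up for illustration; only the flag names and the relative paths come from the command posted above, and nothing here is part of st_pipeline itself.

```python
import os


def to_absolute(path, base=None):
    """Return a normalized absolute path.

    Relative paths are resolved against `base` (default: the current
    working directory). Absolute paths are only normalized.
    """
    base = os.getcwd() if base is None else base
    return os.path.abspath(os.path.join(base, os.path.expanduser(path)))


# Hypothetical mapping that mirrors the relative paths in the posted command.
relative_args = {
    "--temp-folder": "./tmp",
    "--output-folder": "./results",
    "--ref-map": "../index",
}
absolute_args = {flag: to_absolute(value) for flag, value in relative_args.items()}

# Print the flags with their absolute values so they can be pasted into the command.
for flag, value in absolute_args.items():
    print(flag, value)
```

The printed values depend on the directory the script is run from, so it should be run from the same working directory as the pipeline command.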

BioAIEvolu commented 4 years ago

I've checked the FASTQ files and figured out that the problem was an error in converting the SRA file to FASTQ. I converted it to FASTQ again and reran the program, and it now seems to work.

min_reads_unique_event: 0.0
avergage_gene_feature: 900.1198801198801
average_reads_feature: 2070.7432567432566
ST Pipeline, run completed!
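
A broken SRA-to-FASTQ conversion like the one mentioned above can often be caught before running the pipeline. The following is a rough sketch of such a check, assuming plain-text (uncompressed) FASTQ with exactly four lines per record; the function name `check_fastq` is hypothetical, and this is not the validation st_pipeline performs internally.

```python
def check_fastq(path, max_records=None):
    """Scan a FASTQ file and return (complete_records_read, problems).

    Assumes uncompressed, four-line records: header, sequence,
    separator, quality.
    """
    problems = []
    n = 0
    with open(path) as handle:
        while max_records is None or n < max_records:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                break  # clean end of file
            if not record[3]:
                problems.append(f"record {n + 1}: truncated (fewer than 4 lines)")
                break
            header, seq, sep, qual = (line.rstrip("\r\n") for line in record)
            n += 1
            if not header.startswith("@"):
                problems.append(f"record {n}: header does not start with '@'")
            if not sep.startswith("+"):
                problems.append(f"record {n}: separator does not start with '+'")
            if len(seq) != len(qual):
                problems.append(
                    f"record {n}: sequence length {len(seq)} != quality length {len(qual)}"
                )
    return n, problems
```

Calling `check_fastq` on both the R1 and R2 files and comparing the two record counts gives a quick indication of whether the pair is consistent; an empty problem list only means that these basic checks passed.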

The run produced two output files:

 141M Dec  3 16:44 test1_reads.bed
  81M Dec  3 16:45 test1_stdata.tsv

Is the result OK? It seems to contain too many zeros.

$ less -S test1_stdata.tsv
        ENSG00000227232 ENSG00000279457 ENSG00000228463 ENSG00000236679 ENSG00000225972 ENSG00000225630 ENSG00000237973 ENSG00000229344 ENSG00000248527 ENSG00000198744 ENSG
11x15   1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     2.0     0.0     0.0     0.0     0.0     0.0
13x21   1.0     1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     2.0     0.0     0.0     0.0     0.0     0.0
17x17   0.0     1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0

I also think it is different from the provided dataset:

$ ll -h
8.8M Dec  3 22:16 Rep1_MOB_count_matrix-1.tsv

$ less -S Rep1_MOB_count_matrix-1.tsv
        Nop58   Arl6ip4 Lix1    Chrm1   Nap1l1  Kat6a   Fam134c Lrpprc  Srgap3  Slc1a3  Pde4dip Sestd1  0610007N19Rik   Pigu    Man1b1  2010300C02Rik   Sox2    Cbx5    Sh3g
17.002x8.987    1       5       4       2       2       1       8       1       3       7       3       13      1       3       2       3       18      8       4       3
17.889x8.992    0       1       2       2       4       8       0       0       10      15      3       3       0       0       5       11      2       7       3       4
19.855x8.988    1       0       0       1       2       0       0       0       2       27      6       7       0       0       0       0       2       1       0       2

So is something wrong with my run?

JSheng2023 commented 4 years ago

I can only tell that my data matrix also contains lots of zeroes. As long as major cell markers and pathway genes can be detected, it should be fine for me.
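
To put a number on "lots of zeroes" and to check whether specific genes were detected, something like the sketch below could be used. It assumes the matrix is a tab-separated file with spots as rows, genes as columns, and a header whose first cell is empty, as the `less -S` output above suggests; the function name `summarize_counts` and its return keys are made up for illustration.

```python
import csv


def summarize_counts(path, marker_genes=()):
    """Return the fraction of zero entries and total counts for chosen genes.

    Assumes a rectangular TSV: first column holds spot labels,
    first row holds gene names after an empty leading cell.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        genes = next(reader)[1:]  # skip the empty corner cell
        totals = [0.0] * len(genes)
        zeros = 0
        cells = 0
        for row in reader:
            if not row:
                continue  # ignore blank lines
            values = [float(v) for v in row[1:]]
            for i, value in enumerate(values):
                totals[i] += value
            zeros += sum(1 for value in values if value == 0)
            cells += len(values)
    per_gene = dict(zip(genes, totals))
    return {
        "zero_fraction": zeros / cells if cells else 0.0,
        # None marks a requested gene that is not a column in the matrix.
        "markers": {gene: per_gene.get(gene) for gene in marker_genes},
    }
```

For example, `summarize_counts("test1_stdata.tsv", marker_genes=["ENSG00000227232"])` would report the overall zero fraction together with the total count for that column. Which genes count as meaningful markers depends on the tissue and is not something this sketch can decide.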

BioAIEvolu commented 4 years ago

OK, thank you! I guess I may have confused the st_pipeline output file with the input file of st_analysis.