MGI-tech-bioinformatics / DNBelab_C_Series_HT_scRNA-analysis-software

An open source and flexible pipeline to analysis high-throughput DNBelab C Series single-cell RNA datasets
MIT License
52 stars 20 forks source link

dnbc4rna run报错- data - ERROR Command failed with exit code -6 #55

Closed ZiggeyQi closed 5 hours ago

ZiggeyQi commented 5 months ago

技术团队的老师您好,不好意思打扰到您,我想用dnbc4tools rna multi运行多个样品时报错如下: Traceback (most recent call last): File "/home/liunyw/miniforge3/envs/DNBC4tools/bin/dnbc4tools", line 8, in sys.exit(main()) File "/home/liunyw/miniforge3/envs/DNBC4tools/lib/python3.8/site-packages/dnbc4tools/dnbc4tools.py", line 58, in main args.func(args) File "/home/liunyw/miniforge3/envs/DNBC4tools/lib/python3.8/site-packages/dnbc4tools/rna/multi.py", line 86, in multi Multi_list(args).run() File "/home/liunyw/miniforge3/envs/DNBC4tools/lib/python3.8/site-packages/dnbc4tools/rna/multi.py", line 33, in run cDNAr1 = get_abspath(lst[1].split(';')[0]) IndexError: list index out of range 但是我发现在我设定的output文件中已经成功生成了单个样品dnbc4tools rna run的代码,查看后也觉得没有问题,所以我就直接运行了output文件中生成的samplename.sh文件,生成代码如下: /home/liunyw/miniforge3/envs/DNBC4tools/bin/dnbc4rna run --name H1 --cDNAfastq1 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003117TL_L01_38_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003122TL_L01_38_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003117TL_L01_39_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003122TL_L01_39_1.fq.gz --cDNAfastq2 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003117TL_L01_38_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003122TL_L01_38_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003117TL_L01_39_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003122TL_L01_39_2.fq.gz --oligofastq1 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/oligo-6788-1-231129/DP8480003322TL_L01_62_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/oligo-6788-2-231129/DP8480003322TL_L01_63_1.fq.gz --oligofastq2 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/oligo-6788-1-231129/DP8480003322TL_L01_62_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/oligo-6788-2-231129/DP8480003322TL_L01_63_2.fq.gz --genomeDir /home/liunyw/qhs_data/genome/mm10_hsv --threads 60 在运行samplename.sh脚本后,发现报错如下: The chemistry(darkreaction) automatically determined in cDNA : darkreaction The chemistry(darkreaction) automatically determined in oligoR1 : darkreaction The chemistry(darkreaction) automatically determined in oligoR2 : darkreaction 2024-01-13 19:32:53 Processing cDNA library barcodes and aligning. 2024-01-13 19:32:53 Processing oligo library barcodes. 2024-01-13 22:38:45 Annotating BAM files. 2024-01-13 22:39:33,942 - data - ERROR Command failed with exit code -6 2024-01-13 22:39:33,942 - data - ERROR 2024-01-13 22:39:32.796 W process: stoi terminate called after throwing an instance of 'std::runtime_error' what(): Failed makeOverlapDetector! Please check annotation file format. 2024-01-13 22:39:34,040 - data - ERROR Command failed with exit code 1 2024-01-13 22:39:34,041 - data - ERROR [E::hts_open_format] Failed to open file "/home/liunyw/qhs_data/matrix/output/H1/01.data/final.bam" : No such file or directory samtools sort: can't open "/home/liunyw/qhs_data/matrix/output/H1/01.data/final.bam": No such file or directory Error: Cannot find input file or dir /home/liunyw/qhs_data/matrix/output/H1/01.data/final_sorted.bam Error: Cannot find input file or dir /home/liunyw/qhs_data/matrix/output/H1/02.count/filter_feature.h5ad 2024-01-13 22:40:01 Summarizing results and generating a report. Error: Cannot find input file or dir /home/liunyw/qhs_data/matrix/output/H1/02.count/cutoff.csv Analysis Finished Elapsed Time: 3 hours 7 minutes 37 seconds 最终只生成了01.data和04.report文件夹,但是04.report文件夹里是空的,01.data文件夹内容如下: drwxrwxr-x. 2 liunyw liunyw 4.0K Jan 13 22:39 . drwxrwxr-x. 5 liunyw liunyw 65 Jan 13 22:40 .. -rw-rw-r--. 1 liunyw liunyw 55M Jan 13 22:38 cDNA_barcode_counts_raw.txt -rw-rw-r--. 1 liunyw liunyw 392 Jan 13 19:32 cDNAin1 -rw-rw-r--. 1 liunyw liunyw 392 Jan 13 19:32 cDNAin2 -rw-rw-r--. 1 liunyw liunyw 509 Jan 13 19:32 cDNA_para -rw-rw-r--. 1 liunyw liunyw 391 Jan 13 22:38 cDNA_sequencing_report.csv -rw-rw-r--. 1 liunyw liunyw 39M Jan 13 20:26 Index_barcode_counts_raw.txt -rw-rw-r--. 1 liunyw liunyw 12G Jan 13 20:26 Index_reads.fq.gz -rw-rw-r--. 1 liunyw liunyw 378 Jan 13 20:26 Index_sequencing_report.csv -rw-rw-r--. 1 liunyw liunyw 2.0K Jan 13 22:38 Log.final.out -rw-rw-r--. 1 liunyw liunyw 11K Jan 13 22:38 Log.out -rw-rw-r--. 1 liunyw liunyw 22K Jan 13 22:38 Log.progress.out -rw-rw-r--. 1 liunyw liunyw 773 Jan 13 19:32 oligo_para -rw-rw-r--. 1 liunyw liunyw 45M Jan 13 22:38 SJ.out.tab 在star比对过程中是有align.bam文件的,因为是晚上跑的不知道中途发生了什么,早上来就发现上面的报错,bam文件也不见了,于是我就去检查了star比对的log文件,也没有发现明显的问题。还需要说明一下的是,比对使用的是mouse和病毒基因组合在一起的fa文件,病毒的gtf文件也是我自己修改过的,在运行了tools mkgtf和rna mkref这两个代码后能在输出的geneinfo和transcriptsinfo文件中看到病毒的基因信息,但是我害怕我修改的gtf文件有问题,所以我就把star比对的log文件和病毒修改后的gtf文件都上传了,mouse的genome和gtf文件就是按quickstar里的链接下载的,辛苦老师您帮忙看看我这个报错是哪里的问题,该如何解决,谢谢 log和gtf文件 GCF_000859985.2_ViralProj15217_genomic_edited_gtf.txt Log.final.txt Log.progress.txt Log.txt

lishuangshuang0616 commented 5 months ago

Can you check the gtf name you entered?

ZiggeyQi commented 5 months ago

我上传gtf文件到github的时候,发现.gtf文件没办法上传,所有把后缀改成了.txt上传的。在运行程序过程中实际使用的gtf如下: drwxrwxr-x. 2 liunyw liunyw 83 Jan 13 17:25 . drwxrwxr-x. 4 liunyw liunyw 52 Jan 13 17:22 .. -rw-rw-r--. 1 liunyw liunyw 844M Jan 13 11:47 gencode.vM23_GCF_000859985.gtf -rw-rw-r--. 1 liunyw liunyw 2.6G Jan 13 11:51 GRCm38_NC_001806.fa mouse部分的gtf就是按参考链接下载的应该没有问题,下面这部分是我修改后的病毒的gtf形式: NC_001806.2 RefSeq gene 2741611001 2741615401 . + . gene_id "HHV1gp00p22"; gene_type "protein_coding"; gene_name "UL52"; db_xref "GeneID:2703423"; gbkey "Gene"; gene "UL52"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p22"; note "core gene"; NC_001806.2 RefSeq transcript 2741611001 2741614174 . + 0 gene_id "HHV1gp00p22"; transcript_id "unassigned_transcript_57"; gene_type "protein_coding"; gene_name "UL52"; gbkey "CDS"; gene "UL52"; locus_tag "HHV1gp00p22"; product "helicase-primase primase subunit"; protein_id "YP_009137128.1"; exon_number "1"; NC_001806.2 RefSeq exon 2741611001 2741614174 . + 0 gene_id "HHV1gp00p22"; transcript_id "unassigned_transcript_57"; gene_type "protein_coding"; gene_name "UL52"; gbkey "CDS"; gene "UL52"; locus_tag "HHV1gp00p22"; product "helicase-primase primase subunit"; protein_id "YP_009137128.1"; exon_number "1"; NC_001806.2 RefSeq CDS 2741611001 2741614174 . + 0 gene_id "HHV1gp00p22"; transcript_id "unassigned_transcript_57"; gene_type "protein_coding"; gene_name "UL52"; gbkey "CDS"; gene "UL52"; locus_tag "HHV1gp00p22"; product "helicase-primase primase subunit"; protein_id "YP_009137128.1"; exon_number "1"; NC_001806.2 RefSeq start_codon 2741611001 2741611003 . + 0 gene_id "HHV1gp00p22"; transcript_id "unassigned_transcript_57"; gene_type "protein_coding"; gene_name "UL52"; gbkey "CDS"; gene "UL52"; locus_tag "HHV1gp00p22"; product "helicase-primase primase subunit"; protein_id "YP_009137128.1"; exon_number "1"; NC_001806.2 RefSeq stop_codon 2741614175 2741614177 . + 0 gene_id "HHV1gp00p22"; transcript_id "unassigned_transcript_57"; gene_type "protein_coding"; gene_name "UL52"; gbkey "CDS"; gene "UL52"; locus_tag "HHV1gp00p22"; product "helicase-primase primase subunit"; protein_id "YP_009137128.1"; exon_number "1"; NC_001806.2 RefSeq gene 2741634053 2741635915 . + . gene_id "HHV1gp00p14"; gene_type "protein_coding"; gene_name "US1"; db_xref "GeneID:2703435"; gbkey "Gene"; gene "US1"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p14"; note "immediate early gene"; NC_001806.2 RefSeq transcript 2741634081 2741635912 . + . gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "mRNA"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; transcript_biotype "mRNA"; NC_001806.2 RefSeq exon 2741634081 2741634327 . + . gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; transcript_biotype "mRNA"; exon_number "1"; NC_001806.2 RefSeq exon 2741634496 2741635912 . + . gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; transcript_biotype "mRNA"; exon_number "2"; NC_001806.2 RefSeq transcript 2741634599 2741635858 . + 0 gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "CDS"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; protein_id "YP_009137136.1"; exon_number "2"; NC_001806.2 RefSeq exon 2741634599 2741635858 . + 0 gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "CDS"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; protein_id "YP_009137136.1"; exon_number "2"; NC_001806.2 RefSeq CDS 2741634599 2741635858 . + 0 gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "CDS"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; protein_id "YP_009137136.1"; exon_number "2"; NC_001806.2 RefSeq start_codon 2741634599 2741634601 . + 0 gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "CDS"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; protein_id "YP_009137136.1"; exon_number "2"; NC_001806.2 RefSeq stop_codon 2741635859 2741635861 . + 0 gene_id "HHV1gp00p14"; transcript_id "unassigned_transcript_66"; gene_type "protein_coding"; gene_name "US1"; gbkey "CDS"; gene "US1"; locus_tag "HHV1gp00p14"; product "regulatory protein ICP22"; protein_id "YP_009137136.1"; exon_number "2";

lishuangshuang0616 commented 5 months ago

所以你的gencode.vM23_GCF_000859985.gtf是小鼠和病毒合并在一起的gtf是吗

ZiggeyQi commented 5 months ago

是的,基因组也是合在一起的,我的测序样本里有mouse的转录本也有病毒的转录本,我想区分哪些是病毒感染的细胞,以及病毒感染后会对宿主细胞的转录造成什么影响,基于这个目的和情况,我在跑上游分析的时候应该把基因组和gtf合在一起还是分开呢,如果老师你方便的话可以邮件联系你吗,我快毕业了,这个数据有点着急分析,现在卡住了,很难受,急急急。

lishuangshuang0616 commented 5 months ago

合在一起是可以的

ZiggeyQi commented 5 months ago

好的,我这个报错也太奇怪了,感觉报错信息里也没提供啥信息,所以也不知道咋办,求求了,快救救孩子吧

lishuangshuang0616 commented 5 months ago

能将log目录内的txt发一下吗

ZiggeyQi commented 5 months ago

_bt_log.txt zh 20240113.txt 只有这两个文件

ZiggeyQi commented 5 months ago

老师,我忘了说明一个细节了,就是我在修改病毒的gtf文件的时候除了修改了最后一列的基因相关信息,我还把第三列和第四列的位置信息修改了,因为病毒基因组是在mouse基因组后面的,就在原来病毒基因注释位置的基础上加上了mouse基因组的总长度,这个会不会有影响,因为我的病毒gtf格式是NCBI的,我一开始不知道会有问题,所以就只改了病毒gtf中的gene_biotype为gene_type,然后直接合并跑的tools mkgtf和rna mkref程序,虽然没有报错,但是输出的geneinfo和transcriptsinfo文件里面少了很多病毒基因,但是仍然有一些病毒基因出现在了里面,后来修改了最后一列注释信息和位置注释信息后就病毒基因全都有了,之前没修改位置信息的时候也能对得上基因组吗,我这个修改位置注释信息是不是画蛇添足导致报错了。

lishuangshuang0616 commented 5 months ago

你的病毒基因组的长度不对太长了,你的长度是你的病毒基因组自己的染色体长度,不应该加上mouse基因组。每条染色体都是从1开始的,用病毒原有的gtf位置就行。

ZiggeyQi commented 5 months ago

确实是这个问题,我使用了原始的病毒基因组位置注释,然后跑下面这个代码没有报错,但是只输出了01.data和log文件夹,这次有bam输出了,但是其他该输出的文件夹都没有。 /home/liunyw/miniforge3/envs/DNBC4tools/bin/dnbc4rna run --name H1 --cDNAfastq1 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003117TL_L01_38_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003122TL_L01_38_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003117TL_L01_39_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003122TL_L01_39_1.fq.gz --cDNAfastq2 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003117TL_L01_38_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/cDNA-6788-1-231129/DP8480003122TL_L01_38_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003117TL_L01_39_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/cDNA-6788-2-231129/DP8480003122TL_L01_39_2.fq.gz --oligofastq1 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/oligo-6788-1-231129/DP8480003322TL_L01_62_1.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/oligo-6788-2-231129/DP8480003322TL_L01_63_1.fq.gz --oligofastq2 /home/liunyw/qhs_data/0.rawdata/H1/6788-1-231129/oligo-6788-1-231129/DP8480003322TL_L01_62_2.fq.gz,/home/liunyw/qhs_data/0.rawdata/H1/6788-2-231129/oligo-6788-2-231129/DP8480003322TL_L01_63_2.fq.gz --genomeDir /home/liunyw/qhs_data/genome/mm10_hsv --threads 60 log文件夹里共生成了两个log文件,其中一个是空的,另一个里面的内容是: 第一行是代码信息就没有复制了 2024-01-16 22:17:04,734 - data - INFO mapped_dnbs: 1685770

2024-01-17 03:46:16,039 - data - INFO 2024-1-16 20:48:24 ..... started STAR run Jan 16 20:48:24 ..... loading genome Jan 16 20:49:34 ..... started mapping mapped_dnbs: 2354089 Jan 17 03:45:29 ..... finished mapping Jan 17 03:45:58 ..... finished successfully 这样没有报错也没有别的输出的情况,可以根据输出的bam文件继续进行下面的分析吗 这是01.data文件内容 drwxrwxr-x. 2 liunyw liunyw 4.0K Jan 17 10:19 . drwxrwxr-x. 4 liunyw liunyw 44 Jan 16 20:48 .. -rw-rw-r--. 1 liunyw liunyw 102G Jan 17 03:45 Aligned.out.bam -rw-rw-r--. 1 liunyw liunyw 55M Jan 17 03:45 cDNA_barcode_counts_raw.txt -rw-rw-r--. 1 liunyw liunyw 392 Jan 16 20:48 cDNAin1 -rw-rw-r--. 1 liunyw liunyw 392 Jan 16 20:48 cDNAin2 -rw-rw-r--. 1 liunyw liunyw 509 Jan 16 20:48 cDNA_para -rw-rw-r--. 1 liunyw liunyw 391 Jan 17 03:45 cDNA_sequencing_report.csv -rw-rw-r--. 1 liunyw liunyw 5.6G Jan 17 10:29 final.bam -rw-rw-r--. 1 liunyw liunyw 39M Jan 16 22:16 Index_barcode_counts_raw.txt -rw-rw-r--. 1 liunyw liunyw 12G Jan 16 22:16 Index_reads.fq.gz -rw-rw-r--. 1 liunyw liunyw 378 Jan 16 22:16 Index_sequencing_report.csv -rw-rw-r--. 1 liunyw liunyw 2.0K Jan 17 03:45 Log.final.out -rw-rw-r--. 1 liunyw liunyw 11K Jan 17 03:46 Log.out -rw-rw-r--. 1 liunyw liunyw 48K Jan 17 03:45 Log.progress.out -rw-rw-r--. 1 liunyw liunyw 773 Jan 16 20:48 oligo_para -rw-rw-r--. 1 liunyw liunyw 45M Jan 17 03:45 SJ.out.tab

lishuangshuang0616 commented 5 months ago

还没有跑完吧

ZiggeyQi commented 5 months ago

好像是的,我想请教一下这个程序大概会跑多久呢,如果我用60个核,谢谢

lishuangshuang0616 commented 5 months ago

和数据量有关,一般500M reads大概7-8个小时

ZiggeyQi commented 5 months ago

好的,那我再等等,谢谢。

ZiggeyQi commented 5 months ago

这次使用病毒原始位置注释文件的合并gtf跑还是报错,报错没说是gtf形式有问题了,报错如下: 2024-01-18 00:10:20,273 - data - ERROR Command failed with exit code 255 2024-01-18 00:10:20,273 - data - ERROR Not exists bam file: /home/liunyw/qhs_data/matrix/output/H1/01.data/Aligned.out.bam

2024-01-18 00:10:20,363 - data - ERROR Command failed with exit code 1 2024-01-18 00:10:20,364 - data - ERROR [E::hts_open_format] Failed to open file "/home/liunyw/qhs_data/matrix/output/H1/01.data/final.bam" : No such file or directory samtools sort: can't open "/home/liunyw/qhs_data/matrix/output/H1/01.data/final.bam": No such file or directory 程序运行过程中一会儿有bam文件一会儿又没有bam文件,final.bam也是出现过的后面又不见了,不知道为啥,然后我尝试就只使用mouse的基因组和gtf文件,然后就没有报错,是能正常跑完所有流程的。麻烦帮我看看,谢谢

lishuangshuang0616 commented 5 months ago

看下log目录下的最新文件内容,同时发送下你的病毒fastq文件,你之前比对是可以完成的,现在无法完成比较奇怪,不确定是改了什么信息。

ZiggeyQi commented 5 months ago

周末还帮忙解决bug,辛苦啦👍 前面的报错我已经初步解决了,谢谢你的耐心回复,我正准备回复你来着,简述一下出现问题的原因,给需要的人提供参考。 最前面的报错提示gtf格式有问题,然后按你的建议使用原始的病毒基因组位置注释信息,这样运行之后比对完成了,但后续的分析报错,但是这第二次的报错里面没有再提示有gtf格式的问题,但是大概率还是病毒基因组和gtf的问题,因为我只用mouse的基因组和gtf是能跑通的。 接着我把病毒gtf的gene_id和transcript_id的命名按genecode的格式改成了ENSG和ENST的格式,然后我偶然发现病毒的基因组文件和mouse的基因组文件格式不太一样,mouse的是每行60个碱基,病毒的是每行80个碱基,并且染色体命名那里格式是‘>NC_001806.2 Human herpesvirus 1 strain 17, complete genome’,于是我就把染色体信息这里改成了符合genecode格式的‘NC_001806.2 NC_001806.2’,同时把病毒基因组序列也改成每行60个碱基,最大的问题应该是出在了染色体信息注释这里出错了,以至于后面对不上,以上就是我更改的地方,然后重新用改好的基因组和gtf文件跑所有程序就没有报错了。 最后,再次感谢技术团队的付出,谢谢。