JiekaiLab / scTE


"IndexError: list index out of range" when using scTE_build #95

Closed BitterWood closed 3 weeks ago

BitterWood commented 3 weeks ago

Thanks for the nice tool; I have successfully finished several tasks with scTE. This time, however, I wanted to build a custom reference with the command scTE_build -te ../4_anno/mm10.erv.bed -gene ../4_anno/mm10.chr.gtf -o custome, and I got the following error:

Namespace(genefile=['../4_anno/mm10.chr.gtf'], genome='other', info=<function info at 0x7f5e0b4e4950>, mode='exclusive', out=None, tefile=['../4_anno/mm10.erv.bed'])
INFO : Building the scTE genome annotation index... 2024-05-30 14:07:54
Traceback (most recent call last):
  File "/home/user/scTE/bin/scTE_build", line 468, in <module>
    main()
  File "/home/user/scTE/bin/scTE_build", line 461, in main
    genomeIndex(args.genome,args.mode,tefile,genefile, args.out,'No path','No path')
  File "/home/user/scTE/bin/scTE_build", line 127, in genomeIndex
    gls.load_list(clean)
  File "/home/user/scTE/bin/../scTE/miniglbase/genelist.py", line 1472, in load_list
    list_to_load[0]
IndexError: list index out of range

This issue looks similar to #79, which is still open. Since other people have hit the same problem, I am opening this issue. Sorry for repeating the question.

The first few lines of my GTF and BED files are as follows:

GTF:

chrY ncbiRefSeq.2021-04-23 transcript 90836782 90843932 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90836782 90837279 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; exon_number "5"; exon_id "XR_001782923.2.5"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90838777 90839986 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; exon_number "4"; exon_id "XR_001782923.2.4"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90840377 90840536 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; exon_number "3"; exon_id "XR_001782923.2.3"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90842233 90842312 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; exon_number "2"; exon_id "XR_001782923.2.2"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90843818 90843932 . - . gene_id "LOC108168645"; transcript_id "XR_001782923.2"; exon_number "1"; exon_id "XR_001782923.2.1"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 transcript 90836782 90843932 . - . gene_id "LOC108168645"; transcript_id "XR_001782926.2"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90836782 90837279 . - . gene_id "LOC108168645"; transcript_id "XR_001782926.2"; exon_number "5"; exon_id "XR_001782926.2.5"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90838777 90839594 . - . gene_id "LOC108168645"; transcript_id "XR_001782926.2"; exon_number "4"; exon_id "XR_001782926.2.4"; gene_name "LOC108168645";
chrY ncbiRefSeq.2021-04-23 exon 90840377 90840536 . - . gene_id "LOC108168645"; transcript_id "XR_001782926.2"; exon_number "3"; exon_id "XR_001782926.2.3"; gene_name "LOC108168645";

BED:

chr1 3008500 3009169 RLTR26B_MM 519.7 +
chr1 3011641 3011780 RLTR25B 101.0 -
chr1 3011775 3012534 RLTR25B 533.0 -
chr1 3026805 3027112 ERVB7_1-LTR_MM 361.3 +
chr1 3028230 3028592 RLTR14 212.8 -
chr1 3029686 3029815 MLT1A1 19.5 +
chr1 3028940 3029702 RMER17A 633.5 -
chr1 3030069 3030157 RLTR13E 59.9 -
chr1 3030069 3030199 RLTR13B4 48.5 -
chr1 3031358 3031710 IAPLTR1_Mm 398.8 -

Many thanks for your help!

jphe commented 3 weeks ago

Could you please confirm if your BED file is a 6-column file delimited by tabs (\t)?

BitterWood commented 3 weeks ago

> Could you please confirm if your BED file is a 6-column file delimited by tabs (\t)?

Thanks for your quick reply. I checked this with:

head -n 6 mm10.erv.bed | awk -F'\t' '{if(NF==6) print "Line", NR, "has 6 columns"; else print "Line", NR, "does not have 6 columns"}'

and received the following output:

Line 1 has 6 columns
Line 2 has 6 columns
Line 3 has 6 columns
Line 4 has 6 columns
Line 5 has 6 columns
Line 6 has 6 columns

As the output shows, my BED file appears to be a 6-column file delimited by tabs (\t).
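The awk check above only looks at the first six lines. As a minimal sketch (not part of scTE), the same test can be applied to every line of the file in Python. The function name find_bad_bed_lines and the extra start/end/strand checks are illustrative assumptions, not requirements stated by scTE.

```python
from typing import Iterable, List, Tuple


def find_bad_bed_lines(lines: Iterable[str], expected_cols: int = 6) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for each line that is not a valid 6-column BED record."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Skip blank lines and the usual BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) != expected_cols:
            problems.append((lineno, f"{len(fields)} tab-separated columns"))
            continue
        start, end, strand = fields[1], fields[2], fields[5]
        if not (start.isdigit() and end.isdigit()):
            problems.append((lineno, "start/end are not non-negative integers"))
        elif int(start) >= int(end):
            problems.append((lineno, "start is not smaller than end"))
        elif strand not in {"+", "-", "."}:
            problems.append((lineno, f"unexpected strand {strand!r}"))
    return problems


# Two records from this thread; the second one uses spaces instead of tabs.
sample = [
    "chr1\t3008500\t3009169\tRLTR26B_MM\t519.7\t+\n",
    "chr1 3011641 3011780 RLTR25B 101.0 -\n",
]
for lineno, reason in find_bad_bed_lines(sample):
    print(f"line {lineno}: {reason}")  # prints: line 2: 1 tab-separated columns
```

To check a real file, pass an open file handle, for example find_bad_bed_lines(open("mm10.erv.bed")). An empty result means every record passed these checks.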

BitterWood commented 3 weeks ago

> Could you please confirm if your BED file is a 6-column file delimited by tabs (\t)?

Hello @jphe, I re-checked my BED and GTF files. When I run scTE_build with my BED file and the Gene.gtf provided by scTE, the command succeeds, but when I run it with my GTF file and the TE.bed provided by scTE, I get the error. So the problem appears to be in the GTF file. I then ran scTE_build with my BED file and GTF files downloaded separately from

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gtf.gz

or

https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz

The GENCODE GTF file works, whereas the UCSC one fails. I suppose the two GTF files differ in some way that affects how information is extracted from them, but I could not figure out what. Could you please offer some help?

The first few lines of these two GTF files are as follows:

GENCODE GTF:

##description: evidence-based annotation of the mouse genome (GRCm38), version M21 (Ensembl 96)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-03-27
chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 HAVANA exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 ENSEMBL gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842.1"; gene_type "snRNA"; gene_name "Gm26206"; level 3;
chr1 ENSEMBL transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842.1"; transcript_id "ENSMUST00000082908.1"; gene_type "snRNA"; gene_name "Gm26206"; transcript_type "snRNA"; transcript_name "Gm26206-201"; level 3; transcript_support_level "NA"; tag "basic";

UCSC GTF:

chrM ncbiRefSeq.2021-04-23 transcript 15356 15422 . - . gene_id "TrnP"; transcript_id "rna-TrnP"; gene_name "TrnP";
chrM ncbiRefSeq.2021-04-23 exon 15356 15422 . - . gene_id "TrnP"; transcript_id "rna-TrnP"; exon_number "1"; exon_id "rna-TrnP.1"; gene_name "TrnP";
chrM ncbiRefSeq.2021-04-23 transcript 15289 15355 . + . gene_id "TrnT"; transcript_id "rna-TrnT"; gene_name "TrnT";
chrM ncbiRefSeq.2021-04-23 exon 15289 15355 . + . gene_id "TrnT"; transcript_id "rna-TrnT"; exon_number "1"; exon_id "rna-TrnT.1"; gene_name "TrnT";
chrM ncbiRefSeq.2021-04-23 transcript 14145 15288 . + . gene_id "CYTB"; transcript_id "NP_904340.1"; gene_name "CYTB";
chrM ncbiRefSeq.2021-04-23 exon 14145 15288 . + . gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";
chrM ncbiRefSeq.2021-04-23 CDS 14145 15288 . + 0 gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";
chrM ncbiRefSeq.2021-04-23 start_codon 14145 14147 . + 0 gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";
chrM ncbiRefSeq.2021-04-23 transcript 14071 14139 . - . gene_id "TrnE"; transcript_id "rna-TrnE"; gene_name "TrnE";
chrM ncbiRefSeq.2021-04-23 exon 14071 14139 . - . gene_id "TrnE"; transcript_id "rna-TrnE"; exon_number "1"; exon_id "rna-TrnE.1"; gene_name "TrnE";

I would really appreciate your help. Many thanks again.
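One way to look for the difference between the two GTF files is to count which feature types (column 3) and which attribute keys (column 9) each file contains. In the excerpts above, the GENCODE file has "gene" records and a gene_type attribute, while the UCSC file shows neither. Whether scTE_build depends on either of these is not confirmed in this thread. The sketch below is a stand-alone helper. The name summarize_gtf is hypothetical and the attribute regex is a simplification of the GTF format.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches `key "value";` or `key value;` pairs in the GTF attribute column.
ATTR_KEY = re.compile(r'(\w+)\s+(?:"[^"]*"|[^\s;]+)\s*(?:;|$)')


def summarize_gtf(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count feature types and attribute keys over all records of a GTF file."""
    features: Counter = Counter()
    attr_keys: Counter = Counter()
    for raw in lines:
        if raw.startswith("#") or not raw.strip():
            continue  # header comments and blank lines
        fields = raw.rstrip("\r\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        features[fields[2]] += 1
        # Count each key once per record, even if it repeats (e.g. `tag`).
        attr_keys.update(set(ATTR_KEY.findall(fields[8])))
    return features, attr_keys


# One record from each excerpt in this thread, rebuilt with tab separators.
gencode = "\t".join(["chr1", "HAVANA", "gene", "3073253", "3074322", ".", "+", ".",
                     'gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2;'])
ucsc = "\t".join(["chrM", "ncbiRefSeq.2021-04-23", "transcript", "15356", "15422", ".", "-", ".",
                  'gene_id "TrnP"; transcript_id "rna-TrnP"; gene_name "TrnP";'])

for label, record in (("GENCODE", gencode), ("UCSC", ucsc)):
    feats, keys = summarize_gtf([record])
    print(label, sorted(feats), sorted(keys))
```

Running the helper on both complete files and comparing the two summaries should show which feature types or attributes are present in one file but missing from the other.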

jphe commented 3 weeks ago

I'm not familiar with the UCSC GTF. The simplest way is to convert the UCSC GTF to GENCODE style.
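What "GENCODE style" has to include for scTE_build is not spelled out in this thread. As one partial, hedged sketch, the code below adds the gene-level records that the GENCODE excerpt has and the UCSC excerpt lacks, by merging the spans of all transcripts that share a gene_id on the same chromosome and strand. The function name add_gene_records is hypothetical. The sketch assumes the input has no "gene" records yet, it drops header comments, and it does not add gene_type because the UCSC file carries no biotype information.

```python
import re
from typing import Dict, Iterable, List, Tuple

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def add_gene_records(lines: Iterable[str]) -> List[str]:
    """Return the GTF records with one synthesized `gene` record before each gene."""
    records = [l.rstrip("\r\n") for l in lines if l.strip() and not l.startswith("#")]

    # First pass: merge transcript spans per (chromosome, strand, gene_id).
    spans: Dict[Tuple[str, str, str], list] = {}
    for rec in records:
        f = rec.split("\t")
        if len(f) < 9 or f[2] != "transcript":
            continue
        gene = GENE_ID.search(f[8])
        if not gene:
            continue
        key = (f[0], f[6], gene.group(1))
        start, end = int(f[3]), int(f[4])
        if key in spans:
            spans[key][0] = min(spans[key][0], start)
            spans[key][1] = max(spans[key][1], end)
        else:
            name = GENE_NAME.search(f[8])
            spans[key] = [start, end, f[1], name.group(1) if name else gene.group(1)]

    # Second pass: emit the gene record just before the first record of that gene.
    out: List[str] = []
    emitted = set()
    for rec in records:
        f = rec.split("\t")
        if len(f) >= 9:
            gene = GENE_ID.search(f[8])
            key = (f[0], f[6], gene.group(1)) if gene else None
            if key in spans and key not in emitted:
                start, end, source, name = spans[key]
                attrs = f'gene_id "{key[2]}"; gene_name "{name}";'
                out.append("\t".join([f[0], source, "gene", str(start), str(end),
                                      ".", f[6], ".", attrs]))
                emitted.add(key)
        out.append(rec)
    return out
```

The result can be written back with "\n".join(add_gene_records(open("mm10.ncbiRefSeq.gtf"))). If scTE_build still fails on the converted file, the missing piece is likely something other than the gene records.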

BitterWood commented 3 weeks ago

> I'm not familiar with UCSC gtf, the simplest way is to convert the UCSC gtf to GENCODE style.

To be honest, all my tasks so far have been based on the UCSC GTF, and my TE BED is based on the UCSC annotation as well. This time I finished my task with the GENCODE GTF, and I will check whether that causes any problems in further analysis.

Thank you for your time. Good luck with your work!