CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/

rgt-THOR IndexError: list index out of range #239

Open SophieEhres opened 1 year ago

SophieEhres commented 1 year ago

Hi all, I've seen a previous issue with the same error that was resolved, but I still get this error. I have aligned FASTQ files to the CHM13 genome (chm13v2.0.fa) using STAR.

    head /Users/ehresms/computational/genomes/human/CHM13/fasta/chm13v2.0.fa
    >chr1 CP068277.2 Homo sapiens isolate CHM13 chromosome 1
    Caccctaaaccctaacccctaaccctaaccctaaccctaaccctaaccctaacccctaaaccctaaccctaaccctaacc
    ctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccct
    aaccctaaccctaacccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
    accctaaccctaaccctaaccctaaccctaaccctaaccctaacccaaccctaaccctaaccctaaccctaaccctaacc
    ctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccct
    aaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaa
    ccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacc
    ctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccct
    aaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaa
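For reference, here is a minimal sketch of listing the contig names that a FASTA file declares. It assumes the convention used by `samtools faidx`, where the name is the first whitespace-delimited token after `>` and the rest of the header line is a description. The helper `fasta_contig_names` is hypothetical and not part of RGT, and the `chr2` header in the example is made up for illustration.

```python
from typing import Iterable, List


def fasta_contig_names(lines: Iterable[str]) -> List[str]:
    """Return contig names taken from FASTA header lines.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no name
                names.append(fields[0])
    return names


# In-memory example; for a real file, pass an open file handle instead.
example = [
    ">chr1 CP068277.2 Homo sapiens isolate CHM13 chromosome 1",
    "caccctaaaccctaacccc",
    ">chr2 illustrative description only",
    "acgt",
]
print(fasta_contig_names(example))  # prints ['chr1', 'chr2']
```

Under that assumption, the header shown above yields the name `chr1`, which matches the names in the STAR index and the BAM header.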

I am using the chromosome lengths computed in the STAR index.

    head /Users/ehresms/computational/genomes/human/CHM13/STARgenome/chrNameLength.txt
    chr1 248387328
    chr2 242696752
    chr3 201105948
    chr4 193574945
    chr5 182045439
    chr6 172126628
    chr7 160567428
    chr8 146259331
    chr9 150617247
    chr10 134758134

Here's the top of the header of one of my BAM files:

    samtools view -H /Users/ehresms/computational/IP/SE_SE_MCF10A_IP_T2T_clean_sorted.bam
    @HD VN:1.6 SO:coordinate
    @SQ SN:chr1 LN:248387328
    @SQ SN:chr2 LN:242696752
    @SQ SN:chr3 LN:201105948
    @SQ SN:chr4 LN:193574945
    @SQ SN:chr5 LN:182045439
    @SQ SN:chr6 LN:172126628
    @SQ SN:chr7 LN:160567428
    @SQ SN:chr8 LN:146259331
    @SQ SN:chr9 LN:150617247
    @SQ SN:chr10 LN:134758134
    @SQ SN:chr11 LN:135127769
    @SQ SN:chr12 LN:133324548
    @SQ SN:chr13 LN:113566686
    @SQ SN:chr14 LN:101161492
    @SQ SN:chr15 LN:99753195
    @SQ SN:chr16 LN:96330374
    @SQ SN:chr17 LN:84276897
    @SQ SN:chr18 LN:80542538
    @SQ SN:chr19 LN:61707364
    @SQ SN:chr20 LN:66210255
    @SQ SN:chr21 LN:45090682
    @SQ SN:chr22 LN:51324926
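To rule out a mismatch between the BAM header and the chromosome sizes file, the two can be compared programmatically. The following is a rough sketch, assuming the header text comes from `samtools view -H` and the sizes file has a name and a length per line. The function names `parse_chrom_sizes`, `parse_sam_header` and `compare` are hypothetical helpers, not RGT functions.

```python
from typing import Dict, List


def parse_chrom_sizes(text: str) -> Dict[str, int]:
    """Parse 'name<whitespace>length' lines into a name -> length mapping."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def parse_sam_header(text: str) -> Dict[str, int]:
    """Collect SN/LN pairs from the @SQ lines of a SAM/BAM header."""
    sizes = {}
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            sizes[tags["SN"]] = int(tags["LN"])
    return sizes


def compare(bam: Dict[str, int], sizes: Dict[str, int]) -> List[str]:
    """List contigs that are missing on one side or differ in length."""
    problems = []
    for name in sorted(set(bam) | set(sizes)):
        if name not in sizes:
            problems.append(f"{name}: in BAM header only")
        elif name not in bam:
            problems.append(f"{name}: in chrom sizes only")
        elif bam[name] != sizes[name]:
            problems.append(
                f"{name}: length {bam[name]} (BAM) vs {sizes[name]} (chrom sizes)"
            )
    return problems


# Values for chr1 and chr2 copied from the outputs above.
bam_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248387328\n"
    "@SQ\tSN:chr2\tLN:242696752\n"
)
chrom_sizes = "chr1\t248387328\nchr2\t242696752\n"
print(compare(parse_sam_header(bam_header), parse_chrom_sizes(chrom_sizes)))  # prints []
```

An empty list means names and lengths agree for every contig seen in either input. A contig present on only one side is reported but is not necessarily an error.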

I used the same FASTA file that I used to generate the STAR index, and I have also tried regenerating the FASTA from the index directory itself. I also tried modifying the BAM header to remove the "chr" prefix, in which case I get a different error:

    warning: invalid contig chr1
    Traceback (most recent call last):
      File "/Users/ehresms/opt/anaconda3/bin/rgt-THOR", line 8, in <module>
        sys.exit(main())
      File "/Users/ehresms/opt/anaconda3/lib/python3.9/site-packages/rgt/THOR/THOR.py", line 155, in main
        m, exp_data, func_para, init_mu, init_alpha, distr = train_HMM(region_giver, options, bamfiles, genome,
      File "/Users/ehresms/opt/anaconda3/lib/python3.9/site-packages/rgt/THOR/THOR.py", line 63, in train_HMM
        exp_data = initialize(name=options.name, dims=dims, genome_path=genome, regions=train_regions,
      File "/Users/ehresms/opt/anaconda3/lib/python3.9/site-packages/rgt/THOR/dpc_help.py", line 427, in initialize
        regionset.sequences.sort()
    AttributeError: 'NoneType' object has no attribute 'sequences'
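The `invalid contig chr1` warning suggests that, after the rename, some other input still refers to `chr1` while the BAM header now says `1`. As a rough sketch only, and assuming the chromosome sizes file is the input that still carries the prefix (this is a guess and is not confirmed by the traceback), the names could be rewritten so that they match the modified header. The helper `strip_chr_prefix` is hypothetical.

```python
def strip_chr_prefix(chrom_sizes_text: str, prefix: str = "chr") -> str:
    """Return chrom sizes text with `prefix` removed from each contig name.

    Lines with fewer than two fields are dropped. Output is tab-separated.
    """
    out = []
    for line in chrom_sizes_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, length = fields[0], fields[1]
        if name.startswith(prefix):
            name = name[len(prefix):]
        out.append(f"{name}\t{length}")
    return "\n".join(out) + "\n"


# Values copied from the chrNameLength.txt output above.
original = "chr1\t248387328\nchr2\t242696752\n"
print(strip_chr_prefix(original), end="")
```

Whichever naming scheme is chosen, the FASTA, the chromosome sizes file and the BAM header would all need to use it, so renaming only one of them is likely to produce this kind of mismatch.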

I am also running into the same issue when I align to another genome. I tried using hg38 with setupGenomicData.py:

chrom_sizes

/Users/ehresms/rgtdata/hg38/chrom.sizes.hg38

genome

/Users/ehresms/rgtdata/hg38/genome_hg38.fa

And I'm using the chromosome lengths from the STAR index for hg38.

Is there something I'm missing?

Thanks in advance for your help!