CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/

RGT-HINT genome files are incompatible with nf-core ATAC-seq pipeline. #232

Open hyjforesight opened 2 years ago

hyjforesight commented 2 years ago

Hello RGT-HINT, I got the following error when running RGT-HINT footprinting:

rgt-hint footprinting --organism mm10 --paired-end --output-location /mnt/e/HYJ/2022_7_23_ATAC-seq/results/RGT_HINT/ --output-prefix footprint_WT --atac-seq /mnt/e/HYJ/2022_7_23_ATAC-seq/results/bwa/mergedReplicate/control.mRp.clN.sorted.bam /mnt/e/HYJ/2022_7_23_ATAC-seq/results/bwa/mergedReplicate/macs/broadPeak/control.mRp.clN_peaks.broadPeak
Report: The scikit HMM encountered errors when applied. in region (10,52417320,52418086). This iteration will be skipped.

I'm using BAM files generated by the nf-core pipeline, which uses a reference genome downloaded on July 17, 2015. I believe the error above is caused by a coordinate inconsistency between that reference genome and the one RGT-HINT configures from GENCODE vM25.

I think the way to solve this is to replace the files under ~/rgtdata/mm10/ with the genome files the nf-core pipeline used. The pipeline supplies genome.fa, genome.fai, and chrom.sizes, so I can replace genome_mm10.fa, genome_mm10.fa.fai, and chrom.sizes.mm10 under ~/rgtdata/mm10/. I know I can download a gencode.annotation.gtf file matching the nf-core version from GENCODE, but where can I download genes_Gencode_mm10.bed and genes_RefSeq_mm10.bed? Is it also necessary to replace these two files with versions matching the nf-core reference?

Thanks! Best, Yuanjian
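
One way to test the coordinate-inconsistency hypothesis before replacing anything is to compare the two chrom.sizes files directly. The following is a minimal sketch in Python, assuming both files use the usual two-column layout (name, tab, length). The function names are hypothetical and the lengths in the example are made up, not real mm10 values.

```python
def parse_chrom_sizes(text):
    """Parse chrom.sizes content (name<TAB>length per line) into a dict."""
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes


def compare_chrom_sizes(a, b):
    """Report names unique to each side and shared names whose lengths differ."""
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "length_mismatch": sorted(n for n in shared if a[n] != b[n]),
    }


# Made-up contents for illustration only; in practice, read the text from
# ~/rgtdata/mm10/chrom.sizes.mm10 and from the pipeline's chrom.sizes file.
rgt_text = "chr1\t1000\nchr2\t900\n"
pipeline_text = "chr1\t1000\nchr2\t905\nchrM\t50\n"

report = compare_chrom_sizes(parse_chrom_sizes(rgt_text),
                             parse_chrom_sizes(pipeline_text))
print(report)
```

If every shared name has the same length, the two references most likely have the same coordinates for those sequences, and any remaining problem is more likely to come from the names themselves.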

Delta-43 commented 1 year ago

I was facing a similar issue, and it turned out that the chromosome naming was the cause. The default genome files in ~/rgtdata use the UCSC naming convention, whereas my dataset used the NCBI convention: chr1 vs. 1. I fixed it by replacing the default FASTA file, the GTF file, and the chrom.sizes file.
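
The region in the error report above is named 10 rather than chr10, which is consistent with this kind of naming mismatch. A quick way to check which convention a BAM file uses is to look at the @SQ lines of its header (for example, the text printed by `samtools view -H`). Below is a minimal Python sketch that works on the header text; the function names are hypothetical and the header in the example is made up.

```python
def sq_names(header_text):
    """Extract sequence names from the @SQ lines of a SAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def naming_style(names):
    """Classify names as 'ucsc' (chr1), 'ncbi' (1), 'mixed', or 'unknown'."""
    if not names:
        return "unknown"
    prefixed = sum(1 for n in names if n.startswith("chr"))
    if prefixed == len(names):
        return "ucsc"
    if prefixed == 0:
        return "ncbi"
    return "mixed"


# Made-up header for illustration only; lengths are not real mm10 values.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:1000\n"
    "@SQ\tSN:10\tLN:900\n"
    "@SQ\tSN:MT\tLN:50\n"
)

print(naming_style(sq_names(header)))
```

Running the same classification on the names in ~/rgtdata/mm10/chrom.sizes.mm10 shows whether the two sides agree. If they differ, either the reference files or the alignments need to be brought to one convention.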