ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License
55 stars 10 forks source link

Question regarding the 'refgene.bed' file in geneEnrichment analysis #73

Closed KC-Lan closed 11 months ago

KC-Lan commented 11 months ago

Hi, Thank you for developing this useful tool.

In the gene enrichment analysis part, dcHiC generates a refgene.bed file within the geneEnrichment/ directory. I used this BED file in IGV and noticed a one-base shift when compared with the Refseq Genes. According to the Bedtools documentation, the start position in BED files is zero-based, while the GFF/GTF format employs 1-based coordinates for both start and end positions. Upon reviewing the original R code in dcHiC, it appears that the refgene.bed file is derived from the .gtf file downloaded from the UCSC golden path. This might be the reason for the 1 base shift. This 1 base difference also found in other .bed file, e.g. hg38.tss.bed .

Additionally, I observed that certain genes in the refgene.bed file are matched to different chromosomes (Genome: GRCh38/hg38). For instance, the region chr1 34611 36081 associated with the FAM138A gene is also labeled for the FAM138C gene (chr9 34394 35864) and the FAM138F gene (chr19 76220 77690). I am uncertain about the potential impact of these findings on the geneEnrichment analysis. Your insights into this matter would be greatly appreciated.

Thank you for your time and assistance.

dcHiC_refgene.bed.txt

ay-lab commented 11 months ago

Hi,

Yes, that's the reason for the 1-base shift. However, the effect if minimized when you map the tss on the 10Kb or 100Kb bins. Regarding your second question, I have also seen those cases. For example, when I do the following the gtf file itself has those ambiguous records -

zcat hg38.refGene.gtf.gz |grep "FAM138F" |awk '$3 == "transcript" {print}'
chr1    refGene transcript      34611   36081   .       -       .       gene_id "FAM138F"; transcript_id "NR_026820";  gene_name "FAM138F";
chr19   refGene transcript      76220   77690   .       -       .       gene_id "FAM138F"; transcript_id "NR_026820_2";  gene_name "FAM138F";
chr9    refGene transcript      34394   35864   .       -       .       gene_id "FAM138F"; transcript_id "NR_026820_3";  gene_name "FAM138F";

Although the gene names are same, the transcript IDs are different. There is no way to avoid these, unless you generate your own manually curated reference bed. Meanwhile, I did a calculation and find out that there are 63 genes on the reference chromosomes (chr1-22, X) which shows such ambiguity (Didn't check if they have the same transcript id or not). The set in the final analysis will be distributed across the larger bins and is unlikely to cause any issue during gene enrichment analysis.

KC-Lan commented 11 months ago

Thank you for your prompt reply and explanation.