Closed KC-Lan closed 11 months ago
Hi,
Yes, that's the reason for the 1-base shift. However, the effect if minimized when you map the tss on the 10Kb or 100Kb bins. Regarding your second question, I have also seen those cases. For example, when I do the following the gtf file itself has those ambiguous records -
zcat hg38.refGene.gtf.gz |grep "FAM138F" |awk '$3 == "transcript" {print}'
chr1 refGene transcript 34611 36081 . - . gene_id "FAM138F"; transcript_id "NR_026820"; gene_name "FAM138F";
chr19 refGene transcript 76220 77690 . - . gene_id "FAM138F"; transcript_id "NR_026820_2"; gene_name "FAM138F";
chr9 refGene transcript 34394 35864 . - . gene_id "FAM138F"; transcript_id "NR_026820_3"; gene_name "FAM138F";
Although the gene names are same, the transcript IDs are different. There is no way to avoid these, unless you generate your own manually curated reference bed. Meanwhile, I did a calculation and find out that there are 63 genes on the reference chromosomes (chr1-22, X) which shows such ambiguity (Didn't check if they have the same transcript id or not). The set in the final analysis will be distributed across the larger bins and is unlikely to cause any issue during gene enrichment analysis.
Thank you for your prompt reply and explanation.
Hi, Thank you for developing this useful tool.
In the gene enrichment analysis part, dcHiC generates a
refgene.bed
file within thegeneEnrichment/
directory. I used this BED file in IGV and noticed a one-base shift when compared with the Refseq Genes. According to the Bedtools documentation, the start position in BED files is zero-based, while the GFF/GTF format employs 1-based coordinates for both start and end positions. Upon reviewing the original R code in dcHiC, it appears that therefgene.bed
file is derived from the.gtf
file downloaded from the UCSC golden path. This might be the reason for the 1 base shift. This 1 base difference also found in other.bed
file, e.g.hg38.tss.bed
.Additionally, I observed that certain genes in the
refgene.bed
file are matched to different chromosomes (Genome: GRCh38/hg38). For instance, the regionchr1 34611 36081
associated with theFAM138A
gene is also labeled for theFAM138C
gene (chr9 34394 35864) and theFAM138F
gene (chr19 76220 77690). I am uncertain about the potential impact of these findings on the geneEnrichment analysis. Your insights into this matter would be greatly appreciated.Thank you for your time and assistance.
dcHiC_refgene.bed.txt