Open johnlin89 opened 5 months ago
@johnlin89 You are right: LongGF expects a GENCODE GTF and is not compatible with other GTF formats. For reference, LongGF assumes that standard chromosome names look like chr1, chr2, and so on, and it rejects names containing "_" in order to exclude non-standard chromosomes.
Thanks @liuqianhn ! I just wanted to document my investigation and inform others who might run into a similar issue.
Could we perhaps add to the README that the GTF needs to be in GENCODE format? Apologies if I missed that.
TLDR: If you cannot produce any fusions, it may be an issue with your GTF file.
The GTF file:
a) cannot use `_` in chromosome names
b) in column 9, must have a `gene_name` field
c) in column 9, must have either a `gene_type` or a `gene_biotype` field
If you are not getting any fusions and you believe your GTF file abides by a) and b), I believe you can make LongGF more permissive to possibly handle c) if you specify the pseudogene argument (set to something that is NOT 0 or 1). This will cause LongGF to use more GTF entries even if they are pseudogenes.
My workaround was different as detailed below.
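For anyone who wants to check a GTF against the three requirements in the TLDR before running LongGF, here is a minimal sketch in Python. It is not part of LongGF; the function name `check_gtf_lines` and the problem labels are hypothetical, and it assumes attributes in column 9 are written as `key "value";` pairs.

```python
import re
from collections import Counter

# Matches GTF attribute pairs such as: gene_name "DDX11L1";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def check_gtf_lines(lines):
    """Count GTF lines that violate requirements a), b) and c) from the TLDR."""
    problems = Counter()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            problems["malformed"] += 1
            continue
        keys = {key for key, _ in ATTR_RE.findall(cols[8])}
        # a) no "_" in the chromosome name (column 1)
        if "_" in cols[0]:
            problems["underscore_in_chrom"] += 1
        # b) column 9 must carry a gene_name attribute
        if "gene_name" not in keys:
            problems["missing_gene_name"] += 1
        # c) column 9 must carry gene_type or gene_biotype
        if not ({"gene_type", "gene_biotype"} & keys):
            problems["missing_gene_type"] += 1
    return problems
```

Calling it on an open file handle, for example `check_gtf_lines(open("genomic.gtf"))`, returns a counter of how many lines fail each check. An empty counter means none of the three problems was found.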
Background
I was testing LongGF using a human NCBI reference and GTF from https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/ GRCh38.p14 (downloaded the "Genome sequences", i.e. `GCF_000001405.40_GRCh38.p14_genomic.fna`, and the "Annotation features", i.e. `genomic.gtf`). I noticed that no fusions were detected. Since LongGF does not tolerate `_` in chromosome names, I had changed my sorted BAM and GTF file to use the "chr" chromosome naming convention instead of the RefSeq naming convention, i.e. `NC_`. However, I still could not produce any fusions.
Issue
From my investigation, when LongGF processes the GTF file it creates a `_sub_list` for each gene depending on the `gene_type`, which is extracted from the GTF file. The `gene_type` is determined by the fields `gene_type` or `gene_biotype` in the GTF file. A `_sub_list` is only created if the `gene_type` is one of the accepted values in `get_gfFrombam.c`.
Since the GTF file I was using lacks `gene_type` and `gene_biotype`, the `_sub_list` never gets created, even though it is normally necessary for determining possible gene fusion candidates. `_sub_list` is iterated through in `_gtf_struct_.c` --> `_gene_entry_::get_coding_ovlp` to determine `_t_code_reg_ovlp` in `get_gfFrombam.c`.
This is why I did not produce any fusions.
Additionally, the GTF file I was using lacked the `gene_name` field.
Resolution
The GTF file does contain `transcript_biotype` and `gbkey` fields, which seem like they could be used in a similar fashion to `gene_type` or `gene_biotype`. Note - I modified some of the code in a branch and can create a pull request if that is easier.
Possible values are:
transcript_biotype "miRNA"
transcript_biotype "mRNA"
transcript_biotype "ncRNA"
transcript_biotype "rRNA"
transcript_biotype "scRNA"
transcript_biotype "snoRNA"
transcript_biotype "snRNA"
transcript_biotype "transcript"
transcript_biotype "tRNA"
gbkey "CDS"
gbkey "Gene"
gbkey "mRNA"
gbkey "ncRNA"
gbkey "rRNA"
gbkey "tRNA"
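A list of values like the one above can be collected from any GTF with a short script. The following is a rough sketch in Python, not the code from my branch; the function name `attribute_values` is hypothetical, and it assumes column 9 uses `key "value";` pairs.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gbkey "mRNA";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def attribute_values(lines, keys=("transcript_biotype", "gbkey")):
    """Return {key: sorted distinct values} found in column 9 for the given keys."""
    seen = defaultdict(set)
    for line in lines:
        # Skip header/comment lines.
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        for key, value in ATTR_RE.findall(cols[8]):
            if key in keys:
                seen[key].add(value)
    return {key: sorted(values) for key, values in seen.items()}
```

Passing other keys, for example `keys=("gene_type", "gene_biotype")`, shows whether a GTF carries the fields LongGF looks for and which values they take.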
I added the following in `_gtf_struct_.c` and in `get_gfFrombam.c`: