WGLab / LongGF

A computational algorithm and software tool for fast and accurate detection of gene fusion by long-read transcriptome sequencing
GNU General Public License v3.0
22 stars 1 forks source link

no fusions produced seem result from issues with different gtf file format #19

Open johnlin89 opened 5 months ago

johnlin89 commented 5 months ago

TLDR: If you cannot produce any fusions, it may be an issue with your GTF file.

The GTF file:

a) cannot use _ in chromosome names b) in column 9, must have a gene_name field c) in column 9, must have either gene_type or gene_biotype field

If you are not getting any fusions and you believe your GTF file abides by a) and b)....I believe you can make the LongGF more permissive to possibly handle c if you specify the pseudogene argument (set to something that is NOT 0 or 1). This will cause LongGF to use more GTF entries even if they are a pseudogene.

[pseudogene:0(default)/1]: Optional. Default=0: Not use pseudogene from the GTF file.
[Secondary_alignment:0(default)]: Optional. Default=0: not use secondary alignment.

My workaround was different as detailed below.

Background

I was testing using LongGF using a human NCBI reference and GTF from https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/ GRCh38.p14 (downloaded the "Genome sequences" ie GCF_000001405.40_GRCh38.p14_genomic.fna and "Annotation features" ie genomic.gtf). I noticed that no fusions were detected. Since LongGF does not tolerate _ in chromosome names, I had changed my sorted bam and gtf file to use the "chr" chromosome naming convention instead of of the RefSeq naming convention ie NC_. However, I still could not produce any fusions.

Issue

From my investigation, when LongGF processes the GTF file it creates a _sub_list for each gene depending on the gene_type which is abstracted from the gtf file. The gene_type is determined by the fields gene_type or gene_biotype in the gtf file. A _sub_list is only created if the gene_type is one of:

const char * gt_list2[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain"};

from get_gfFrombam.c .

Since the gtf file I was using lacks gene_type and gene_biotype , the _sub_list never gets created and is normally necessary for determining possible gene fusions candidates. _sub_list is iterated through in _gtf_struct_.c-->_gene_entry_::get_coding_ovlp to determine_t_code_reg_ovlp in get_gfFrombam.c.

This is why I did not produce any fusions.

Additionally, the gtf file I was using lacked the gene_name field.

Resolution

The gtf file does contain transcript_biotype and gbkey fields which seem that they could be used in a similar fashion to gene_type or gene_biotype. Note - I modified some of the code in a branch and can create a pull request if easier.

Possible values are:

transcript_biotype "miRNA" transcript_biotype "mRNA" transcript_biotype "ncRNA" transcript_biotype "rRNA" transcript_biotype "scRNA" transcript_biotype "snoRNA" transcript_biotype "snRNA" transcript_biotype "transcript" transcript_biotype "tRNA" gbkey "CDS" gbkey "Gene" gbkey "mRNA" gbkey "ncRNA" gbkey "rRNA" gbkey "tRNA"

I added the following in_gtf_struc.c

while (std::getline(oss, cur_g_str, ' ')){
                 cur_g_str = str_tolower(cur_g_str); // revise to tolerate lowcase and uppercase
                 if (cur_g_str.compare("gene_id")==0){
                     std::getline(oss, cur_g_id, ' ');
                     cur_g_id = cur_g_id.substr(1, cur_g_id.size()-3);
                 }else if (cur_g_str.compare("transcript_id")==0){
                     std::getline(oss, cur_t_id, ' ');
                     cur_t_id = cur_t_id.substr(1, cur_t_id.size()-3);
                 // MODIFICATION added gene field since new gtf replaced gene_name with gene
                 }else if (cur_g_str.compare("gene_name")==0 || cur_g_str.compare("gene")==0){
                     std::getline(oss, cur_g_name, ' ');
                     cur_g_name = cur_g_name.substr(1, cur_g_name.size()-3);
                 // MODIFICATION: added || cur_g_str.compare("gbkey")==0 || cur_g_str.compare("transcript_biotype"==0
                 // MODIFICATION made since the new gtf from NCBI does not have gene_type or gene_biotype so cur_g_type = '' which causes is_g to be false and _sub_list never gets created from the initial gene entry so each gene does not have a _sub_list in m_gene_list. _sub_list is iterated through in _gtf_struct_.c-->_gene_entry_::get_coding_length to determine _t_code_reg_ovlp in get_gfFrombam.c. Since isg_i is false due to lack of gene_type/gene_biotype in the gtf, _sub_list never gets created for each gene and therefore _t_code_reg_ovlp is 0. If _t_code_reg_ovlp is 0 for a given gene entry there is no potential fusion detected and no fusions are detected. Also created gt_list3 in get_gfFrombam.c.
                 }else if (cur_g_str.compare("gene_type")==0 || cur_g_str.compare("gene_biotype")==0 || cur_g_str.compare("gbkey")==0 || cur_g_str.compare("transcript_biotype")==0){// add gene_biotype to consider other types
                     std::getline(oss, cur_g_type, ' ');
                     cur_g_type = cur_g_type.substr(1, cur_g_type.size()-3);

and in get_gfFrombam.c:

int m_check_gene_fusion(const char* in_bam_file, const char* in_gtf_file, const int min_len_ovlp, const int64_t _bin_size, const int _used_pseudogene, const int64_t _min_map_len, const int _used_secondary_alignment, const int _min_sup_read, const int output_flag){
   std::map<std::string, std::map<std::string, std::shared_ptr<_gtf_entry_> > > m_gene_list;
   std::map<std::string, std::string> _gid_to_gn;
   const char * gt_list1[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","IG_pseudogene","IG_C_pseudogene","IG_J_pseudogene","IG_V_pseudogene","TR_V_pseudogene","TR_J_pseudogene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","pseudogene","processed_pseudogene","polymorphic_pseudogene","transcribed_processed_pseudogene","transcribed_unprocessed_pseudogene","transcribed_unitary_pseudogene","translated_processed_pseudogene","translated_unprocessed_pseudogene","unitary_pseudogene","unprocessed_pseudogene","disrupted_domain"};
   int gt_size1 = 30;
   const char * gt_list2[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain"};
   int gt_size2 = 14;
   //MODIFICATION commented the below
   //int gt_size = gt_size2;
   //MODIFICATION added gt_list3 lines below
   const char * gt_list3[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain", "mRNA", "transcript", "CDS", "Gene"};
   int gt_size3 = 18;
   int gt_size = gt_size3;
   const char ** gt_list;
   if (_used_pseudogene==1){
      gt_size = gt_size1;
      gt_list = gt_list1;
   }else if (_used_pseudogene==0){
      gt_size = gt_size2;
      gt_list = gt_list2;
   //MODIFICATION added the chunk below
   }else if (_used_pseudogene==2){
      gt_size = gt_size3;
      gt_list = gt_list3;
   }else{// revise to consider all without filter.
      gt_size = 0;
      gt_list = NULL;
   }
   int retv = read_gtf(m_gene_list, _gid_to_gn, in_gtf_file, gt_list, gt_size);
   //MODIFICATION changed the default value of _used_pseudogene from 0 to 2
   int _used_pseudogene = 2;
liuqianhn commented 5 months ago

@johnlin89 You are right: LongGF is using gencode gtf and not compatible with other gtf format. For the reference, LongGF assumes that the standard chromosome names are like chr1, chr2, and so on, and does not allow "_" to remove non-standard chromosome.

johnlin89 commented 5 months ago

Thanks @liuqianhn ! I just wanted to document my investigation and inform others if having a similar issue.

Could we perhaps add that the gtf needs to be gencode format to the README? Apologies if I missed that.