Open pintoa1-mskcc opened 1 year ago
Hi again,
I've been investigating this issue and I see that when a gene_name does not exist in the gene_bed file, something does wrong with the gene order algorithm which assigns the fusion category/type. Do you all have any suggestions on how to find the issue?
Hey there,
Sorry for going dark for so long. This completely fell off my radar.
From the main Metafusion.sh
script, it looks like gene_bed
is used twice.
That said, it's possible that the error is coming from downstream if there's a problem when reading the transformed data. Perhaps there's an uncaught case that's producing a default value.
Do you happen to know which output file is the first one affected? That might help us pinpoint which function/script needs to be looked at.
Cheers, Michael
The part that I saw having the effect is during reann_cff_fusion.py. I think this is because of faulty logic during an if else statement here: https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/pygeneann_MetaFusion.py#L874-L888
Particularly:
if score1 == max_t2[0] and gname1==self.t_gene1:
During this step, if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out...
Also I just noticed its an if -> if statement for score1, but elsif for score 2
Ah, I remember asking one of the authors about Line 877. Unfortunately, I never got a satisfactory response about that. The equality comparison between score1
and max_t2
looks like typo to me. As you mentioned previously, it would make more sense to compare to max_t1
.
Also I just noticed its an if -> if statement for score1, but elsif for score 2
My guess is that there were multiple contributors and someone noticed the score1 == max_t2[0]
expression. Using two separate ifs makes (slightly) more sense than an else-if... however, neither of them really makes a lot of sense to begin with. If we correct line 877, then both for-loops should work the same because the two if-conditions are mutually exclusive.
if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out
You'd be correct in that regard. If all the scores are 0 and t_gene1
is not in the bed file, then it looks like the self.reann_gene1
becomes ""
. Perhaps there could be a check to prevent this, but I'm not sure what would be a good fallback value.
Upon further analysis, I found that the issue I was facing was actually a dual issue:
If a gene name is not in the gene bed, all genes that are within range of the chromosome:breakpoint are analyzed. Gene name that creates the highest scoring gene order will be selected for reann gene. Any annotations from here on will be based on potentially different gene than originally called
if there is NO entry in the gene bed within range for the called chr:bp, no gene name will be selected for reann.
A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful.
Additionally I think and gname1==self.t_gene1:
this is unnecessary because earlier the tool filtered the input gene list to select the gene_name that is self.t_gene1 if self.t_gene1 is in the gene list.
A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful
I think that should be possible if you edit the code. Since that section of code is a method of CffFusion, you could probably create a new attribute for that flag. However, you'll also need to edit downstream to check for that flag. Alternatively, you could edit the gene name to include a unique suffix.
EDIT:
I think a better way to phrase this:
When a gene_name/symbol from NCBI in the gene info does not belong to the gene bed file (ensembl version outdated so synonyms changed/areas remapped/discovered,etc), a warning is spit out: https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/pygeneann_MetaFusion.py#L1792-L1795
What happens to the gene in this instance? We are seeing a lot of NoHeadGene/truncated annotations. Gene order will probably not change from the caller in these cases... (since score seems to be dependent on the annotation in the gene bed file) Is anything else effected?