ccmbioinfo / MetaFusion

GNU Lesser General Public License v3.0
8 stars 2 forks source link

Gene bed and gene info #12

Open pintoa1-mskcc opened 1 year ago

pintoa1-mskcc commented 1 year ago

EDIT:

I think a better way to phrase this:

When a gene_name/symbol from NCBI in the gene info does not belong to the gene bed file (ensembl version outdated so synonyms changed/areas remapped/discovered,etc), a warning is spit out: https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/pygeneann_MetaFusion.py#L1792-L1795

What happens to the gene in this instance? We are seeing a lot of NoHeadGene/truncated annotations. Gene order will probably not change from the caller in these cases... (since score seems to be dependent on the annotation in the gene bed file) Is anything else effected?

pintoa1-mskcc commented 1 year ago

Hi again,

I've been investigating this issue and I see that when a gene_name does not exist in the gene_bed file, something does wrong with the gene order algorithm which assigns the fusion category/type. Do you all have any suggestions on how to find the issue?

mike8115 commented 1 year ago

Hey there,

Sorry for going dark for so long. This completely fell off my radar.

From the main Metafusion.sh script, it looks like gene_bed is used twice.

https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/MetaFusion.sh#L96-L102

https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/MetaFusion.sh#L113-L116

That said, it's possible that the error is coming from downstream if there's a problem when reading the transformed data. Perhaps there's an uncaught case that's producing a default value.

Do you happen to know which output file is the first one affected? That might help us pinpoint which function/script needs to be looked at.

Cheers, Michael

pintoa1-mskcc commented 1 year ago

The part that I saw having the effect is during reann_cff_fusion.py. I think this is because of faulty logic during an if else statement here: https://github.com/ccmbioinfo/MetaFusion/blob/81df5123ffca922f3c35f0639c640c14badde40e/scripts/pygeneann_MetaFusion.py#L874-L888

Particularly: if score1 == max_t2[0] and gname1==self.t_gene1: During this step, if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out...

Also I just noticed its an if -> if statement for score1, but elsif for score 2

mike8115 commented 1 year ago

Ah, I remember asking one of the authors about Line 877. Unfortunately, I never got a satisfactory response about that. The equality comparison between score1 and max_t2 looks like typo to me. As you mentioned previously, it would make more sense to compare to max_t1.

Also I just noticed its an if -> if statement for score1, but elsif for score 2

My guess is that there were multiple contributors and someone noticed the score1 == max_t2[0] expression. Using two separate ifs makes (slightly) more sense than an else-if... however, neither of them really makes a lot of sense to begin with. If we correct line 877, then both for-loops should work the same because the two if-conditions are mutually exclusive.

if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out

You'd be correct in that regard. If all the scores are 0 and t_gene1 is not in the bed file, then it looks like the self.reann_gene1 becomes "". Perhaps there could be a check to prevent this, but I'm not sure what would be a good fallback value.

pintoa1-mskcc commented 1 year ago

Upon further analysis, I found that the issue I was facing was actually a dual issue:

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful.

pintoa1-mskcc commented 1 year ago

Additionally I think and gname1==self.t_gene1: this is unnecessary because earlier the tool filtered the input gene list to select the gene_name that is self.t_gene1 if self.t_gene1 is in the gene list.

mike8115 commented 1 year ago

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful

I think that should be possible if you edit the code. Since that section of code is a method of CffFusion, you could probably create a new attribute for that flag. However, you'll also need to edit downstream to check for that flag. Alternatively, you could edit the gene name to include a unique suffix.