joined gene names, a possible pitfall to cause incorrect result?

biocyberman commented 8 years ago

Is chanjo aware of these problematic gene names, which may cause various problems for queries based on gene names?

➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|head
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
NOX1,NOX1,NOX1
➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|wc -l
66188

➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|sort|uniq|wc -l
9290
➤ gawk '{print $NF}' ccds.15.grch37p13.extended.bed|grep ','|sort|uniq >problematic.gene.names.txt

biocyberman commented 8 years ago

A test query on NOX1 returned a result, so I guess chanjo does indeed take care of the problem. @robinandeer, could you explain how it does that? Pointing me to the relevant code section would be enough.

robinandeer commented 8 years ago

I'm not quite sure what you mean :/

The only problematic gene names I know of are the ones that exist on both the X and Y chromosomes and have to be given prefixes.

It looks like you are picking out exons that belong to multiple transcripts which all map to the same gene, but the input looks correct :)

Remember that these columns only matter in the loading step. For annotations, only the chrom, start, and end columns matter.
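
One way to check the point about exons shared by several transcripts of the same gene is to look for rows where the comma-joined names are not all identical. The sketch below is only an illustration and is not part of chanjo: `mixed_gene_rows` is a hypothetical helper name, plain `awk` is used in place of `gawk`, and it assumes, as the commands earlier in the thread do, that the gene names sit in the last whitespace-separated column.

```shell
# Hypothetical helper (not part of chanjo): print every row whose last
# column, split on commas, contains more than one distinct name.
# Assumes gene names are the last whitespace-separated field.
mixed_gene_rows() {
    awk '{
        # Split the last field into its comma-separated names.
        n = split($NF, names, ",")
        # Compare every name with the first one; print the row on the
        # first mismatch and move on to the next row.
        for (i = 2; i <= n; i++) {
            if (names[i] != names[1]) { print; next }
        }
    }' "$@"
}

# Usage (file name taken from the thread):
#   mixed_gene_rows ccds.15.grch37p13.extended.bed | wc -l
```

Rows such as `NOX1,NOX1,NOX1` are skipped because every name matches the first. Any rows that are printed join genuinely different names and would be the ones worth inspecting by hand.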

Clinical-Genomics / chanjo

joined gene names, a possible pitfall to cause incorrect result? #178