elhb closed 6 years ago
Yes, I think that should be included, and it's definitely easier.
Still, I think the method in this pull request says something about where the problem comes from (e.g. the one I encountered with the GENCODE annotation).
Should I implement the "dict of added genes" in addition to the method in this pull request, or should we go with that alone?
What I would do is first check whether there are duplicated genes in the input matrix and, if so, warn the user and keep a list of the duplicated genes (very easy to do with a Counter() object). Then we need a set() of "genes replaced": every time a gene is about to be replaced, we check whether the new gene is already in the "genes replaced" set, and if so we also check the list of duplicated genes and output the necessary warning messages. This should deal with duplicated genes in the input matrix and duplicated entries in the annotation file, and give all the necessary information to the user.
Duplicated genes in the input matrix are kept, of course, but when a duplicated entry occurs we can either keep the original annotation (the one that comes in the matrix) or the duplicated one. I suggest we keep the original annotation in that case, what do you think? A simple find and replace would help for this.
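To make the suggestion above concrete, here is a rough sketch of how I read it. It is not the code in this pull request: the function name `convert_ids`, its arguments and the warning texts are all made up for illustration, and it assumes the matrix ids come as a list and the annotation as a plain dict from id to gene name.

```python
from collections import Counter


def convert_ids(matrix_ids, id_to_name):
    """Return (converted_ids, warnings) for a list of matrix gene ids.

    Hypothetical helper: names and message formats are assumptions.
    """
    warnings = []

    # Step 1: find ids that are already duplicated in the input matrix.
    counts = Counter(matrix_ids)
    duplicated_input = {gene for gene, n in counts.items() if n > 1}
    for gene in sorted(duplicated_input):
        warnings.append(f"{gene} occurs {counts[gene]} times in the input matrix")

    # Step 2: replace ids, remembering which gene names were already used.
    replaced = set()
    converted = []
    for gene in matrix_ids:
        name = id_to_name.get(gene)
        if name is None:
            # No annotation entry: keep the id that came with the matrix.
            converted.append(gene)
        elif name in replaced:
            # The name was already used; say where the duplicate comes from
            # and keep the original annotation from the matrix.
            if gene in duplicated_input:
                reason = "id is duplicated in the input matrix"
            else:
                reason = "several ids map to this name in the annotation"
            warnings.append(f"{gene} -> {name} already used ({reason}); keeping {gene}")
            converted.append(gene)
        else:
            replaced.add(name)
            converted.append(name)
    return converted, warnings
```

In this sketch the first id that maps to a name gets renamed and any later id that would produce the same name keeps its original id, which is one way of reading "keep the original annotation".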
Right, what do you think about this? (see last three commits)
Great!
In some cases two (or more) Ensembl IDs in an annotation file map to the same gene name.
If both Ensembl IDs are present in the ST data file, this leads to duplicated gene names after conversion.
This version of scripts/convertEnsemblToNames.py generates a warning when that happens.
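For anyone who wants to check their own data for this situation, the condition can be sketched roughly as below. This is only an illustration, not the code in scripts/convertEnsemblToNames.py: the function name `find_name_collisions` and the dict-based annotation are assumptions.

```python
from collections import defaultdict


def find_name_collisions(data_ids, id_to_name):
    """Group the ids present in the data by the gene name they map to.

    Returns only the names that two or more ids would be converted to.
    Hypothetical helper, not part of the actual script.
    """
    ids_by_name = defaultdict(list)
    for ensembl_id in data_ids:
        name = id_to_name.get(ensembl_id)
        if name is not None:
            ids_by_name[name].append(ensembl_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

Any name returned here would show up more than once among the gene names after conversion, which is the case the warning is meant to flag.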