DarioS closed this issue 7 years ago
Hi Dario,
I'll look into it and get back to you later.
Best, Jan
Now I see that these kinds of symbols have been reformatted as, e.g., CD99-X and CD99-Y in the FASTA file downloadable from the server, so the underscores in the source file would not be a problem. Nonetheless, the duplicate guide sequences are probably still an issue for counting, and the gene IDs are not official HGNC symbols, which could cause problems for GSEA and other such analyses.
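To make the duplicate-sequence concern concrete, here is a minimal sketch of how one might list guide sequences that appear under more than one identifier in a FASTA file. The record names and sequences below are hypothetical placeholders, not entries from the actual library file, and the parser is a simple stand-in rather than anything CRISPRAnalyzeR uses.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def duplicated_sequences(text):
    """Map each sequence found under several identifiers to those identifiers."""
    by_seq = defaultdict(list)
    for name, seq in parse_fasta(text):
        # Compare case-insensitively, since FASTA files mix upper and lower case.
        by_seq[seq.upper()].append(name)
    return {seq: names for seq, names in by_seq.items() if len(names) > 1}


# Hypothetical records for illustration only.
example = """>GENEA-X_guide1
ACGTACGTACGTACGTACGT
>GENEA-Y_guide1
acgtacgtacgtacgtacgt
>GENEB_guide1
TTTTCCCCAAAAGGGGTTTT
"""

print(duplicated_sequences(example))
```

Any sequence reported here would be ambiguous at the read-counting step, whatever the identifiers look like.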
Hi Dario,
The new files will be up and ready soon. Thanks for your help.
Best, Jan
added individual files e531f7161a495f01ced178069b0673a5a3525f63
The conversion of gene symbols in the pseudoautosomal region seems to have mostly worked well, e.g. ASMT_X and ASMT_Y are now simply ASMT, which is good for GSEA. However, I noticed that CD99 somehow has the chromosome symbol prefixed to the guide's nucleotide sequence. I'm not sure why it's different to the other pseudoautosomal region guides.
>CD99_XATCCCCAAGAAACCCAGTGC
xatccccaagaaacccagtgc
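A quick sanity check along these lines could catch such records before they reach the server. This is only a sketch: it assumes guide sequences should contain nothing but A, C, G and T (in either case), and the function name `invalid_records` is made up for illustration.

```python
import re

# Assumption: a valid guide consists solely of A, C, G, T in upper or lower case.
VALID_GUIDE = re.compile(r"[ACGTacgt]+")


def invalid_records(fasta_text):
    """Return (identifier, sequence line) pairs whose sequence has non-ACGT characters."""
    bad = []
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif line and not VALID_GUIDE.fullmatch(line):
            bad.append((name, line))
    return bad


# The record quoted above, exactly as it appears in the file.
record = ">CD99_XATCCCCAAGAAACCCAGTGC\nxatccccaagaaacccagtgc\n"

print(invalid_records(record))
# [('CD99_XATCCCCAAGAAACCCAGTGC', 'xatccccaagaaacccagtgc')]
```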
Hi Dario,
I am sorry, there was a mistake: the x should only be in the identifier, not in the sequence :) Thanks for the hint. I fixed this in e21eecb9c26da6e95d9fff34109942cbaf927c3c
Best, Jan
All guide RNAs for some genes are missing from the pre-made FASTA file. For example, CD99 has two duplicate entries in the GeCKO design:
Notice that every guide RNA for CD99 is duplicated. CRISPRAnalyzeR removes any guide RNA that maps to two or more genes, so CD99 is automatically eliminated from the analysis. This filtering choice leaves numerous genes represented by 0 guides, some of which differ only by an _X or _Y suffix. Perhaps a less conservative approach, copying the total counts of a guide to each gene symbol it is related to, would avoid missing any important genes and leave it to the biologist to resolve gene families or odd design choices like the example shown?
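To illustrate the difference between the two policies, here is a rough sketch. It is not CRISPRAnalyzeR's code; the function names, the read counts, and the GENEB entry are hypothetical, and the filter is only my reading of the behaviour described above.

```python
def drop_shared_guides(seq_counts, seq_to_genes):
    """Conservative policy: keep only sequences that map to exactly one gene."""
    kept = {}
    for seq, count in seq_counts.items():
        genes = seq_to_genes.get(seq, [])
        if len(genes) == 1:
            kept[(genes[0], seq)] = count
    return kept


def copy_counts_to_genes(seq_counts, seq_to_genes):
    """Proposed policy: copy a sequence's total count to every gene it maps to."""
    expanded = {}
    for seq, count in seq_counts.items():
        for gene in seq_to_genes.get(seq, []):
            expanded[(gene, seq)] = count
    return expanded


# One guide shared by the _X and _Y copies of CD99, plus a hypothetical unique guide.
seq_to_genes = {
    "ATCCCCAAGAAACCCAGTGC": ["CD99_X", "CD99_Y"],
    "ACGTACGTACGTACGTACGT": ["GENEB"],  # hypothetical gene and sequence
}
seq_counts = {  # hypothetical read counts
    "ATCCCCAAGAAACCCAGTGC": 120,
    "ACGTACGTACGTACGTACGT": 45,
}

print(drop_shared_guides(seq_counts, seq_to_genes))    # CD99 disappears entirely
print(copy_counts_to_genes(seq_counts, seq_to_genes))  # CD99_X and CD99_Y both get 120
```

With copying, the shared counts are reported twice, so downstream statistics for such genes are not independent; that is the trade-off the biologist would have to resolve.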
Might this also cause problems with identifier conversion, because such symbols aren't standard HGNC symbols and because the default delimiter used for splitting the sequence name is an underscore character?
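A small sketch of the delimiter concern, assuming a hypothetical identifier layout of `<gene>_<guide id>` (the real layout in the library file may differ). Splitting at the first underscore truncates symbols such as CD99_X, while splitting at the last one keeps them whole.

```python
def gene_from_first_underscore(identifier):
    """Take everything before the first underscore as the gene symbol."""
    return identifier.split("_", 1)[0]


def gene_from_last_underscore(identifier):
    """Take everything before the last underscore as the gene symbol."""
    return identifier.rsplit("_", 1)[0]


# Hypothetical identifiers; only the gene symbols come from this thread.
identifiers = ["CD99_X_guide1", "CD99_Y_guide1", "ASMT_guide1"]

print([gene_from_first_underscore(i) for i in identifiers])
# ['CD99', 'CD99', 'ASMT']
print([gene_from_last_underscore(i) for i in identifiers])
# ['CD99_X', 'CD99_Y', 'ASMT']
```

Neither result is a standard HGNC symbol in the second case, so identifier conversion could still fail for CD99_X and CD99_Y even when the split itself is unambiguous.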