boutroslab / CRISPRAnalyzeR

CRISPRAnalyzeR: interactive analysis, annotation and documentation of pooled CRISPR screens
GNU General Public License v2.0

Genes Missing From Pre-made GeCKO v2 FASTA #10

Closed DarioS closed 7 years ago

DarioS commented 7 years ago

All guide RNAs for some genes are missing from the pre-made FASTA file. For example, CD99 has two duplicate entries in the GeCKO design:

 gene_id          UID                  seq
  CD99_X HGLibA_08488 ATACTCACCAGGAAGGGCAT
  CD99_X HGLibA_08489 GATTTATCCGATGCCCTTCC
  CD99_X HGLibA_08490 CTCACCAGCACTGGGTTTCT
  CD99_Y HGLibA_08491 ATACTCACCAGGAAGGGCAT
  CD99_Y HGLibA_08492 GATTTATCCGATGCCCTTCC
  CD99_Y HGLibA_08493 CTCACCAGCACTGGGTTTCT

Notice that every guide RNA for CD99 is duplicated. CRISPRAnalyzeR removes any guide RNAs that map to two or more genes, so CD99 is automatically eliminated from the analysis. This filtering choice leaves numerous genes represented by 0 guides, some of which are simply suffixed with _X and _Y. Perhaps a less conservative approach, copying the total counts of a guide to each gene symbol it is related to, would avoid missing any important genes and leave it to the biologist to resolve gene families or odd design choices like the example shown?
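To make the two strategies concrete, here is a minimal sketch in Python. It is not CRISPRAnalyzeR's actual implementation: the function names are hypothetical and the read counts are made up for illustration. Only the guide table is taken from the example above.

```python
from collections import defaultdict

# Guide table copied from the example above: (gene_id, UID, sequence).
GUIDES = [
    ("CD99_X", "HGLibA_08488", "ATACTCACCAGGAAGGGCAT"),
    ("CD99_X", "HGLibA_08489", "GATTTATCCGATGCCCTTCC"),
    ("CD99_X", "HGLibA_08490", "CTCACCAGCACTGGGTTTCT"),
    ("CD99_Y", "HGLibA_08491", "ATACTCACCAGGAAGGGCAT"),
    ("CD99_Y", "HGLibA_08492", "GATTTATCCGATGCCCTTCC"),
    ("CD99_Y", "HGLibA_08493", "CTCACCAGCACTGGGTTTCT"),
]

# Hypothetical read counts per sequence, for illustration only.
EXAMPLE_COUNTS = {
    "ATACTCACCAGGAAGGGCAT": 120,
    "GATTTATCCGATGCCCTTCC": 95,
    "CTCACCAGCACTGGGTTTCT": 80,
}


def genes_by_sequence(guides):
    """Map each guide sequence to the set of gene IDs annotated with it."""
    mapping = defaultdict(set)
    for gene_id, _uid, seq in guides:
        mapping[seq].add(gene_id)
    return mapping


def drop_multi_gene_guides(guides):
    """Conservative strategy: keep only guides whose sequence maps to one gene."""
    mapping = genes_by_sequence(guides)
    return [guide for guide in guides if len(mapping[guide[2]]) == 1]


def copy_counts_to_all_genes(guides, seq_counts):
    """Alternative strategy: give every gene the full count of each shared guide."""
    return {
        (gene_id, seq): seq_counts.get(seq, 0)
        for gene_id, _uid, seq in guides
    }


kept = drop_multi_gene_guides(GUIDES)
copied = copy_counts_to_all_genes(GUIDES, EXAMPLE_COUNTS)
print(len(kept))    # no CD99 guides survive the conservative filter
print(len(copied))  # every (gene, guide) pair keeps a count
```

With the conservative filter, both CD99_X and CD99_Y end up with 0 guides. With copying, each keeps all three guides, and deciding how to interpret the shared counts is left to the analyst.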

Might this also cause problems with identifier conversion, because such symbols aren't standard HGNC symbols and because the default delimiter used for splitting the sequence name is an underscore character?
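
The delimiter concern can be illustrated with a short Python sketch. It assumes a hypothetical sequence name of the form `<gene_id>_<UID>` and a parser that takes everything before the first underscore as the gene. Both are assumptions for illustration and may not match how CRISPRAnalyzeR actually names or splits identifiers.

```python
def gene_from_name(name, delimiter="_"):
    """Treat everything before the first delimiter as the gene identifier."""
    return name.split(delimiter, 1)[0]


# Hypothetical sequence names built from the gene_id and UID columns above.
names = ["CD99_X_HGLibA_08488", "CD99_Y_HGLibA_08491"]
parsed = [gene_from_name(name) for name in names]
print(parsed)  # both names collapse to the same gene, losing the _X/_Y suffix
```

Under these assumptions, the two entries become indistinguishable after splitting, which is a separate problem from the symbols not being standard HGNC symbols.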

jwinter6 commented 7 years ago

Hi Dario,

I'll have a look into it and come back to you later.

Best Jan

DarioS commented 7 years ago

Now I see that these kinds of symbols have been reformatted as e.g. CD99-X and CD99-Y in the FASTA file downloadable from the server. So, the underscores in the source file would not be a problem. Nonetheless, having duplicate guide sequences is probably still an issue for counting, and the gene IDs are not official HGNC symbols, which could cause an issue for GSEA and other such analyses.

jwinter6 commented 7 years ago

Hi Dario,

The new files will be up and ready soon. Thanks for your help.

Best Jan

jwinter6 commented 7 years ago

Added individual files in e531f7161a495f01ced178069b0673a5a3525f63

DarioS commented 7 years ago

The conversion of gene symbols in the pseudoautosomal region seems to have mostly worked well, e.g. ASMT_X and ASMT_Y are now simply ASMT, which is good for GSEA, but I noticed that CD99 somehow has the chromosome symbol prefixed to the guide's nucleotide sequence. I'm not sure why it's different to the other pseudoautosomal region guides.

>CD99_XATCCCCAAGAAACCCAGTGC
xatccccaagaaacccagtgc

jwinter6 commented 7 years ago

Hi Dario,

I am sorry, there was a mistake: the x should only be in the identifier, not in the sequence :) Thanks for the hint, I fixed this in e21eecb9c26da6e95d9fff34109942cbaf927c3c

Best Jan
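
A stray character like the one reported above can be caught with a simple check that every sequence line contains only A, C, G, or T. The following Python sketch is an illustration, not part of CRISPRAnalyzeR: the function name is hypothetical, and it assumes each record's sequence sits on a single line.

```python
import re

# Guide sequences are expected to contain only the four nucleotides.
VALID_SEQUENCE = re.compile(r"[ACGT]+", re.IGNORECASE)


def invalid_fasta_records(fasta_text):
    """Return (identifier, sequence) pairs whose sequence has non-ACGT characters."""
    bad = []
    identifier = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: remember the identifier for the sequence that follows.
            identifier = line[1:]
        elif not VALID_SEQUENCE.fullmatch(line):
            bad.append((identifier, line))
    return bad


# The record as posted in this thread.
reported = ">CD99_XATCCCCAAGAAACCCAGTGC\nxatccccaagaaacccagtgc\n"
flagged = invalid_fasta_records(reported)
print(flagged)
```

Running this over a whole library file before uploading it would flag the CD99 record, because the leading x is not a valid nucleotide.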