boutroslab / CRISPRAnalyzeR

CRISPRAnalyzeR: interactive analysis, annotation and documentation of pooled CRISPR screens
GNU General Public License v2.0

regex underscore bug in script which produces FASTA_GeckoV2_all.fasta #52

Open stachyra opened 5 years ago

stachyra commented 5 years ago

Hello,

I think that the downloadable file FASTA_GeckoV2_all.fasta contains a regex-related error, repeated 12 times.

In the original .csv files containing the GeCKOv2 sequences (downloadable from www.addgene.org), there are two gene identifiers which contain an underscore in the gene identifier code itself: "CD99_X" and "CD99_Y". Whatever conversion script you are using to transform from these .csv source files to .fasta format probably contains a regex which is not handling these underscores correctly, because the resulting lines in the fasta file look like this:

CD99_XATACTCACCAGGAAGGGCAT
xatactcaccaggaagggcat
CD99_XGATTTATCCGATGCCCTTCC
xgatttatccgatgcccttcc
CD99_XCTCACCAGCACTGGGTTTCT
xctcaccagcactgggtttct
CD99_YATACTCACCAGGAAGGGCAT
yatactcaccaggaagggcat
CD99_YGATTTATCCGATGCCCTTCC
ygatttatccgatgcccttcc
CD99_YCTCACCAGCACTGGGTTTCT
yctcaccagcactgggtttct
CD99_XATCCCCAAGAAACCCAGTGC
xatccccaagaaacccagtgc
CD99_XAGACTCTTACCGGAGGAACT
xagactcttaccggaggaact
CD99_XTTAGGGGATGACTTTGACTT
xttaggggatgactttgactt
CD99_YATCCCCAAGAAACCCAGTGC
yatccccaagaaacccagtgc
CD99_YAGACTCTTACCGGAGGAACT
yagactcttaccggaggaact
CD99_YTTAGGGGATGACTTTGACTT
yttaggggatgactttgactt

That is, the label line is missing a second underscore between the gene ID (CD99_X / CD99_Y) and the 20 bp gRNA sequence, and the sequence line itself has an extra "x" or "y" prepended.
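To illustrate what I would expect instead, here is a minimal Python sketch. I have not seen the actual conversion script, so everything in it is an assumption: the function names, the CSV column names `gene_id` and `seq`, and the standard FASTA `>` header prefix are hypothetical stand-ins. The idea is to build each record from the CSV fields directly, without parsing the gene ID with a regex, and to split a label on the last underscore only when the gene ID has to be recovered later.

```python
import csv
import io
import re
from typing import Iterable, List, Tuple

# Assumed label layout: <gene id>_<20 bp sequence>, with an optional ">" prefix.
LABEL_RE = re.compile(r"^>?.+_[ACGT]{20}$")


def to_fasta_record(gene_id: str, sgrna_seq: str) -> str:
    """Build one two-line FASTA record; the gene id is used verbatim."""
    seq = sgrna_seq.strip()
    return f">{gene_id}_{seq.upper()}\n{seq.lower()}\n"


def split_label(label: str) -> Tuple[str, str]:
    """Recover (gene_id, sequence) by splitting on the LAST underscore only,
    so gene ids such as CD99_X keep their own underscore."""
    gene_id, seq = label.lstrip(">").rsplit("_", 1)
    return gene_id, seq


def csv_to_fasta(csv_text: str) -> str:
    """Convert CSV text with hypothetical columns 'gene_id' and 'seq'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "".join(to_fasta_record(r["gene_id"], r["seq"]) for r in reader)


def find_malformed_labels(labels: Iterable[str]) -> List[str]:
    """Return the label lines that do not end in '_' plus a 20 bp sequence."""
    return [label for label in labels if not LABEL_RE.match(label.strip())]
```

Running `find_malformed_labels` over only the label lines of the existing file should flag the 12 CD99_X / CD99_Y entries listed above, assuming every other label follows the `<gene id>_<20 bp sequence>` pattern.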

Best regards, Andrew L. Stachyra