MPUSP / snakemake-crispr-guides

A Snakemake workflow for the design of small guide RNAs (sgRNAs) for CRISPR applications.
MIT License

When using assembly "GCF_005577135.1", the species name is not parsed correctly, which causes failures in create_bsgenome.R and design_guides.R #19

Closed by ute-hoffmann 8 months ago

ute-hoffmann commented 10 months ago

When using "GCF_005577135.1", the species name is parsed as "Picosynechococcus sp. PCC 11901 chromosome". A valid name for the organism would be "Synechococcus sp. PCC 11901" or "Picosynechococcus sp. PCC 11901" (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2579791). In create_bsgenome.R, this causes

txdb <- quiet_txdb( file = genome_gff, organism = genome_name, chrominfo = seqinfo_genome )$result

to break, which can be fixed by hardcoding the taxonomyId:

txdb <- quiet_txdb( file = genome_gff, organism = genome_name, chrominfo = seqinfo_genome, taxonomyId = 2579791 )$result

The same failure occurs in design_guides.R at:

txdb <- makeTxDbFromGFF( file = genome_gff, organism = unname(genome(seqinfo_genome)[1]), chrominfo = seqinfo_genome )

ute-hoffmann commented 10 months ago

I checked again, and "Picosynechococcus sp. PCC 11901" does not seem to be a valid species name either, so "Synechococcus sp. PCC 11901" would have to be entered instead. That name appears neither in the FASTA headers nor in the downloaded .gff file. A possible solution would be to extract the species field given in the GFF file, although I am not sure whether all GFF files contain this field. For Synechococcus 11901, it is the following line: ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2579791
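
A minimal sketch of that extraction in base R, assuming the GFF header carries a `##species` line whose URL contains an `id=` parameter, as in the line quoted above. The function name `taxid_from_gff` and the `n_header` parameter are hypothetical and not part of the workflow. The function returns NA when the line is missing, so a caller would still need a fallback.

```r
# Hypothetical helper: read the NCBI taxonomy ID from a GFF "##species" line.
# Returns NA_integer_ if no such line is found (not every GFF may have one).
taxid_from_gff <- function(gff_file, n_header = 100L) {
  # Header directives sit at the top of the file, so only the first lines are read.
  header <- readLines(gff_file, n = n_header, warn = FALSE)
  species_line <- grep("^##species[[:space:]]", header, value = TRUE)
  if (length(species_line) == 0L) {
    return(NA_integer_)
  }
  # Pull the digits that follow "id=" out of the taxonomy browser URL.
  hit <- regmatches(
    species_line[1],
    regexpr("(?<=[?&]id=)[0-9]+", species_line[1], perl = TRUE)
  )
  if (length(hit) == 0L) NA_integer_ else as.integer(hit)
}

# Example with a temporary file that mimics the header of the downloaded GFF.
tmp <- tempfile(fileext = ".gff")
writeLines(c(
  "##gff-version 3",
  "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2579791"
), tmp)
taxid <- taxid_from_gff(tmp)
```

The resulting ID could then be passed as taxonomyId to the txdb calls in place of a hardcoded value, provided it is not NA.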

m-jahn commented 9 months ago

good find! will try to reproduce this and see how to fix. Ideally there is an automatic solution such that the user does not need to specify the species name manually

ute-hoffmann commented 9 months ago

When checking, it was also not clear to me whether the downloaded metadata matters at all for downstream analyses. If it does not, a possible solution might be to simply supply a mock species ID or species name and ignore the metadata.

m-jahn commented 9 months ago

@ute-hoffmann check out the dev branch, where this is fixed now.