Alan-Collins / CRISPR_comparison_toolkit

Tools to analyze the differences and similarities between CRISPR arrays
GNU General Public License v3.0
8 stars 2 forks source link

blast and spacerblast #3

Closed ntromas closed 1 year ago

ntromas commented 1 year ago

Hi Alan,

First of all, thanks for such nice pipeline! I tried the blast and spacerblast and I got 2 different issues (that seem to be associated with blast but this is my guess). I installed cctk with conda. I tried with blastn: 2.12.0+ and the initial version installed with conda (blastn: 1.5.0).

For cctk blast, the program ran but no outputs are generated. Not sure what is happening...

For spacerblast, I used the following command (I modified the unique header to remove the special char and then use makeblastdb). cctk spacerblast -d genomad_out_Champ/virus_db -s Spacers_nored.fna -o spacer_virus.txt -t 8 ERROR running blastdbcmd on genomad_out_Champ/virus_db : k141_19678925 2983-3018 plus k141_14048783 1561-1596 plus k141_15935713 689-724 plus k141_27198690 4927-4962 plus k141_97260130 16071-16106 plus k141_25286444 6929-6964 plus k141_4415261 854-889 plus k141_49836241 985-1020 plus k141_85792883 2823-2858 plus k141_17594327 2761-2796 plus k141_83872733 4683-4718 plus k141_21295442 1117-1152 ... Then nothing happened....

If you have any idea, would be awesome!

Cheers,

Nico

Alan-Collins commented 1 year ago

Hi Nico,

Can you confirm which version of cctk you have from conda? I just pushed a fix for a spacerblast bug a few days ago that I think might solve the issue you are having (version 1.0.1).

For the cctk blast issue, no files are written when no spacers are found using the provided CRISPR repeats. In that case you would see the something like the following message printed to stdout No arrays identified from the query file <your_repeats.fna> in the database <your_db>

Can you confirm that there are definitely CRISPR arrays present in the sequences in your database? Does cctk minced find arrays? If so, are you able to share the repeats you are using and a couple of the sequences you are running it on with me so that I can take a look?

Thanks!

ntromas commented 1 year ago

Hi Alan,

You're right I have the 1.0.0 version from conda...Is there a way to upgrade it with conda? (by default, the 1.0.0 version was installed).

From the blast, the program is not doing anything and seem to be still running (>10h). I just got the following message: cctk blast -d seq_colony_Micro -r DR_nored.fa -o CCTK_blast_out

WARNING!!!

In order to associate arrays found in different contigs from the same assembly, you must provide either a list of assembly names or a regex to identify your assemblies. Each sequence in your blastdb will be treated as a separate assembly.

With minced, it is working and the DR I am using were also found with another CRISPR identifier...

Sure, I can share with you that!

Alan-Collins commented 1 year ago

You can update the version installed with conda using the command conda update -c bioconda cctk. You can then confirm the installed version with conda list cctk. Hopefully that fixes the spacerblast issue.

Hmm. cctk blast can take a few minutes if you have a lot of sequences, but it definitely shouldn't be taking hours! How large is the sequence database you are searching with cctk blast?

If it's feasible, it would be most helpful for you to share the whole thing. If it's too big then I'll hopefully be able to reproduce the issue using a subset. You can attach the files here or email them to me at crisprtoolkit@gmail.com. I should be able to take a look this evening and hopefully figure out the problem!

Thanks!

Alan-Collins commented 1 year ago

Thanks for sharing the files. I think I have fixed the issue. Basically, I had never tested cctk blast with such an enormous dataset as you are using (805,275 unique spacers in 6,518 distinct arrays!!!). There was a function that was iterating over every array to find a list of all the unique spacers. I hadn't noticed the slow code with the size of dataset I used. I changed it to use set intersection and now it finishes in about 90 seconds on 8 threads on my personal computer.

The other change I needed to make is to skip making a network of array relationships when there are too many arrays. Making a network necessarily involves doing all pairwise comparisons and with the number of arrays you have that will take forever! cctk minced and cctk blast now skips making a network if you have over 1000 arrays, but you can still generate one using cctk network if you like.

Would you mind testing the changes on your system to confirm that this fixes the problem? You can get the modified code from the dev branch on the github. You should be able to clone the dev branch with git clone -b dev https://github.com/Alan-Collins/CRISPR_comparison_toolkit.git then run the cctk script in the clone repo while inside your cctk conda environment (i.e., path/to/cctk blast <options>)

Alan-Collins commented 1 year ago

Fixed in 1.0.2.