Alan-Collins / CRISPR_comparison_toolkit

Tools to analyze the differences and similarities between CRISPR arrays
GNU General Public License v3.0

IndexError with spacerblast #4

Open almita opened 2 months ago

almita commented 2 months ago

Hi, I'm running cctk from this container using the 1.0.2--pyhdfd78af_0 version. I ran minced without issue and I'm trying to blast the spacer sequences against a set of contigs (I've already run the makeblastdb command on them successfully). However, when I run spacerblast in the container:

cctk spacerblast -d blastdb -s CRISPR_spacers.fna -o spacerblast.txt -t 32

I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/cctk", line 190, in <module>
    main()
  File "/usr/local/bin/cctk", line 181, in main
    spacerblast.main(args)
  File "/usr/local/lib/python3.11/site-packages/cctkpkg/spacerblast.py", line 707, in main
    p = protos[p_count]
        ~~~~~~^^^^^^^^^
IndexError: list index out of range

I tried specifying -p 90 because I thought maybe that was missing, but I got the same error.

Alan-Collins commented 2 months ago

Hi,

Would you be able to share the data that are causing the error? If you can share a sample that has a problem I will take a look and see if I can reproduce the issue.

Thanks!

almita commented 2 months ago

I tried running it again on a smaller scale (the first 1,000 spacers and a database of the first 100 contigs) and there was no error. I was originally running it with 31,959 spacers and a database of 4,040 contigs, so I'm not sure if the amount of data is related to the error?

Alan-Collins commented 2 months ago

Hmmm... It certainly could have something to do with the number of spacers and size of database. I tried testing on the largest database I could put together locally, which is 14,000 spacers and 1350 assemblies (each around 6Mbp). I ran with the same command you described, but didn't see any error. That returned 3,306,444 protospacers, so a pretty big dataset.

Without being able to reproduce the issue there's not much I can do. Perhaps you can try splitting your dataset into batches of spacers to see if you can identify a subset of spacers that cause the issue? If it is a spacer-related issue then I will be happy to look into it more.
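For the batching suggested above, here is a minimal sketch of one way to split a spacer FASTA file into fixed-size batches. The function names `read_fasta` and `write_batches`, the output prefix, and the `.fna` suffix are hypothetical choices for illustration and are not part of cctk.

```python
from itertools import islice
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line and header is not None:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def write_batches(fasta_path, batch_size, out_prefix):
    """Write consecutive batches of `batch_size` records; return the file paths."""
    records = read_fasta(fasta_path)
    paths = []
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        # Hypothetical naming scheme: <out_prefix>_1.fna, <out_prefix>_2.fna, ...
        out_path = Path(f"{out_prefix}_{len(paths) + 1}.fna")
        with open(out_path, "w") as out:
            for header, seq in batch:
                out.write(f">{header}\n{seq}\n")
        paths.append(out_path)
    return paths
```

Each resulting file could then be passed to the same spacerblast command in place of CRISPR_spacers.fna via -s, keeping the database unchanged, to see which batches reproduce the error.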

almita commented 2 months ago

I split the spacers in 2 and the first half (15,979 spacers) ran fine, the second half (15,980 spacers) gave me an error. I split that second half into 2 and both halves (7,990 spacers each) gave me the error. I used the full database of 4040 assemblies for all runs.
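The repeated halving described above could in principle be automated. The following is only a sketch under stated assumptions: `find_failing_record` and the `fails` callback are hypothetical names, and it assumes the error is triggered by at least one individual spacer rather than by the total amount of data or an interaction between spacers. Since both halves of the second half failed here, more than one spacer may be responsible; the sketch returns just one of them.

```python
def find_failing_record(records, fails):
    """Return one record whose presence makes `fails` return True, or None.

    `fails` is a caller-supplied function (hypothetical) that takes a list of
    records and returns True if running the tool on them raises the error.
    Assumes a failure is caused by at least one single record on its own.
    """
    candidates = list(records)
    if not fails(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Keep the first half if it still fails; otherwise, under the
        # assumption above, a culprit must be in the second half.
        candidates = first_half if fails(first_half) else candidates[mid:]
    return candidates[0]
```

In practice `fails` might write the records to a temporary FASTA file, run the spacerblast command on it, and report whether the process exited with a non-zero status. With roughly 16,000 spacers this would take about 14 runs against the full database.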

Alan-Collins commented 2 months ago

Interesting! If you can split it down to a manageable size that you are able to share with me then I am happy to track down the issue. Would you be willing to share some data with me?

Would you mind splitting the assembly db in quarters (or further if you're willing to)? That should be more manageable. We can use this service to share the data (again, if you are willing): https://www.swisstransfer.com/en-us

almita commented 2 months ago

Sure, I split it down to 1500 spacers and 1010 assemblies, is that manageable?

Alan-Collins commented 2 months ago

That should be good. Thank you!

almita commented 2 months ago

Here are the files. It's the fasta file with all the assemblies and another fasta for the spacers: https://www.swisstransfer.com/d/6dccf627-1701-4b6d-8e74-28a6cb300df3