dcouvin / CRISPRCasFinder

A Perl script allowing to identify CRISPR arrays and associated Cas proteins from DNA sequences
https://crisprcas.i2bc.paris-saclay.fr
GNU General Public License v3.0
80 stars 28 forks source link

a particular set of parameters for CCF fails to produce results for all genomes we tried #54

Open azat-badretdin opened 4 months ago

azat-badretdin commented 4 months ago

We have discovered recently that a particular set of parameters (quite sensible, in our opinion) for CCF fails to produce any CRISPR arrays for all genomes we tried.

Here is the command line we used

        perl $installation_root/CRISPRCasFinder.pl \
            -in  "$fasta" \
            -cas -keep \
            -drpt  $installation_root/supplementary_files/repeatDirection.tsv \
            -rpts $installation_root/supplementary_files/Repeat_List.csv \
            -ccvr -dbc $installation_root/supplementary_files/CRISPR_crisprdb.csv \
            -levelMin 3 \
            -mismDRs 20 \
            -minNbSpacers 4 \
            -noMism \
            -truncDR 20 \
            -spSim 10 

without additional numeric parameters and -noMism flag it produces a sensible array for a sample input that I will attach later:

      "Crisprs": [
        {
          "Name": "NZ_BJDJ01000033_1",
          "Start": 1396,
          "End": 2620,
          "DR_Consensus": "GGGTTTAACCTTATTGATTTAACATCCTTCTAAAAC",
          "Repeat_ID": "R5170",
          "DR_Length": 36,
          "Spacers": 18,
          "Potential_Orientation": "-",
          "CRISPRDirection": "-",
          "Evidence_Level": 4,
          "Conservation_DRs": 99.1736881974464,
          "Conservation_Spacers": 0,
          "Regions": [
            {

as you can see output fits the input parameters. I am not sure how truncDr parameter is used since the corresponding metrics is not reported back in JSON output.

azat-badretdin commented 4 months ago

Here is the sample input:

fasta.fna.txt

azat-badretdin commented 4 months ago

For the record: this corresponds to our internal ticket PGAP-9526

azat-badretdin commented 4 months ago

I narrowed this down to requirement -spSim 10

The code that handles it is here:

sub checkspacersAlignMuscle
{
    my($file) = @_;
    my $ident;
    eval
    {
        my $resultAlign = fastaAlignmentMuscleOne($file);

        my $str = Bio::AlignIO->new(-file => $resultAlign);
        my $aln = $str->next_aln();

        $ident = $aln->percentage_identity;
    };

    if($@)
    {
        # An error occurred...
        $ident = 100;
    }

    if($ident >= $SpSim) {return 0;} else {return 1;} # DC - replaced 60 by $SpSim

}

When muscle is called to produce the input alignment for this, it all depends on what muscle is used. Standard muscle that we have produces very gappy alignment, as expected, while muscle that is in the path after I activate conda for CCF produces alignment almost without any gaps:


>2
ATTTTCTTCAATGT-TCTATCA-GGATATACT
>7
CAAACTAGCGGAAG-CAAACAA-GCGCAAGCG
>6
ATTGAGTATGG-TG-GTCATTT-TGAACCAGA
>5
TTCCATAATGGGTT-GGAGACA-AGTAAAGCC
>17
ATAGTCTTGTTGTA-TCTGCTT-AGCGACACT
>15
ATACATCAACCTCA-ATCATCG-AACGTGTTA
>13
GCATAAACATTCCC-ACTAGTA-TCTACTGCT
>3
TTACTCTGCAATCC-GGCAGTA-AAGTTGGTT
>1
TAAGAGACTAGGTG-GCATGGT-TTACGCTGA
>4
AGTTAGGCAAGCTT-ATCCAGG-CACTAGCAT
>12
CGTCGCCACCATTT-GAGTGCA-TGTTCACGT
>16
GGATTCACGGTTGC-AGGCGAA-TCAACACCG
>9
TATCTCGAAAGGGT-ATACGCACGTTATCACT
>11
TAAGTGATATAAAT-AAACATG-TAATCGTTC
>18
TTCATGAATCCAAA-ATTCCCA-AAATTTTGA
>14
AAGGTGGCGACTCG-TTTAAGG-TTCTTATCC
>8
CCTTTTTCATATCTAAATATTG-TAGACCTAG
>10

for the array specified in Issue text. That does not look right. Such artificially aligned input will produce, just randomly identity match about 25% which would be higher than requested 10% max.

I think that this is a crux of the problem here. Something is not right with "CCF-muscle" alignment of spacers.

Note that for "Conservation_Spacers" item in the JSON output a different subroutine is called:

sub sequenceAlignmentMuscle
{
  my($file) = @_;
  my $ident = -1;
  eval
  {
        my $str = Bio::AlignIO->new(-file => $file);
        my $aln = $str->next_aln();

        $ident = $aln->overall_percentage_identity; #overall_percentage_identity;

  };

  if($@)
  {
        # An error occurred...
        $ident = -1;
  }

  return $ident;
}

notice the difference in setting up identities. In one case identity is set to 100 and in this case it is -1. The fact that the output Conservation_Spacers 0 in this case means that we simply never get to this call and it remains set to original value (I do not remember how PERL leaves the original values of uninitialized variables).

Anyway, even if I am wrong here in the cause, it is clear that " -spSim 10" parameter is broken.