WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org
234 stars 359 forks source link

Incorporating a custom db to annovar annotation (CADD-SpliceAI) #127

Closed kvn95ss closed 3 years ago

kvn95ss commented 3 years ago

Hello, hope you are doing well.

We want to incorporate CADD-SpliceAI to our annovar annotation (link - https://cadd.gs.washington.edu/download). The precomputed scores have been provided in TSV format.

As convert2annovar doesn't yet support tsv file format, what would be the best way to make the data into annovar compatible db? Would there be any issues if I just replicated the Pos column to make the file annovar compatible, i.e in format of Chrom, Start, End, Ref, Alt, info

The header of CADD file is as follows -

#Chrom  Pos Ref Alt Type    Length  AnnoType    Consequence etc etc

And with simple cut command, the file can be made into the following format -

#Chrom  Pos Pos(end)    Ref Alt Type    Length  AnnoType    Consequence etc etc

I would like to confirm before testing it myself, as the CADD file is quite huge.

kaichop commented 3 years ago

This is correct, you can use index_annovar to create index so that the search can be speed up.

On Mon, Mar 1, 2021 at 1:32 AM Karthik Nair notifications@github.com wrote:

Hello, hope you are doing well.

We want to incorporate CADD-SpliceAI to our annovar annotation (link - https://cadd.gs.washington.edu/download). The precomputed scores have been provided in TSV format.

As convert2annovar doesn't yet support tsv file format, what would be the best way to make the data into annovar compatible db? Would there be any issues if I just replicated the Pos column to make the file annovar compatible, i.e in format of Chrom, Start, End, Ref, Alt, info

The header of CADD file is as follows -

Chrom Pos Ref Alt Type Length AnnoType Consequence etc etc

And with simple cut command, the file can be made into the following format -

Chrom Pos Pos(end) Ref Alt Type Length AnnoType Consequence etc etc

I would like to confirm before testing it myself, as the CADD file is quite huge.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/WGLab/doc-ANNOVAR/issues/127, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OAOPRJLPT6PHGSR5TTTBMYHXANCNFSM4YL7UXOQ .

kvn95ss commented 3 years ago

I found a script online to index the db, is this the same index_annovar which you are referring to?

#       HOW TO USE
#       perl index.pl inputfile.txt 1000 > inputfile.txt.idx
$N=0;
$bin=$ARGV[1];
my %idxMAX=();
my %idxMIN=();
open I,"<$ARGV[0]";
$KEY='XXX';
$idxMAX{'XXX'}=0;
while(<I>){
#       chomp;
        ($a,$b)=(split/\t/)[0,1];
        $c=$bin*int($b/$bin);
        $length=length($_);
        $N+=$length;
        if($KEY eq "$a\t$c"){
                $idxMAX{$KEY}=$N;
        }else{
                $idxMIN{"$a\t$c"}=$idxMAX{$KEY};
                $KEY="$a\t$c";
                $idxMAX{$KEY}=$N;
        }
}
close I;

$size=`ls -l $ARGV[0]|awk \'{print \$5}\'`;
chomp($size);

print "#BIN\t$bin\t$size\n";
foreach(sort keys %idxMIN){
        if($_ eq 'XXX'){next;}
        if($_=~m/\#/){next;}
        print "$_\t$idxMIN{$_}\t$idxMAX{$_}\n";
}

If not please let me know where I can find the script, I can't seem to find it in the annovar folder.

kaichop commented 3 years ago

I attached it. It might be the same.

On Mon, Mar 1, 2021 at 11:56 PM Karthik Nair notifications@github.com wrote:

I found a script online to index the db, is this the same index_annovar which you are referring to?

HOW TO USE

perl index.pl inputfile.txt 1000 > inputfile.txt.idx

$N=0; $bin=$ARGV[1]; my %idxMAX=(); my %idxMIN=(); open I,"<$ARGV[0]"; $KEY='XXX'; $idxMAX{'XXX'}=0; while(){

chomp;

    ($a,$b)=(split/\t/)[0,1];
    $c=$bin*int($b/$bin);
    $length=length($_);
    $N+=$length;
    if($KEY eq "$a\t$c"){
            $idxMAX{$KEY}=$N;
    }else{
            $idxMIN{"$a\t$c"}=$idxMAX{$KEY};
            $KEY="$a\t$c";
            $idxMAX{$KEY}=$N;
    }

} close I;

$size=ls -l $ARGV[0]|awk \'{print \$5}\'; chomp($size);

print "#BIN\t$bin\t$size\n"; foreach(sort keys %idxMIN){ if($ eq 'XXX'){next;} if($=~m/#/){next;} print "$\t$idxMIN{$}\t$idxMAX{$_}\n"; }

If not please let me know where I can find the script, I can't seem to find it in the annovar folder.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/WGLab/doc-ANNOVAR/issues/127#issuecomment-788584989, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OEDVVDW3ARGJ5IWFVLTBRVYHANCNFSM4YL7UXOQ .

!/usr/bin/env perl

use warnings; use strict; use Pod::Usage; use Getopt::Long;

our $VERSION = '$Revision: b2f6d23398f06abe010d01f970d58d96a739009a $'; our $LAST_CHANGED_DATE = '$LastChangedDate: 2012-10-23 23:32:05 -0700 (Tue, 23 Oct 2012) $';

our ($verbose, $help, $man); our ($dbfile); our ($filetype, $bin, $outfile, $skipsort, $commentfile);

GetOptions('verbose|v'=>\$verbose, 'help|h'=>\$help, 'man|m'=>\$man, 'filetype=s'=>\$filetype, 'bin=i'=>\$bin, 'outfile=s'=>\$outfile, 'skipsort'=>\$skipsort, 'commentfile=s'=>\$commentfile) or pod2usage ();

$help and pod2usage (-verbose=>1, -exitval=>1, -output=>*STDOUT); $man and pod2usage (-verbose=>2, -exitval=>1, -output=>*STDOUT); @ARGV or pod2usage (-verbose=>0, -exitval=>1, -output=>*STDOUT); @ARGV == 1 or pod2usage ("Syntax error");

($dbfile) = @ARGV;

$filetype ||= 'A'; $filetype =~ m/^[ABC]$/ or pod2usage ("Error in argument: the -filetype argument can be only 'A' or 'B' or 'C'"); $bin ||= 1000; $outfile ||= "$dbfile.newdb";

print STDERR "NOTICE: the bin size is set as $bin (use -bin to change this)\n"; print STDERR "NOTICE: Two output files will be generated for use by ANNOVAR: $outfile and $outfile.idx (use -outfile to override)\n";

if (not $skipsort) {

step 1: generate the new output file

print STDERR "NOTICE: Running the first step of indexing (generating $outfile) ...\n";

if ($dbfile eq $outfile) {
    die "Error: your -outfile is identical to input file. Use -skipsort if you are sure that inputfile is sorted\n";
}

my $command;
#$command = "echo -n > $outfile";   #create a new empty file
#system ($command);

if (defined $commentfile) {
    $command = qq{grep -P '^#' $commentfile > $outfile};        #keep the comment lines in the output file
    system ($command);
    print STDERR "NOTICE: Adding comments from commentfile by <$command>\n";
} else {
    $command = qq{grep -P '^#' $dbfile > $outfile};     #keep the comment lines in the output file
    system ($command);
}

for my $i (1 .. 22, 'X', 'Y', 'M', 'MT') {  
    if ($filetype eq 'A') {
        $command = qq#grep -P '^(chr)?$i\\t\\d+' $dbfile | sort -n -k 2 >> $outfile#;
    } elsif ($filetype eq 'B') {
        $command = qq#grep -P '^\\w+\\t(chr)?$i\\t\\d+' $dbfile | sort -n -k 3 >> $outfile#;
    } elsif ($filetype eq 'C') {
        $command = qq#grep -P '^\\w+\\t\\w+\\t(chr)?$i\\t\\w+\\t\\d+' $dbfile | sort -n -k 5 >> $outfile#;
    }
    $verbose and print STDERR "NOTICE: Running command: $command\n";
    system ($command);
}

} else { my $command;

step 1: generate the new output file

if ($dbfile ne $outfile) {
    if ($commentfile) {
        print STDERR "NOTICE: Running the first step of indexing (combining $commentfile and $dbfile to generate $outfile) ...\n";
        $command = qq{grep -P '^#' $commentfile > $outfile};        #keep the comment lines in the output file
        print STDERR "NOTICE: Running <$command>\n";
        system ($command);
        $command = qq{grep -v -P '^#' $dbfile >> $outfile};     #keep the comment lines in the output file
        print STDERR "NOTICE: Running <$command>\n";
        system ($command);
    } else {
        print STDERR "NOTICE: Running the first step of indexing (copying $dbfile to $outfile) ...\n";
        system ("cp $dbfile $outfile") and die "Error: cannot run system command 'cp $dbfile $outfile'\n";
    }
} else {
    if (defined $commentfile) {
        pod2usage ("Error in argument: -outfile must be different from input file when --commentfile is specified");
    }
}

}

step 2: generate the index file

print STDERR "NOTICE: Running the second step of indexing (generating $outfile.idx) ...\n"; $dbfile = $outfile; #now the dbfile is the newdb generated in step 1 my $filesize = -s $dbfile; my %region = (); my ($offset, $lastregion, $firstoffset, $firstline) = (0, undef, 0, undef);

open (DB, $dbfile) or die "Error: cannot read from dbfile $dbfile: $!\n"; open (IDX, ">$dbfile.idx") or die "Error: cannot write to index file $dbfile.idx: $!\n";

print IDX "#BIN\t$bin\t$filesize\n";

while () { my ($chr, $start); my $length = length ($_); s/[\r\n]+$//;

if (m/^#/) {        #comment line is skipped
    $offset += $length;
    next;
}

if ($filetype eq 'A') {
    ($chr, $start) = split (/\t/, $_);
} elsif ($filetype eq 'B') {
    (undef, $chr, $start) = split (/\t/, $_);
    $start++;       #UCSC use zero-start
} elsif ($filetype eq 'C') {
    (undef, undef, $chr, undef, $start) = split (/\t/, $_);
    $start++;
}
defined $start or die "Error: unable to find start site from input line <$_>\n";
if (not $start =~ m/^\d+$/) {
    warn "Error: the start site ($start) is not a positive integer in input line <$_> (trying to convert to integer...)\n";
    $start = int ($start);
    $start > 0 or next;
}

my $curbin = $start - ( $start % $bin );
my $region = "$chr\t$curbin";
$region{ $region }{ 'min' }   = $offset unless defined( $region{ $region }{ 'min' } );
$region{ $region }{ 'max' }   = $offset + $length;

$offset =~ m/000$/ and print STDERR sprintf("NOTICE: Indexing $dbfile: %d%%\r", int(100*$offset/$filesize));
$offset += $length;

}

for my $k ( sort {$a cmp $b} keys %region ) { print IDX join("\t", $k, $region{ $k }{ 'min' }, $region{ $k }{ 'max' }), "\n"; }

print STDERR "\nDone!\n"; close(DB); close(IDX);

=head1 SYNOPSIS

index_annovar.pl [arguments]

Optional arguments: -h, --help print help message -m, --man print complete documentation -v, --verbose use verbose output --filetype <A|B|C> file type (default: A) --bin BIN size (default: 1000) --outfile prefix of output file name --skipsort skip the pre-sorting procedure --commentfile provie comment lines (starting with #) from a comment file

Function: generate index for ANNOVAR database files. type A start with chr, type B starts with bin

Example: index_annovar.pl tempdb/hg19_cg69.txt -outfile humandb/hg19_cg69.txt index_annovar.pl tempdb/hg19_snp131.txt -outfile humandb/hg19_snp131.txt -filetype B

Version: $LastChangedDate: 2012-10-23 23:32:05 -0700 (Tue, 23 Oct 2012) $

WARNING: THIS PROGRAM IS STILL IN DEVELOPMENT PHASE AND MAY CONTAIN BUGS !

=head1 OPTIONS

=over 8

=item B<--help>

print a brief usage message and detailed explanation of options.

=item B<--man>

print the complete manual of the program.

=item B<--verbose>

use verbose output.

=back

=head1 DESCRIPTION

This program will generate a new database file as well as an index file, given a user-specified database.

The file type A, B and C are explained below:

=over 8

A: first two tab-delimited fields are chr and start

B: first three tab-delimited fields are anything, chr and start

C: first five tab-delimited fields are anything, anything, chr, anything and start

=back

=cut

kvn95ss commented 3 years ago

I get this error when using the script (I just executed perl index_annovar -h)

Global symbol "@argv" requires explicit package name (did you forget to declare "my @argv"?) at index_annovar.pl line 20.
Global symbol "@argv" requires explicit package name (did you forget to declare "my @argv"?) at index_annovar.pl line 21.
Global symbol "@argv" requires explicit package name (did you forget to declare "my @argv"?) at index_annovar.pl line 23.

Any package which I'm missing? Couldn't figure from the script

kvn95ss commented 3 years ago

There is another issue with tsvs, in the dataset deletions and insertions are shown as such -

>#data in tsv
chr1     x     y     TA     T
chr1     a     b     C     CT

If such variants are present in vcf, convert2annovar changes the entries as follows -

>#data from convert2annovar
chr1     x-1     y-1     A     -
chr1     a+x     b+x     -     T

Where - is used to show deletion or insertion depending upon the ref/alt column. If variants are described in the tsv schema are present in the genericdb, would they still be considered appropriately or could it lead to issues/mismatches during annotation? Would it be better to follow the second shown convention for genericdb?

kaichop commented 3 years ago

I do not know why you need to use convert2annovar. Users are not supposed to use it in most circumstances.

But for the specific issue below, you can add --keepindelref argument in convert2annovar to avoid the behavior.

On Wed, Mar 3, 2021 at 4:25 AM Karthik Nair notifications@github.com wrote:

There is another issue with tsvs, in the dataset deletions and insertions are shown as such -

data in tsv

chr1 x y TA T chr1 a b C CT

If such variants are present in vcf, convert2annovar changes the entries as follows -

data from convert2annovar

chr1 x-1 y-1 A - chr1 a+x b+x - T

Where - is used to show deletion or insertion depending upon the ref/alt column. If variants are described in the tsv schema are present in the genericdb, would they still be considered appropriately or could it lead to issues/mismatches during annotation? Would it be better to follow the second shown convention for genericdb?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/WGLab/doc-ANNOVAR/issues/127#issuecomment-789569963, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OEPUKGBWFHGNHEIH53TBX6APANCNFSM4YL7UXOQ .

kvn95ss commented 3 years ago

Thank you for the inputs! It really helped me a lot.

dianacornejo commented 1 year ago

@kvn95ss Hello, I'm facing a similar issue. I just wonder how you dealt with supplying the CADD db (v.1.6 in our case) to annovar so that it can annotate the scores correctly? Thank you

dianacornejo commented 1 year ago

@argv

Hello @kvn95ss how did you solve this issue?