awilfert / PSAP-pipeline

14 stars 9 forks source link

Problem with PSAP annotation #11

Closed martiliasf closed 7 years ago

martiliasf commented 7 years ago

NOTICE: Multianno output file is written to annotated/project.avinput.hg19_multianno.txt PROGRESS: Extracting individual IDs PROGRESS: Starting PSAP annotation [1] "/pine/scr/m/a/martyf/project_files" [1] "/pine/scr/m/a/martyf/project_files" [1] "/pine/scr/m/a/martyf/project_files" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted PROGRESS: Generating report file for all individuals [1] "databases loaded" [1] "analysing affected individuals" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'annotated/project_123-12345_popStat.txt': No such file or directory Execution halted

My hunch is that this is a problem with the file paths not working properly.

martiliasf commented 7 years ago

Additional information - the data is from WGS, and the VCF file has 3 subjects, all cases.

awilfert commented 7 years ago

Can you post the arguments you provided the script and the sample IDs in the header file?

martiliasf commented 7 years ago

Absolutely! Here is my latest attempt (I tried renamining the subjects IDs to not start with numbers as described by @JonathanRios1 .

This is the command I used:

individual_psap_pipeline.sh /pine/scr/m/a/martyf/PASH_files/pash3nh.vcf pash_3 /pine/scr/m/a/martyf/PASH_files/pash3nh.ped

Here is the .ped file

pash1030 pash1030 0 0 1 2
pash1035 pash1035 0 0 1 2
pash1036 pash1036 0 0 1 2

Here are the sample IDs in the header file:

[martyf@longleaf-login1 PASH_files]$ bcftools view -h pash3nh.vcf | tail -1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  pash1030        pash1035        pash1036

And here is the error output of this run:

NOTICE: Multianno output file is written to annotated/pash_3.avinput.hg19_multianno.txt PROGRESS: Extracting individual IDs PROGRESS: Starting PSAP annotation [1] "/pine/scr/m/a/martyf/PASH_files" [1] "/pine/scr/m/a/martyf/PASH_files" [1] "/pine/scr/m/a/martyf/PASH_files" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'NA': No such file or directory Execution halted PROGRESS: Generating report file for all individuals [1] "databases loaded" [1] "analysing affected individuals" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'annotated/pash_3_pash1030_popStat.txt': No such file or directory Execution halted

awilfert commented 7 years ago

Based on the error it looks like a file is missing, but it's unclear which one. Can you check to make sure the pash_3.avinput.hg19_multianno.txt (should be in the annotated/ directory) and pash_3.avinput.header files were created? And then check that all of the lookup tables, etc. that are listed on github are also present in the psap/lookups/ directory? Also did you get any errors or warnings from the ANNOVAR annotation step?

martiliasf commented 7 years ago

Thanks for taking the time to debug this with me!

It seems that the .header file is not present in the annotated directory.

[martyf@longleaf-login1 annotated]$ pwd
/pine/scr/m/a/martyf/PASH_files/annotated
[martyf@longleaf-login1 annotated]$ dir
pash_3.avinput.hg19_multianno.txt

Although I just noticed its in the directory below that one:

[martyf@longleaf-login1 PASH_files]$ pwd
/pine/scr/m/a/martyf/PASH_files
[martyf@longleaf-login1 PASH_files]$ dir *.header
pash3.header

Here are the contents of the lookups directory

[martyf@longleaf-login1 lookups]$ pwd
/nas02/apps/psap-2.0/psap/lookups
[martyf@longleaf-login1 lookups]$ dir
blacklist_032115.txt                                  full.lof.pCADD.gencodeV19.allsites.txt.gz
blacklist_122814.txt                                  gene_coverage_stats_final_12172014.txt
full.chet.pCADD.gencodeV19.allsites.txt.gz            hg19_mac63kFreq_ALL.txt.gz
full.chet.pCADD.gencodeV19.recalibrated.allsites.txt  hgmd_pro_2013_4.12202014.annotated.txt
full.het.pCADD.gencodeV19.allsites.txt.gz             kldhom.txt
full.hom.pCADD.gencodeV19.allsites.txt.gz             lookup_genes.txt
[martyf@longleaf-login1 lookups]$ 

it looks like everything thats in https://github.com/awilfert/PSAP-pipeline/tree/master/lookups is there ... and then some more files.

For your reference, this is the rest of the output - I don't see any errors / warnings that I thought were worth reporting.

[martyf@longleaf-login1 20170306_psap]$ cat 20170307.out 
/pine/scr/m/a/martyf/20170306_psap
/pine/scr/m/a/martyf/PASH_files
PROGRESS: Converting VCF file to annovar input
NOTICE: Read 8593717 lines and wrote 4502033 different variants at 8593569 genomic positions (7116183 SNPs and 1477386 indels)
NOTICE: Among 8593569 different variants at 8593569 positions, 3510604 are heterozygotes, 991429 are homozygotes
NOTICE: Among 7116183 SNPs, 4775480 are transitions, 2331963 are transversions (ratio=2.05), 8740 have more than 2 alleles
ls: cannot access /pine/scr/m/a/martyf/PASH_files/*/: No such file or directory
Creating directory annotated/ to store annotation and analysis output
PROGRESS: Annotating data with ANNOVAR
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=wgEncodeGencodeBasicV19

NOTICE: Running with system command <annotate_variation.pl -geneanno -buildver hg19 -dbtype wgEncodeGencodeBasicV19 -outfile annotated/pash_3.avinput.wgEncodeGencodeBasicV19 -exonsort pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/ -separate>
NOTICE: Output files were written to annotated/pash_3.avinput.wgEncodeGencodeBasicV19.variant_function, annotated/pash_3.avinput.wgEncodeGencodeBasicV19.exonic_variant_function
NOTICE: Reading gene annotation from /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_wgEncodeGencodeBasicV19.txt ... Done with 95929 transcripts (including 38291 without coding sequence annotation) for 42594 unique genes
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Finished analyzing 1000000 query variants
NOTICE: Finished analyzing 2000000 query variants
NOTICE: Finished analyzing 3000000 query variants
NOTICE: Finished analyzing 4000000 query variants
NOTICE: Reading FASTA sequences from /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_wgEncodeGencodeBasicV19Mrna.fa ... Done with 25965 sequences
WARNING: A total of 303 sequences will be ignored due to lack of correct ORF annotation
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Finished analyzing 1000000 query variants
NOTICE: Finished analyzing 2000000 query variants
NOTICE: Finished analyzing 3000000 query variants
NOTICE: Reading FASTA sequences from /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_wgEncodeGencodeBasicV19Mrna.fa ... Done with 11358 sequences
WARNING: A total of 303 sequences will be ignored due to lack of correct ORF annotation
-----------------------------------------------------------------
NOTICE: Processing operation=f protocol=mac63kFreq_ALL

NOTICE: Running system command <annotate_variation.pl -filter -dbtype mac63kFreq_ALL -buildver hg19 -outfile annotated/pash_3.avinput pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/>
NOTICE: the --dbtype mac63kFreq_ALL is assumed to be in generic ANNOVAR database format
NOTICE: Variants matching filtering criteria are written to annotated/pash_3.avinput.hg19_mac63kFreq_ALL_dropped, other variants are written to annotated/pash_3.avinput.hg19_mac63kFreq_ALL_filtered
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_mac63kFreq_ALL.txt...Done
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_mac63kFreq_ALL.txt...Done
-----------------------------------------------------------------
NOTICE: Processing operation=f protocol=esp6500si_all

NOTICE: Running system command <annotate_variation.pl -filter -dbtype esp6500si_all -buildver hg19 -outfile annotated/pash_3.avinput pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/>
NOTICE: the --dbtype esp6500si_all is assumed to be in generic ANNOVAR database format
NOTICE: Variants matching filtering criteria are written to annotated/pash_3.avinput.hg19_esp6500si_all_dropped, other variants are written to annotated/pash_3.avinput.hg19_esp6500si_all_filtered
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Database index loaded. Total number of bins is 191802 and the number of bins to be scanned is 106972
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_esp6500si_all.txt...Done
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Database index loaded. Total number of bins is 191802 and the number of bins to be scanned is 52459
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_esp6500si_all.txt...Done
-----------------------------------------------------------------
NOTICE: Processing operation=f protocol=1000g2014sep_all

NOTICE: Running system command <annotate_variation.pl -filter -dbtype 1000g2014sep_all -buildver hg19 -outfile annotated/pash_3.avinput pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/>
NOTICE: Variants matching filtering criteria are written to annotated/pash_3.avinput.hg19_ALL.sites.2014_09_dropped, other variants are written to annotated/pash_3.avinput.hg19_ALL.sites.2014_09_filtered
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Database index loaded. Total number of bins is 2665525 and the number of bins to be scanned is 1415316
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_ALL.sites.2014_09.txt...Done
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Database index loaded. Total number of bins is 2665525 and the number of bins to be scanned is 950078
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_ALL.sites.2014_09.txt...Done
-----------------------------------------------------------------
NOTICE: Processing operation=f protocol=snp137

NOTICE: Running system command <annotate_variation.pl -filter -dbtype snp137 -buildver hg19 -outfile annotated/pash_3.avinput pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/>
NOTICE: Variants matching filtering criteria are written to annotated/pash_3.avinput.hg19_snp137_dropped, other variants are written to annotated/pash_3.avinput.hg19_snp137_filtered
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Database index loaded. Total number of bins is 2856306 and the number of bins to be scanned is 1416246
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_snp137.txt...Done
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Database index loaded. Total number of bins is 2856306 and the number of bins to be scanned is 1053377
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_snp137.txt...Done
-----------------------------------------------------------------
NOTICE: Processing operation=f protocol=cadd
NOTICE: Finished reading 2 column headers for '-dbtype cadd'

NOTICE: Running system command <annotate_variation.pl -filter -dbtype cadd -buildver hg19 -outfile annotated/pash_3.avinput pash_3.avinput /nas/longleaf/apps/annovar/20160201/annovar/humandb/ -otherinfo -otherinfo>
NOTICE: the --dbtype cadd is assumed to be in generic ANNOVAR database format
NOTICE: Variants matching filtering criteria are written to annotated/pash_3.avinput.hg19_cadd_dropped, other variants are written to annotated/pash_3.avinput.hg19_cadd_filtered
NOTICE: Processing next batch with 5000000 unique variants in 5000000 input lines
NOTICE: Database index loaded. Total number of bins is 14293554 and the number of bins to be scanned is 3333364
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_cadd.txt...Done
NOTICE: Processing next batch with 3593569 unique variants in 3593569 input lines
NOTICE: Database index loaded. Total number of bins is 14293554 and the number of bins to be scanned is 2414470
NOTICE: Scanning filter database /nas/longleaf/apps/annovar/20160201/annovar/humandb/hg19_cadd.txt...Done
-----------------------------------------------------------------
NOTICE: Multianno output file is written to annotated/pash_3.avinput.hg19_multianno.txt
awilfert commented 7 years ago

So, I think the problem has to do with the name of your header file. The R scripts are expecting it to have the format .avinput.header, and this is how it's coded, but for some reason your header file is just .header.... can you check line 49 in the individual_psap_pipeline.sh? The code there should look like this: "grep '#' $VCF | tail -n 1 > ${OUTFILE}.avinput.header" .

martiliasf commented 7 years ago

Gah. I apologize, that dir *.header output was a red herring - that header file exists because I used bcftools rehead to modify the sample IDs. .... long story short, the script did create pash_3.avinput.header and pash_3.avinput

Line 49 is fine.

should the .avinput files be in /annotated ?

I'm going to try running it on a clean directory. OK, i'll report back in 6-8 hours.

martiliasf commented 7 years ago

OK. New run, I've attached all the files. Let me know if you'd rather I paste them in the text box.

20170308.txt psap_script.txt

And here are the directories:

[martyf@longleaf-login2 20170308_source]$ pwd
/pine/scr/m/a/martyf/20170308_source
[martyf@longleaf-login2 20170308_source]$ ls -lh
total 2.2G
drwxr-xr-x 2 martyf its_graduate_psx 4.0K Mar  8 18:48 annotated
-rw-r--r-- 1 martyf its_graduate_psx 1.2G Mar  8 14:02 pash3nh_out.avinput
-rw-r--r-- 1 martyf its_graduate_psx   73 Mar  8 14:02 pash3nh_out.avinput.header
-rw-r--r-- 1 martyf its_graduate_psx   79 Mar  8 13:55 pash3nh.ped
-rw-r--r-- 1 martyf its_graduate_psx 982M Mar  8 13:55 pash3nh.vcf
-rw-r--r-- 1 martyf its_graduate_psx  440 Mar 10 09:22 vcf_directory.txt
/pine/scr/m/a/martyf/20170308_source/annotated
[martyf@longleaf-login2 annotated]$ ls -lh
total 1.9G
-rw-r--r-- 1 martyf its_graduate_psx 1.9G Mar  8 18:48 pash3nh_out.avinput.hg19_multianno.txt
awilfert commented 7 years ago

Based on what you've shared it looks like everything should be working. Can you go to the annotate_popStat_individual file and add some kind of print statement before each call to scan() or read.table() so we can figure out which file/argument is giving us trouble.

martiliasf commented 7 years ago

Wow. So somehow the line here:

Rscript ${PSAP_PATH}psap/RScripts/apply_popStat_individual.R ${OUTFILE}.avinput $i $PSAP_PATH $PED_FILE &

had the $PED_FILE section dropped in my version of individual_psap_pipeline.sh, so the arg[4] command trying to find the ped_file was the one throwing the error

Sorry for the walls of text, and thank you for your time and assistance.