awilfert / PSAP-pipeline

Difficulties running script #3

Closed martiliasf closed 7 years ago

martiliasf commented 7 years ago

We're having trouble getting the scripts to run. Do you have any example data you can provide to ensure we have everything in the proper format? Also, which version of R is required? Some specific lines of code where we think we are having issues -

In individual_psap_pipeline.sh:

line 38:

cd $FILE_LOC # Use location of VCF file as working directory, this is where all output will be written

$FILE_LOC is a file, not a directory, so cd fails. Will it affect anything downstream if I comment this out?
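One way to keep the `cd` step working when `$FILE_LOC` points at the VCF itself is to use the path's parent directory instead. This is a minimal sketch only: the helper name `resolve_workdir` and the example path are hypothetical, and it is not necessarily how the maintainers changed the script.

```shell
#!/bin/sh
# Hypothetical helper: print the directory to work in, given a path that
# may be either a directory or a file (such as a VCF) inside that directory.
resolve_workdir() {
    if [ -d "$1" ]; then
        # Already a directory: use it as-is.
        printf '%s\n' "$1"
    else
        # Treat the path as a file and use its parent directory.
        dirname "$1"
    fi
}

# Illustrative path only; substitute the real VCF location.
FILE_LOC="/tmp/psap_demo/sample.vcf"
WORKDIR="$(resolve_workdir "$FILE_LOC")"
echo "Output would be written to: $WORKDIR"
# In the pipeline this would be followed by something like:
#   cd "$WORKDIR" || exit 1
```

Quoting the variable and checking the result of `cd` also prevents the script from silently continuing in the wrong directory.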

line 71:

Rscript ${PSAP_PATH}psap/RScripts/apply_popStat_individual.R ${OUTFILE}.avinput $i $PSAP_PATH &

Does this require a fourth argument corresponding to the pedigree file?

Say:

Rscript ${PSAP_PATH}psap/RScripts/apply_popStat_individual.R ${OUTFILE}.avinput $i $PSAP_PATH $3 &

In apply_popStat_individual.R

line 9:

ped = read.table(args[4],stringsAsFactors=F,sep="\t")

Should my pedigree file be tab-delimited?

line 26:

exome.raw$scaled.cadd = as.numeric(sapply(exome.raw$cadd,function(x) unlist(strsplit(x,","))[2]))

I notice that the exome.raw data table has a CADD column but no cadd column. Could I change exome.raw$cadd to exome.raw$CADD?

JonathanRios1 commented 7 years ago

It seems there are a number of errors in the script(s). Is there a plan to update with a new download that contains updated scripts?

dmwebber commented 7 years ago

I agree with martiliasf that "cadd" on line 28 should be "CADD"

exome.raw$scaled.cadd = as.numeric(sapply(exome.raw$cadd,function(x) unlist(strsplit(x,","))[2]))

The script progresses further after changing CADD to uppercase and then halts with the following output:

Error in [.data.frame(exome.raw, , i) : undefined columns selected
Calls: substr -> [ -> [.data.frame
Execution halted

JonathanRios1 commented 7 years ago

Thanks. But does the strsplit call expect the value to be comma-delimited? In our data, CADD is a single value, with no comma and no second value after one.

I might be interpreting the code incorrectly. Apologies if so.

Also, it seems there might be an inadvertent "psap" throughout the script:

(from the family*.sh file) bash ${PSAP_PATH}psap/annotate_PSAP.sh ${OUTFILE}.avinput $PED_FILE

From what I can see, there is no psap sub-directory. Is this an issue?

Thanks for your help, we are excited to apply the software for a study about to go out.

Jonathan

(Sent by email in reply to dmwebber's comment, received Wednesday, 14 Dec 2016, 2:51 PM.)

This email did not originate from a TSRHC email address. If you do not know or trust the sender, do not click on any links in this email, open any attachments, or disclose any sensitive information, such as your password.


Texas Scottish Rite Hospital for Children is one of the nation's leading pediatric centers for the treatment of orthopedic conditions, certain related neurological disorders and learning disorders, such as dyslexia. This email transmission and/or its attachments may contain confidential health information, intended only for the use of the individual or entity named above.

The authorized recipient of this information is prohibited from disclosing it to any other party unless required to do so by law and is required to delete/destroy the information after its stated need has been fulfilled. If you are not the intended recipient, any disclosure, copying, distribution or action taken in reliance on the contents of this email transmission is prohibited. If you have received this information in error, please notify the sender immediately and delete this information.

We appreciate your efforts to protect the children's confidential information.

awilfert commented 7 years ago

Hello all,

Thank you for letting us know about these issues. martiliasf, to answer your questions:

1) You can comment out the code in line 38; however, if you do, you must invoke the code from the directory where your input files are located. If the line is left uncommented, any error it outputs will not affect how the code runs.

2-3) The errors in line 71 and in line 9 of the R script were due to typos introduced during another update and have been corrected.

4) The error in line 26 of the R code was caused by a recent update to the ANNOVAR output file format that we were unaware of. It changed the whole-genome CADD output from a single comma-delimited column to two separate columns. We just finished adding some exception handling to make the code compatible with this new file format.
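To check which of the two CADD layouts described above an annotated file uses, its header can be inspected before running the R script. This is a rough sketch, assuming a tab-delimited file whose first line is the header. The function name `cadd_layout` is made up for illustration, and the column names `CADD` and `CADD_Phred` are taken from the `names(exome.raw)` output posted later in this thread.

```shell
#!/bin/sh
# Hypothetical diagnostic: read a tab-delimited annotated file on stdin and
# report how the CADD annotation appears in its header line.
#   two-column    -> separate CADD and CADD_Phred columns (newer layout)
#   single-column -> one cadd/CADD column only (older, comma-delimited layout)
#   missing       -> no CADD column found
cadd_layout() {
    head -n 1 | awk 'BEGIN { FS = "\t" }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == "CADD_Phred") phred = 1
            if (tolower($i) == "cadd") cadd = 1
        }
        if (cadd && phred) print "two-column"
        else if (cadd)     print "single-column"
        else               print "missing"
    }'
}

# Demo on an inline header; a real run would redirect the annotated file in.
printf 'Chr\tStart\tCADD\tCADD_Phred\n' | cadd_layout   # prints "two-column"
```

The check only looks at column names, so it does not confirm that the values in a single-column file actually contain a comma.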

Jonathan, you don't need to include the psap/ directory in the path to the psap file. We've hard-coded that for you.

All of these issues have been corrected and updated code is now available.

Sorry for the delay in addressing these issues, and thank you for your patience.

JonathanRios1 commented 7 years ago

Hi Amy,

Thanks so much for your effort to make adjustments. I am excited to try it out. Will download the new version and test.

Thanks again. Have a Merry Christmas.

Jonathan

(Sent by email in reply to Amy Wilfert's message of 12/21/16, 1:46 PM.)


JonathanRios1 commented 7 years ago

Hi Amy,

I seem to be getting an error in script RScripts/apply_popStat_individual.R at line 70. Error is:

Error in [.data.frame(exome.raw, af) : undefined columns selected

Unfortunately, I am not familiar enough to decode what is being done here that might be causing the error. Below are a few things that might help to show you the files I am working with:

names(exome.raw)
 [1] "Chr" "Start"
 [3] "End" "Ref"
 [5] "Alt" "Func.wgEncodeGencodeBasicV19"
 [7] "Gene.wgEncodeGencodeBasicV19" "GeneDetail.wgEncodeGencodeBasicV19"
 [9] "ExonicFunc.wgEncodeGencodeBasicV19" "AAChange.wgEncodeGencodeBasicV19"
[11] "mac63kFreq_ALL" "esp6500si_all"
[13] "1000g2014sep_all" "snp137"
[15] "CADD" "CADD_Phred"
[17] "#CHROM" "POS"
[19] "ID" "REF"
[21] "ALT" "QUAL"
[23] "FILTER" "INFO"
[25] "FORMAT" "563069"
af
[1] 563069
names(exome.raw)[26]
[1] "563069"
names(exome.raw)[26]==af
[1] TRUE

Thanks for all of your help. I have been able to run the family analysis just fine.

Jonathan

JonathanRios1 commented 7 years ago

After a bit of trial and error, it seems that PSAP may not like sample IDs that start with numbers. I used a VCF and PED file that had an all numeric sample ID (such as 12345) and it failed as above. But when using the sample file and changing the sample ID in both the VCF and PED file to 'Sample_12345', it ran fine. Just FYI. Thanks again. Jonathan
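The workaround above can be scripted rather than done by hand. In the console output posted earlier, `af` prints without quotes, which suggests the ID was read as a number, so `exome.raw[af]` would be treated as a column position rather than a column name. That is an inference from the output, not something confirmed in this thread. The sketch below prefixes IDs that start with a digit in both files. It assumes a tab-delimited VCF with samples from column 10 of the `#CHROM` line, and a tab-delimited PED with individual, father, and mother IDs in columns 2 to 4 and `0` for a missing parent. The function names and the `Sample_` prefix are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: prefix sample IDs that start with a digit in the
# VCF header line (#CHROM ...). Reads a VCF on stdin, writes to stdout.
prefix_vcf_samples() {
    awk -v p="$1" 'BEGIN { FS = OFS = "\t" }
        /^#CHROM/ {
            # Sample columns start at field 10 in a VCF header.
            for (i = 10; i <= NF; i++) if ($i ~ /^[0-9]/) $i = p $i
        }
        { print }'
}

# Hypothetical helper: apply the same prefix to the individual, father and
# mother ID columns (2-4) of a PED file, leaving "0" (missing parent) alone.
prefix_ped_ids() {
    awk -v p="$1" 'BEGIN { FS = OFS = "\t" }
        {
            for (i = 2; i <= 4; i++)
                if ($i ~ /^[0-9]/ && $i != "0") $i = p $i
            print
        }'
}

# Demo with made-up one-line inputs.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t12345\n' |
    prefix_vcf_samples Sample_
printf 'FAM1\t12345\t0\t0\t1\t2\n' | prefix_ped_ids Sample_
```

Using the same prefix for both files keeps the VCF and PED sample IDs matched, which is what the manual fix relied on.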

dmwebber commented 7 years ago

Amy, thanks so much for revising the script and adjusting for changes in ANNOVAR. We initially had some trouble running the individual and family analyses, but it appears the problem may have been caused by a couple of samples that did not have enough harmful variants to generate a popStat. Once the offending samples were removed, the program ran smoothly. Thanks for your efforts, Daniel