CPTR-ReSeqTB / UVP

Mycobacterium tuberculosis next generation sequence analysis
MIT License

Portability & Project Structure #19

Closed dfornika closed 4 years ago

dfornika commented 5 years ago

I'd like to propose a few changes that may help with #7.

mezewudo commented 5 years ago

Awesome effort @dfornika

I am still looking through the changes, but I must confess this looks much tidier. Did you by chance do any debugging to confirm that the changes as a whole perform as expected?

Also, the snpEff tool might pose a small challenge when porting to/from Anaconda. As it stands, we need to build a customized M. tuberculosis H37Rv database into the tool. If we run it with the defaults, it will not properly analyse MTBC data, because the default database does not include the specific H37Rv reference. I am wondering what the workaround will be. Maybe we could create the customized snpEff database and find a way to point to it in Anaconda. What do you think?

dfornika commented 5 years ago

Hi @mezewudo, thanks for taking the time to review. I haven't been able to do thorough testing yet. I had considered using the dataset described in the nextstrain tb tutorial. If you could recommend another public dataset that might be useful for testing, that would be helpful.

If there is a public dataset for the customized M. tuberculosis H37Rv available that can be downloaded using command-line tools like wget or curl, then it may be possible to include the database-building step as part of the build process. Otherwise, maybe a helper script could be produced that the user could run after installation that would download and build the snpEff database.

mezewudo commented 5 years ago

Dan, I will try to put a couple of MTBC input fastq files in a public repository and share the link with you, so that you can use them for testing.

I think the second suggestion on the snpEff database might be more feasible, but I will look through their manual to see which approach might be easier.

mezewudo commented 5 years ago

@dfornika,

1) I have uploaded a pair of sample Mycobacterium tuberculosis input fastq files that will help with testing the changes. You will find them at these links:

https://figshare.com/articles/ERR552106_1_fastq_gz/7887236
https://figshare.com/articles/ERR552106_2_fastq_gz/7887242

2) To enable annotation of MTBC data, the snpEff H37Rv reference database needs to be added to version 4.1 of the tool. I have also included a link to a binary version of the reference genome annotation: https://figshare.com/articles/snpEffectPredictor_bin/7887230

Essentially, after installing SnpEff version 4.1, the user will need to download this binary file, create a folder called NC_000962 in the data folder of snpEff, and place the binary file in that NC_000962 folder.

I guess when you try it out, you will be able to tell whether it is better to write a small script to achieve this or to document some quick instructions on how to go about it.
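
For reference, the manual step described above could be scripted roughly as follows. This is only a minimal sketch, not part of the pull request: the function name `install_snpeff_database` and its parameters are hypothetical, it assumes the binary has already been downloaded from the figshare link, and it assumes the file should be stored as `snpEffectPredictor.bin` inside the genome folder. Whether `snpEff.config` also needs an entry for NC_000962 is not covered in this thread.

```python
import shutil
from pathlib import Path


def install_snpeff_database(bin_file, snpeff_data_dir, genome="NC_000962"):
    """Copy an already-downloaded database binary into <data dir>/<genome>/.

    All names here are illustrative assumptions, not UVP or snpEff code.
    """
    bin_file = Path(bin_file)
    if not bin_file.is_file():
        raise FileNotFoundError(f"database file not found: {bin_file}")

    # Create the genome folder inside the snpEff data folder if needed.
    target_dir = Path(snpeff_data_dir) / genome
    target_dir.mkdir(parents=True, exist_ok=True)

    # Assumed target file name; adjust if the installation expects another.
    target = target_dir / "snpEffectPredictor.bin"
    shutil.copyfile(bin_file, target)
    return target
```

A user would call it once after installing snpEff, passing the path of the downloaded file and the path of the `data` folder of their snpEff installation.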

dfornika commented 5 years ago

Hi @mezewudo sorry for disappearing for a couple of weeks. I realized that I had been a little bit more heavy-handed than necessary with some of the refactoring that I had done for this pull-request. Specifically, I had removed several attributes from the snp object in UVP.py (now uvp/snp.py) that are set to store paths to various tools (self.__bwa, self.__samtools, self.__kraken, etc.).

I realized that it would be less disruptive to leave those attributes in place and simply re-define their values to the names of the various tools, which should be available on the $PATH if they are supplied by a conda environment.
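
To illustrate the idea of relying on bare tool names, here is a minimal sketch of a pre-flight check that reports which tools cannot be found on the `$PATH`. It is an assumption-laden example rather than code from the pull request: `TOOLS` and `find_missing_tools` are hypothetical names, and the tool list only includes names that appear in this thread.

```python
import shutil

# Bare executable names, as they would be set on the tool attributes.
# The list is illustrative and limited to tools mentioned in this thread.
TOOLS = ("bwa", "samtools", "kraken", "gatk")


def find_missing_tools(tools=TOOLS, search_path=None):
    """Return the names in `tools` that cannot be resolved to an executable.

    search_path=None makes shutil.which consult the current $PATH, which is
    where an activated conda environment places its executables.
    """
    return [name for name in tools if shutil.which(name, path=search_path) is None]
```

Running such a check before the pipeline starts would turn a late "command not found" failure into an early, readable message listing what is missing from the environment.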

I've downloaded your test files and will update again soon with the results of my test run(s).

dfornika commented 5 years ago

Hi @mezewudo. I've been doing some testing using your datasets. It's looking good, but in order to complete a run I'll need a 'known sites' snps.vcf file. Do you have one available?

mezewudo commented 5 years ago

Hi Dan,

Here is a link to the snps.vcf file:

https://figshare.com/articles/Mycobacterium_tuberculosis_variations_list/5987341

dfornika commented 5 years ago

The pipeline mostly seems to be running well but I am running into an issue with GATK BaseRecalibrator. It's telling me that it's running out of memory despite giving it up to 24GB of memory.

The stderr log does say that it's picking up the JAVA_TOOL_OPTIONS=-Xmx24g environment variable but I'm not certain that it's having the desired effect.

---[ BaseRecalibrator ]---
Command: 
env JAVA_TOOL_OPTIONS="-Xmx24g" gatk -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Results/output/t

Standard Error: 
Picked up JAVA_TOOL_OPTIONS: "-Xmx24g"
INFO  14:49:27,518 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  14:49:27,520 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 
INFO  14:49:27,521 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  14:49:27,521 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk 
INFO  14:49:27,521 HelpFormatter - [Mon Apr 15 14:49:27 PDT 2019] Executing on Linux 3.10.0-229.14.1.el7.x86_64 amd64 
INFO  14:49:27,521 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_192-b01 JdkDeflater 
INFO  14:49:27,524 HelpFormatter - Program Args: -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Resu
INFO  14:49:27,536 HelpFormatter - Executing as dfornika@sabin.jgardy.bcgsc.ca on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01. 
INFO  14:49:27,537 HelpFormatter - Date/Time: 2019/04/15 14:49:27 
INFO  14:49:27,537 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  14:49:27,537 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  14:49:27,587 GenomeAnalysisEngine - Strictness is SILENT 
INFO  14:49:27,697 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  14:49:27,703 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  14:49:27,740 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
INFO  14:49:28,160 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine 
INFO  14:49:28,304 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  14:49:28,310 GenomeAnalysisEngine - Done preparing for traversal 
INFO  14:49:28,311 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  14:49:28,311 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  14:49:28,312 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  14:49:28,345 BaseRecalibrator - The covariates being used here:  
INFO  14:49:28,346 BaseRecalibrator -   ReadGroupCovariate 
INFO  14:49:28,346 BaseRecalibrator -   QualityScoreCovariate 
INFO  14:49:28,346 BaseRecalibrator -   ContextCovariate 
INFO  14:49:28,346 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3 
INFO  14:49:28,346 BaseRecalibrator -   CycleCovariate 
INFO  14:49:28,365 ReadShardBalancer$1 - Loading BAM index data 
INFO  14:49:28,366 ReadShardBalancer$1 - Done loading BAM index data 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.6-0-g89b7209): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR
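
For context, the `env JAVA_TOOL_OPTIONS=...` prefix in the command above can also be expressed from Python by passing a modified environment to the child process. The following is a minimal sketch under stated assumptions: `run_with_java_options` is a hypothetical helper, not a function from UVP. It only shows that the variable reaches the child process; it says nothing about how the JVM then sizes its heap.

```python
import os
import subprocess
import sys


def run_with_java_options(command, java_options="-Xmx24g"):
    """Run `command` with JAVA_TOOL_OPTIONS set in the child's environment.

    Hypothetical helper for illustration; the parent environment is copied
    and left unmodified.
    """
    env = dict(os.environ, JAVA_TOOL_OPTIONS=java_options)
    return subprocess.run(command, env=env, capture_output=True, text=True)


# Stand-in child process that simply echoes the variable it received.
PROBE = [sys.executable, "-c", "import os; print(os.environ['JAVA_TOOL_OPTIONS'])"]
```

Calling `run_with_java_options(PROBE).stdout` returns the value seen by the child, which mirrors the "Picked up JAVA_TOOL_OPTIONS" line that the JVM prints in the log above.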

dfornika commented 5 years ago

I suspect that this comment might offer a clue:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/8676/#Comment_8676

dfornika commented 5 years ago

I've tried running the pipeline using the reference and vcf files from the nextstrain tb tutorial data and the BaseRecalibrator step did complete successfully this time.

---[ BaseRecalibrator ]---
Command: 
env JAVA_TOOL_OPTIONS="-Xmx24g" gatk -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Results/output/t

Standard Output: 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

Standard Error: 
Picked up JAVA_TOOL_OPTIONS: "-Xmx24g"
INFO  13:49:14,408 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  13:49:14,409 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 
INFO  13:49:14,410 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  13:49:14,410 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk 
INFO  13:49:14,410 HelpFormatter - [Tue Apr 16 13:49:14 PDT 2019] Executing on Linux 3.10.0-229.14.1.el7.x86_64 amd64 
INFO  13:49:14,410 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_192-b01 JdkDeflater 
INFO  13:49:14,413 HelpFormatter - Program Args: -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Resu
INFO  13:49:14,426 HelpFormatter - Executing as dfornika@sabin.jgardy.bcgsc.ca on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01. 
INFO  13:49:14,426 HelpFormatter - Date/Time: 2019/04/16 13:49:14 
INFO  13:49:14,427 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  13:49:14,427 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  13:49:14,450 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:49:14,587 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  13:49:14,594 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  13:49:14,622 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
INFO  13:49:16,010 RMDTrackBuilder - Writing Tribble index to disk for file /home/dfornika/code/UVP/uvp/data/snps.vcf.idx 
INFO  13:49:16,036 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 4 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine 
INFO  13:49:16,095 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  13:49:16,099 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:49:16,100 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:49:16,100 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:49:16,100 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  13:49:16,124 BaseRecalibrator - The covariates being used here:  
INFO  13:49:16,125 BaseRecalibrator -   ReadGroupCovariate 
INFO  13:49:16,125 BaseRecalibrator -   QualityScoreCovariate 
INFO  13:49:16,125 BaseRecalibrator -   ContextCovariate 
INFO  13:49:16,125 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3 
INFO  13:49:16,125 BaseRecalibrator -   CycleCovariate 
INFO  13:49:16,128 ReadShardBalancer$1 - Loading BAM index data 
INFO  13:49:16,129 ReadShardBalancer$1 - Done loading BAM index data 
INFO  13:49:46,203 ProgressMeter - MTB_anc:2702205    700009.0    30.0 s      43.0 s       61.3%    48.0 s      18.0 s 
INFO  13:49:59,798 BaseRecalibrator - Calculating quantized quality scores... 
INFO  13:49:59,830 BaseRecalibrator - Writing recalibration report... 
INFO  13:50:00,698 BaseRecalibrator - ...done! 
INFO  13:50:00,698 BaseRecalibrator - BaseRecalibrator was able to recalibrate 1302711 reads 
INFO  13:50:00,699 ProgressMeter -            done   1302715.0    44.0 s      34.0 s      100.0%    44.0 s       0.0 s 
INFO  13:50:00,699 ProgressMeter - Total runtime 44.60 secs, 0.74 min, 0.01 hours
INFO  13:50:00,699 MicroScheduler - 41447 reads were filtered out during the traversal out of approximately 1344162 total reads (3.08%) 
INFO  13:50:00,699 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  13:50:00,700 MicroScheduler -   -> 10455 reads (0.78% of total) failing DuplicateReadFilter 
INFO  13:50:00,700 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  13:50:00,700 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  13:50:00,700 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  13:50:00,700 MicroScheduler -   -> 30992 reads (2.31% of total) failing MappingQualityZeroFilter 
INFO  13:50:00,701 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
INFO  13:50:00,701 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
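
The thread does not establish why the earlier run failed, but since swapping in a different reference and vcf file made the step succeed, one thing worth ruling out is a mismatch between the contig names in the known-sites file and those in the reference (the successful log above reports the contig `MTB_anc`). The following is a minimal sketch of such a check, assuming uncompressed FASTA and VCF files; all function names are hypothetical and it is not part of UVP or GATK.

```python
def fasta_contigs(path):
    """Collect sequence names (first word after '>') from a FASTA file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def vcf_contigs(path):
    """Collect the CHROM column values from the data lines of a VCF file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            # Skip header lines ('##' and '#CHROM') and empty lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def contigs_missing_from_reference(vcf_path, fasta_path):
    """Return VCF contig names that do not occur in the reference FASTA."""
    return sorted(vcf_contigs(vcf_path) - fasta_contigs(fasta_path))
```

An empty result means every contig named in the vcf file exists in the reference; a non-empty result points at the names that would need to be reconciled before running BaseRecalibrator.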

mezewudo commented 5 years ago

I will try next week to pull down your version of the tool and run it locally to see how it behaves. I may have to reach back to you for a copy, as the proposed changes are on the master branch and there is no dev branch.

mezewudo commented 5 years ago

Dan,

Could you send me a link to the repo of this reworked version, so that I can clone it and test it locally? My hope is to install and test it on my end to check for any issues with BaseRecalibrator etc. If everything pans out well, it will just be a question of accepting and merging all of the changes at once.

dfornika commented 5 years ago

My changes are in the portable branch of this repository: https://github.com/dfornika/UVP.git

I think you should be able to clone that repository and check out the portable branch.

I've included instructions for setting up a conda environment with all dependencies on the README.md of that branch:

https://github.com/dfornika/UVP/blob/portable/README.md

Those instructions don't include the snpEff database step we discussed above, so that would need to be done separately. Thanks for your co-operation and let me know if you run into any issues either here or by email: dan.fornika [at] bccdc.ca