Closed dfornika closed 4 years ago
Awesome effort @dfornika
I am still looking through the changes, and I must confess this looks much tidier. Did you by chance do any debugging to confirm that the changes as a whole perform as expected?
Also, the snpEff tool might pose a small challenge when porting to/from anaconda. As it stands, we need to build a customized M. tuberculosis H37Rv database into the tool; if we run it with the defaults it will not properly analyse MTBC data, because the default database does not include the specific H37Rv reference. I am wondering what the workaround will be. Maybe we create the customized snpEff database and find a way to point to it in anaconda. What do you think?
Hi @mezewudo, thanks for taking the time to review. I haven't been able to do thorough testing yet. I had considered using the dataset described in the nextstrain tb tutorial (https://nextstrain.org/docs/getting-started/tb-tutorial#download-data). If you could recommend another public dataset that might be useful for testing, that would be helpful.
If there is a public dataset for the customized M. tuberculosis H37Rv available that can be downloaded using command-line tools like wget or curl, then it may be possible to include the database-building step as part of the build process. Otherwise, maybe a helper script could be produced that the user could run after installation that would download and build the snpEff database.
Dan, I will try to put a couple of MTBC input fastq files in a public repository and share the link with you, so you can use them for testing.
I think the second suggestion on the snpEff database might be more feasible, but I will look through their manual to see which approach might be easier.
@dfornika ,
1) I have uploaded a pair of sample Mycobacterium tuberculosis input fastq files that will help with testing the changes. You will find them at these links:
https://figshare.com/articles/ERR552106_1_fastq_gz/7887236
https://figshare.com/articles/ERR552106_2_fastq_gz/7887242
2) For the snpEff H37Rv reference database to be included in version 4.1 of the tool, to enable annotation of MTBC, I have also included a link to a binary version of the reference genome annotation: https://figshare.com/articles/snpEffectPredictor_bin/7887230
Essentially, after installing SnpEff version 4.1, the user will need to download this binary file, create a folder called NC_000962 in the data folder of snpEff, and place the binary file in that NC_000962 folder.
I guess when you try it out, you will be able to tell whether it is better to write a small script to achieve this or to document some quick instructions on how to go about it.
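If the script route is chosen, the placement step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `install_snpeff_database` and its arguments are hypothetical, the binary is assumed to have been downloaded by hand already (the figshare link above is a landing page, not a direct download), and the target file name `snpEffectPredictor.bin` is the name snpEff expects inside a genome folder.

```python
import shutil
from pathlib import Path

# Folder name taken from the instructions in this thread.
GENOME_NAME = "NC_000962"
# File name snpEff looks for inside each genome folder.
PREDICTOR_FILENAME = "snpEffectPredictor.bin"


def install_snpeff_database(downloaded_bin, snpeff_data_dir):
    """Copy an already-downloaded predictor binary into <data>/NC_000962/.

    Both arguments are hypothetical parameters chosen for this sketch:
    downloaded_bin is the path of the file fetched from figshare, and
    snpeff_data_dir is the "data" folder of the local snpEff installation.
    """
    source = Path(downloaded_bin)
    if not source.is_file():
        raise FileNotFoundError(f"Database binary not found: {source}")

    # Create data/NC_000962 if it does not exist yet.
    genome_dir = Path(snpeff_data_dir) / GENOME_NAME
    genome_dir.mkdir(parents=True, exist_ok=True)

    # Copy rather than move, so the original download is left untouched.
    destination = genome_dir / PREDICTOR_FILENAME
    shutil.copyfile(source, destination)
    return destination
```

Inside a conda environment the location of the snpEff data folder depends on how the package was installed, so the caller would need to locate it first.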
Hi @mezewudo, sorry for disappearing for a couple of weeks. I realized that I had been a little bit more heavy-handed than necessary with some of the refactoring that I had done for this pull-request. Specifically, I had removed several attributes from the snp object in UVP.py (now uvp/snp.py) that are set to store paths to various tools (self.__bwa, self.__samtools, self.__kraken, etc.).
I realized that it would be less disruptive to leave those attributes in place and simply re-define their values to the names of the various tools, which should be available on the $PATH if they are supplied by a conda environment.
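A minimal sketch of that idea, under assumptions: `PipelineTools` and `find_missing_tools` are hypothetical names for illustration and are not the actual code in uvp/snp.py. The attributes hold bare tool names, and `shutil.which` is used to check that each one resolves on the PATH.

```python
import shutil

# Tool names mentioned in this thread; the real pipeline uses more.
TOOL_NAMES = ("bwa", "samtools", "kraken")


class PipelineTools:
    """Hypothetical stand-in for the tool attributes on the snp object."""

    def __init__(self):
        # Bare names instead of absolute paths: the shell (or subprocess)
        # resolves them through PATH, which the conda environment sets up.
        self.bwa = "bwa"
        self.samtools = "samtools"
        self.kraken = "kraken"


def find_missing_tools(tool_names=TOOL_NAMES, search_path=None):
    """Return the names that cannot be found as executables.

    search_path=None means "use the current PATH environment variable".
    """
    return [
        name for name in tool_names
        if shutil.which(name, path=search_path) is None
    ]
```

Calling `find_missing_tools()` at start-up would give an early, readable error if the conda environment has not been activated.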
I've downloaded your test files and will update again soon with the results of my test run(s).
Hi @mezewudo. I've been doing some testing using your datasets. It's looking good, but in order to complete a run I'll need a 'known sites' snps.vcf file. Do you have one available?
Hi Dan,
Here is a link to the snps.vcf file:
https://figshare.com/articles/Mycobacterium_tuberculosis_variations_list/5987341
The pipeline mostly seems to be running well but I am running into an issue with GATK BaseRecalibrator. It's telling me that it's running out of memory despite giving it up to 24GB of memory.
The stderr log does say that it's picking up the JAVA_TOOL_OPTIONS=-Xmx24g environment variable but I'm not certain that it's having the desired effect.
---[ BaseRecalibrator ]---
Command:
env JAVA_TOOL_OPTIONS="-Xmx24g" gatk -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Results/output/t
Standard Error:
Picked up JAVA_TOOL_OPTIONS: "-Xmx24g"
INFO 14:49:27,518 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:49:27,520 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 14:49:27,521 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 14:49:27,521 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 14:49:27,521 HelpFormatter - [Mon Apr 15 14:49:27 PDT 2019] Executing on Linux 3.10.0-229.14.1.el7.x86_64 amd64
INFO 14:49:27,521 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_192-b01 JdkDeflater
INFO 14:49:27,524 HelpFormatter - Program Args: -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Resu
INFO 14:49:27,536 HelpFormatter - Executing as dfornika@sabin.jgardy.bcgsc.ca on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01.
INFO 14:49:27,537 HelpFormatter - Date/Time: 2019/04/15 14:49:27
INFO 14:49:27,537 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:49:27,537 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:49:27,587 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:49:27,697 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 14:49:27,703 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:49:27,740 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 14:49:28,160 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine
INFO 14:49:28,304 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 14:49:28,310 GenomeAnalysisEngine - Done preparing for traversal
INFO 14:49:28,311 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 14:49:28,311 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 14:49:28,312 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 14:49:28,345 BaseRecalibrator - The covariates being used here:
INFO 14:49:28,346 BaseRecalibrator - ReadGroupCovariate
INFO 14:49:28,346 BaseRecalibrator - QualityScoreCovariate
INFO 14:49:28,346 BaseRecalibrator - ContextCovariate
INFO 14:49:28,346 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 14:49:28,346 BaseRecalibrator - CycleCovariate
INFO 14:49:28,365 ReadShardBalancer$1 - Loading BAM index data
INFO 14:49:28,366 ReadShardBalancer$1 - Done loading BAM index data
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR
I suspect that this comment might offer a clue:
https://gatkforums.broadinstitute.org/gatk/discussion/comment/8676/#Comment_8676
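One way to check whether the -Xmx24g setting actually reaches the JVM is to ask Java for its resolved flags under the same environment variable. The sketch below is an assumption-laden illustration: `java_env` and `report_max_heap` are hypothetical helper names, and it queries a plain `java` process rather than the `gatk` launcher itself.

```python
import os
import subprocess


def java_env(max_heap="24g", base_env=None):
    """Return a copy of the environment with JAVA_TOOL_OPTIONS=-Xmx<max_heap>."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_TOOL_OPTIONS"] = f"-Xmx{max_heap}"
    return env


def report_max_heap(max_heap="24g"):
    """Return the JVM flag lines that mention MaxHeapSize.

    Requires a java executable on PATH. -XX:+PrintFlagsFinal prints the
    final value of every JVM flag to standard output.
    """
    result = subprocess.run(
        ["java", "-XX:+PrintFlagsFinal", "-version"],
        env=java_env(max_heap),
        capture_output=True,
        text=True,
        check=True,
    )
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if "MaxHeapSize" in line
    ]
```

JVM options given on the java command line take precedence over JAVA_TOOL_OPTIONS, so one possibility to rule out is that the gatk launcher script passes its own -Xmx value.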
I've tried running the pipeline using the reference and vcf files from the nextstrain tb tutorial data (https://nextstrain.org/docs/getting-started/tb-tutorial#download-data) and the BaseRecalibrator step did complete successfully this time.
---[ BaseRecalibrator ]---
Command:
env JAVA_TOOL_OPTIONS="-Xmx24g" gatk -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Results/output/t
Standard Output:
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------
Standard Error:
Picked up JAVA_TOOL_OPTIONS: "-Xmx24g"
INFO 13:49:14,408 HelpFormatter - ----------------------------------------------------------------------------------
INFO 13:49:14,409 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 13:49:14,410 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 13:49:14,410 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 13:49:14,410 HelpFormatter - [Tue Apr 16 13:49:14 PDT 2019] Executing on Linux 3.10.0-229.14.1.el7.x86_64 amd64
INFO 13:49:14,410 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_192-b01 JdkDeflater
INFO 13:49:14,413 HelpFormatter - Program Args: -T BaseRecalibrator -I Results/output/tmp/GATK/GATK_sdrc.bam -R Results/output/tmp/bwa/index/ref.fa --knownSites /home/dfornika/code/UVP/uvp/data/snps.vcf -o Resu
INFO 13:49:14,426 HelpFormatter - Executing as dfornika@sabin.jgardy.bcgsc.ca on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01.
INFO 13:49:14,426 HelpFormatter - Date/Time: 2019/04/16 13:49:14
INFO 13:49:14,427 HelpFormatter - ----------------------------------------------------------------------------------
INFO 13:49:14,427 HelpFormatter - ----------------------------------------------------------------------------------
INFO 13:49:14,450 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:49:14,587 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 13:49:14,594 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:49:14,622 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 13:49:16,010 RMDTrackBuilder - Writing Tribble index to disk for file /home/dfornika/code/UVP/uvp/data/snps.vcf.idx
INFO 13:49:16,036 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 4 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine
INFO 13:49:16,095 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 13:49:16,099 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:49:16,100 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:49:16,100 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 13:49:16,100 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 13:49:16,124 BaseRecalibrator - The covariates being used here:
INFO 13:49:16,125 BaseRecalibrator - ReadGroupCovariate
INFO 13:49:16,125 BaseRecalibrator - QualityScoreCovariate
INFO 13:49:16,125 BaseRecalibrator - ContextCovariate
INFO 13:49:16,125 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 13:49:16,125 BaseRecalibrator - CycleCovariate
INFO 13:49:16,128 ReadShardBalancer$1 - Loading BAM index data
INFO 13:49:16,129 ReadShardBalancer$1 - Done loading BAM index data
INFO 13:49:46,203 ProgressMeter - MTB_anc:2702205 700009.0 30.0 s 43.0 s 61.3% 48.0 s 18.0 s
INFO 13:49:59,798 BaseRecalibrator - Calculating quantized quality scores...
INFO 13:49:59,830 BaseRecalibrator - Writing recalibration report...
INFO 13:50:00,698 BaseRecalibrator - ...done!
INFO 13:50:00,698 BaseRecalibrator - BaseRecalibrator was able to recalibrate 1302711 reads
INFO 13:50:00,699 ProgressMeter - done 1302715.0 44.0 s 34.0 s 100.0% 44.0 s 0.0 s
INFO 13:50:00,699 ProgressMeter - Total runtime 44.60 secs, 0.74 min, 0.01 hours
INFO 13:50:00,699 MicroScheduler - 41447 reads were filtered out during the traversal out of approximately 1344162 total reads (3.08%)
INFO 13:50:00,699 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 13:50:00,700 MicroScheduler - -> 10455 reads (0.78% of total) failing DuplicateReadFilter
INFO 13:50:00,700 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 13:50:00,700 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 13:50:00,700 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 13:50:00,700 MicroScheduler - -> 30992 reads (2.31% of total) failing MappingQualityZeroFilter
INFO 13:50:00,701 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 13:50:00,701 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
I will try next week to pull down your version of the tool and run it locally to see how it behaves. I may have to reach out to you for the copy, as the proposed changes are on the master branch and there is no dev branch.
Dan,
Could you send me a link to the repo of this reworked version that I can clone and test locally? My hope is to install and test it on my end to check for any issues with BaseRecalibrator etc. If everything pans out well, it will just be a question of accepting and merging all of the changes at once.
My changes are in the portable branch of this repository: https://github.com/dfornika/UVP.git
I think you should be able to clone that repository and checkout the portable branch.
I've included instructions for setting up a conda environment with all dependencies on the README.md of that branch:
https://github.com/dfornika/UVP/blob/portable/README.md
Those instructions don't include the snpEff database step we discussed above, so that would need to be done separately. Thanks for your co-operation and let me know if you run into any issues either here or by email: dan.fornika [at] bccdc.ca
I'd like to propose a few changes that may help with #7.
- An environment.yml file that lists all dependencies so they can be pulled from anaconda.org (NOTE: swapped fastqValidator with fqtools validate because fastqValidator isn't available on bioconda. I'm attempting to add it here: https://github.com/bioconda/bioconda-recipes/pull/12319)
- [...] PATH by the conda environment.