CFSAN-Biostatistics / shigatyper

CFSAN Shigella Typing Pipeline
Other
14 stars 6 forks source link

`-n sample_name` prefix not honored when writing output files #13

Closed kapsakcj closed 1 year ago

kapsakcj commented 1 year ago

When I use the -n sample_name option with shigatyper 2.0.1, it does not seem to honor the string that I feed into the option.

Instead it seems to use the FASTQ file name prefix as the prefix for the shigatyper output files:

$ docker run --rm -v $PWD:/data -u $(id -u):$(id -g) staphb/shigatyper:2.0.1 shigatyper --R1 sample001_L001_R1_001.fastq.gz --R2 sample001_L001_R2_001.fastq.gz -n TEST_SAMPLE_ID
[CRITICAL::ShigaTyper] ShigaTyper v2.0.0
[M::mm_idx_gen::0.016*1.13] collected minimizers
[M::mm_idx_gen::0.025*1.47] sorted minimizers
[M::main::0.032*1.38] loaded/built the index for 95 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95
[M::mm_idx_stat::0.033*1.36] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -d /tmp/tmpcvmo38fk/ShigellaRef5.mmi /usr/local/lib/python3.8/dist-packages/ShigaTyper-2.0.1-py3.8.egg/shigatyper/resources/ShigellaRef5.fasta
[M::main] Real time: 0.036 sec; CPU: 0.047 sec; Peak RSS: 0.006 GB
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.007*1.14] loaded/built the index for 95 target sequence(s)
[M::mm_mapopt_update::0.007*1.14] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95
[M::mm_idx_stat::0.008*1.12] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664
[M::worker_pipeline::2.899*1.93] mapped 175312 sequences
[M::worker_pipeline::4.339*2.24] mapped 175144 sequences
[M::worker_pipeline::5.663*2.55] mapped 174848 sequences
[M::worker_pipeline::6.413*2.45] mapped 155712 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -ax sr /tmp/tmpcvmo38fk/ShigellaRef5.mmi /data/sample001_L001_R1_001.fastq.gz /data/sample001_L001_R2_001.fastq.gz
[M::main] Real time: 6.416 sec; CPU: 15.712 sec; Peak RSS: 0.352 GB
[mpileup] 1 samples in 1 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 1 samples in 1 input files
[CRITICAL::ShigaTyper.run.report] ** sample001 is predicted to be Shigella boydii serotype 14.**
[CRITICAL::ShigaTyper.run.report] this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.
[CRITICAL::ShigaTyper.run.report] Writing hits to sample001-hits.tsv
[CRITICAL::ShigaTyper.run.report] Writing final serotype prediction to sample001.tsv
[CRITICAL::ShigaTyper.run.report] Complete in  9.017 milliseconds.
        Hit     Number of reads Length Covered  reference length        % covered       Number of variants      % accuracy
0       ipaH_c  597     775     780     99.4    6.0     99.2
1       ipaB    140     1384    1743    79.4    13.0    99.1
2       Sb14_wzx        219     1441    1458    98.8    0.0     100.0
3       Sb14_wzy        179     1146    1212    94.6    0.0     100.0
sample  prediction      ipaB    notes
sample001       Shigella boydii serotype 14     +       this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

$ ls
sample001-hits.tsv  sample001.tsv  sample001_L001_R1_001.fastq.gz  sample001_L001_R2_001.fastq.gz

In this example I would expect the 2 output files to be named: TEST_SAMPLE_ID-hits.tsv and TEST_SAMPLE_ID.tsv, but it seems to be using the FASTQ filename prefix to write the output files due to this line (I think?):

https://github.com/CFSAN-Biostatistics/shigatyper/blob/7865ef17cbac4f5fef91d5bb0dc0785b014d7f28/shigatyper/shigatyper.py#L476

crashfrog commented 1 year ago

Hi Curtis,

Yeah, looks like a bug. I’m currently traveling back from ASM-NGS so I’ll look into this more closely in a couple of days.

Justin On Oct 21, 2022 at 2:45 PM -0400, Curtis Kapsak @.***>, wrote:

When I use the -n sample_name option with shigatyper 2.0.1, it does not seem to honor the string that I feed into the option. Instead it seems to use the FASTQ file name prefix as the prefix for the shigatyper output files: $ docker run --rm -v $PWD:/data -u $(id -u):$(id -g) staphb/shigatyper:2.0.1 shigatyper --R1 sample001_L001_R1_001.fastq.gz --R2 sample001_L001_R2_001.fastq.gz -n TEST_SAMPLE_ID [CRITICAL::ShigaTyper] ShigaTyper v2.0.0 [M::mm_idx_gen::0.0161.13] collected minimizers [M::mm_idx_gen::0.0251.47] sorted minimizers [M::main::0.0321.38] loaded/built the index for 95 target sequence(s) [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95 [M::mm_idx_stat::0.0331.36] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664 [M::main] Version: 2.24-r1122 [M::main] CMD: minimap2 -d /tmp/tmpcvmo38fk/ShigellaRef5.mmi /usr/local/lib/python3.8/dist-packages/ShigaTyper-2.0.1-py3.8.egg/shigatyper/resources/ShigellaRef5.fasta [M::main] Real time: 0.036 sec; CPU: 0.047 sec; Peak RSS: 0.006 GB [WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index. [M::main::0.0071.14] loaded/built the index for 95 target sequence(s) [M::mm_mapopt_update::0.0071.14] mid_occ = 1000 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95 [M::mm_idx_stat::0.0081.12] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664 [M::worker_pipeline::2.8991.93] mapped 175312 sequences [M::worker_pipeline::4.3392.24] mapped 175144 sequences [M::worker_pipeline::5.6632.55] mapped 174848 sequences [M::worker_pipeline::6.413*2.45] mapped 155712 sequences [M::main] Version: 2.24-r1122 [M::main] CMD: minimap2 -ax sr /tmp/tmpcvmo38fk/ShigellaRef5.mmi /data/sample001_L001_R1_001.fastq.gz /data/sample001_L001_R2_001.fastq.gz [M::main] Real time: 6.416 sec; CPU: 15.712 sec; Peak RSS: 0.352 GB [mpileup] 1 samples in 1 input files Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid [warning] samtools mpileup option g is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 1 samples in 1 input files [CRITICAL::ShigaTyper.run.report] sample001 is predicted to be Shigella boydii serotype 14. [CRITICAL::ShigaTyper.run.report] this strain is ipaB+, suggesting that it retains the virulent invasion plasmid. [CRITICAL::ShigaTyper.run.report] Writing hits to sample001-hits.tsv [CRITICAL::ShigaTyper.run.report] Writing final serotype prediction to sample001.tsv [CRITICAL::ShigaTyper.run.report] Complete in 9.017 milliseconds. Hit Number of reads Length Covered reference length % covered Number of variants % accuracy 0 ipaH_c 597 775 780 99.4 6.0 99.2 1 ipaB 140 1384 1743 79.4 13.0 99.1 2 Sb14_wzx 219 1441 1458 98.8 0.0 100.0 3 Sb14_wzy 179 1146 1212 94.6 0.0 100.0 sample prediction ipaB notes sample001 Shigella boydii serotype 14 + this strain is ipaB+, suggesting that it retains the virulent invasion plasmid.

$ ls sample001-hits.tsv sample001.tsv sample001_L001_R1_001.fastq.gz sample001_L001_R2_001.fastq.gz In this example I would expect the 2 output files to be named: TEST_SAMPLE_ID-hits.tsv and TEST_SAMPLE_ID.tsv, but it seems to be using the FASTQ filename prefix to write the output files due to this line (I think?): https://github.com/CFSAN-Biostatistics/shigatyper/blob/7865ef17cbac4f5fef91d5bb0dc0785b014d7f28/shigatyper/shigatyper.py#L476 — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

kapsakcj commented 1 year ago

Thanks for looking into it. Safe travels!

crashfrog commented 1 year ago

Fixed in 2.0.2.

kapsakcj commented 1 year ago

Thanks on this issue as well. Can confirm it now honors the --name option in 2.0.2:

# shigatyper --R1 /data/SRR7738178_1.fastq.gz --R2 /data/SRR7738178_2.fastq.gz --name TESTTESTTEST
[CRITICAL::ShigaTyper] ShigaTyper v2.0.2
[M::mm_idx_gen::0.008*1.21] collected minimizers
[M::mm_idx_gen::0.013*1.70] sorted minimizers
[M::main::0.016*1.56] loaded/built the index for 95 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95
[M::mm_idx_stat::0.017*1.54] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -d /tmp/tmpwmlmyohf/ShigellaRef5.mmi /usr/local/lib/python3.8/dist-packages/ShigaTyper-2.0.2-py3.8.egg/shigatyper/resources/ShigellaRef5.fasta
[M::main] Real time: 0.019 sec; CPU: 0.029 sec; Peak RSS: 0.006 GB
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.006*1.13] loaded/built the index for 95 target sequence(s)
[M::mm_mapopt_update::0.007*1.13] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 95
[M::mm_idx_stat::0.007*1.12] distinct minimizers: 21938 (99.78% are singletons); average occurrences: 1.002; average spacing: 5.442; total length: 119664
[M::worker_pipeline::2.348*2.44] mapped 206814 sequences
[M::worker_pipeline::3.465*2.85] mapped 207044 sequences
[M::worker_pipeline::4.542*3.05] mapped 206570 sequences
[M::worker_pipeline::5.530*3.17] mapped 206510 sequences
[M::worker_pipeline::6.117*2.98] mapped 173320 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -ax sr /tmp/tmpwmlmyohf/ShigellaRef5.mmi /data/SRR7738178_1.fastq.gz /data/SRR7738178_2.fastq.gz
[M::main] Real time: 6.123 sec; CPU: 18.212 sec; Peak RSS: 0.360 GB
[mpileup] 1 samples in 1 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 1 samples in 1 input files
[CRITICAL::ShigaTyper.run.report] ** TESTTESTTEST is predicted to be Shigella sonnei form II.**
[CRITICAL::ShigaTyper.run.report] 
[CRITICAL::ShigaTyper.run.report] Writing hits to TESTTESTTEST-hits.tsv
        Hit     Number of reads Length Covered  reference length        % covered       Number of variants      % accuracy
0       ipaH_c  672     779     780     99.9    3.0     99.6
1       EclacY  49      267     1254    21.3    1.0     99.6
2       cadA    529     2100    2143    98.0    22.0    99.0
3       Ss_methylase    307     1830    1833    99.8    0.0     100.0
[CRITICAL::ShigaTyper.run.report] Writing final serotype prediction to TESTTESTTEST.tsv
sample  prediction      ipaB    notes
TESTTESTTEST    Shigella sonnei form II -
[CRITICAL::ShigaTyper.run.report] Complete in  4.679 milliseconds.

root@d458af82b7d6:/data# ls TESTTESTTEST*
TESTTESTTEST-hits.tsv  TESTTESTTEST.tsv