exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

TSV-writer appears to substitute any 'vcf' in file path #40

buske closed this issue 9 years ago

buske commented 9 years ago

It appears that anywhere the string vcf appears in the path of the output file, it is substituted with genes.tsv, even if it doesn't appear at the end (note that the vcf directory is changed to a genes.tsv directory which doesn't exist):

2015-02-09 02:24:01,429 INFO  de.charite.compbio.exomiser.core.writers.VcfResultsWriter [main] - VCF results written to file /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.
2015-02-09 02:24:01,431 ERROR de.charite.compbio.exomiser.core.writers.TsvGeneResultsWriter [main] - Unable to write results to file /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv.
java.nio.file.NoSuchFileException: /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv

julesjacobsen commented 9 years ago

I suspect that vcf is substituted in the VCF output path too; it just isn't noticeable there because vcf is being replaced with vcf.
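A minimal sketch of the suspected behaviour, assuming the writers derive their file name with a plain string replacement on the whole path. The class and method names here are hypothetical and are not taken from the Exomiser source; the second method shows one way to restrict the change to the trailing extension.

```java
final class SuffixSubstitutionSketch {

    // Suspected faulty behaviour (an assumption): every "vcf" in the path
    // is replaced, including the one in the directory name.
    static String replaceEverywhere(String outFile, String newSuffix) {
        return outFile.replace("vcf", newSuffix);
    }

    // Hypothetical alternative: only a trailing ".vcf" on the file name is swapped.
    static String replaceExtensionOnly(String outFile, String newSuffix) {
        if (outFile.endsWith(".vcf")) {
            return outFile.substring(0, outFile.length() - "vcf".length()) + newSuffix;
        }
        return outFile + "." + newSuffix;
    }

    public static void main(String[] args) {
        String outFile = "/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf";
        // Reproduces the path from the error log above.
        System.out.println(replaceEverywhere(outFile, "genes.tsv"));
        // Leaves the vcf directory untouched.
        System.out.println(replaceExtensionOnly(outFile, "genes.tsv"));
        // With "vcf" as the replacement the path is unchanged, so the VCF writer looks fine.
        System.out.println(replaceEverywhere(outFile, "vcf"));
    }
}
```

With the path from the log, the first call yields the non-existent genes.tsv directory, while the last call returns the input unchanged, which would explain why only the TSV writer fails.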

julesjacobsen commented 9 years ago

I can't replicate this from what you've given me.

In order to replicate the issue I created the following folder hierarchy:

C:\Users\jj8\Documents\test\vcf\pfeiffer\
C:\Users\jj8\Documents\test\vcf\pfeiffer\results\

This contains the file Pfeiffer.vcf

I analyse the vcf using this command:

java -Xms3G -Xmx4G -jar .\exomiser-cli-6.0.0.jar --vcf C:\Users\jj8\Documents\test\vcf\pfeiffer\Pfeiffer.vcf --prioritiser phive --out-file=C:\Users\jj8\Documents\test\vcf\pfeiffer\results\pfeiffer --out-format=HTML,VCF,TSV-GENE,TSV-VARIANT

and lo, when the analysis has finished I have the following files:

$ ls C:\Users\jj8\Documents\test\vcf\pfeiffer\results\

    Directory: C:\Users\jj8\Documents\test\vcf\pfeiffer\results

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---        09/02/2015     12:02     338184 pfeiffer.genes.tsv
-a---        09/02/2015     12:02   24857818 pfeiffer.html
-a---        09/02/2015     12:02    4634501 pfeiffer.variants.tsv
-a---        09/02/2015     12:02   11868447 pfeiffer.vcf

which is what was expected. Did you do something different?

buske commented 9 years ago

The difference is that the --out-file I provided had a .vcf suffix. I have to provide some suffix here, because otherwise Exomiser deletes whatever is after the last period and replaces it with .vcf (e.g. --out-file my.output.prefix results in my.output.vcf). If I provide --out-file path/to/out/vcf/file.vcf, it tries to generate path/to/out/genes.tsv/file.genes.tsv.

One potential solution that might clarify things would be to change --out-file to --out-prefix and not have any suffix-parsing/overwriting. In the meantime, I need to switch to specifying --out-file out.file.dummysuffix.
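The last-period behaviour described in this comment can be sketched as follows. This is an illustration under the assumption that the file name is cut at its final dot before the format extension is added; swapLastExtension is a hypothetical helper, not an Exomiser method.

```java
final class LastDotSketch {

    // Assumed behaviour: drop everything after the last '.' in the file name
    // (dots in directory names are ignored), then add the format extension.
    static String swapLastExtension(String outFile, String extension) {
        int lastSeparator = Math.max(outFile.lastIndexOf('/'), outFile.lastIndexOf('\\'));
        int lastDot = outFile.lastIndexOf('.');
        String base = lastDot > lastSeparator ? outFile.substring(0, lastDot) : outFile;
        return base + "." + extension;
    }

    public static void main(String[] args) {
        // The trailing ".prefix" is treated as an extension and lost.
        System.out.println(swapLastExtension("my.output.prefix", "vcf"));
        // A throwaway suffix protects the part of the name that matters.
        System.out.println(swapLastExtension("out.file.dummysuffix", "vcf"));
    }
}
```

Under that assumption, a dummy suffix works as a stopgap because it is the only part of the name that gets discarded.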

julesjacobsen commented 9 years ago

Indeed, in the interim you could always use hyphens instead of dots within the file name and use the dot to distinguish the file extension like a sane person would.

If you like I could look at implementing what they did for InterProScan5. Apparently no one ever complained about the file options for this.

These are the relevant options they have:

 -b,--output-file-base <OUTPUT-FILE-BASE>   Optional, base output filename
                                            (relative or absolute path).
                                            Note that this option, the
                                            --output-dir (-d) option and
                                            the --outfile (-o) option are
                                            mutually exclusive.  The
                                            appropriate file extension for
                                            the output format(s) will be
                                            appended automatically. By
                                            default the input file
                                            path/name will be used.
 -d,--output-dir <OUTPUT-DIR>               Optional, output directory.
                                            Note that this option, the
                                            --outfile (-o) option and the
                                            --output-file-base (-b) option
                                            are mutually exclusive. The
                                            output filename(s) are the
                                            same as the input filename,
                                            with the appropriate file
                                            extension(s) for the output
                                            format(s) appended
                                            automatically .

 -f,--formats <OUTPUT-FORMATS>              Optional, case-insensitive,
                                            comma separated list of output
                                            formats. Supported formats are
                                            TSV, XML, GFF3, HTML and SVG.
                                            Default for protein sequences
                                            are TSV, XML and GFF3, or for
                                            nucleotide sequences GFF3 and
                                            XML.

 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>    Optional explicit output file
                                            name (relative or absolute
                                            path).  Note that this option,
                                            the --output-dir (-d) option
                                            and the --output-file-base
                                            (-b) option are mutually
                                            exclusive. If this option is
                                            given, you MUST specify a
                                            single output format using the
                                            -f option.  The output file
                                            name will not be modified.
                                            Note that specifying an output
                                            file name using this option
                                            OVERWRITES ANY EXISTING FILE.

It would still be helpful if you could give an example of the input settings you provided and the output you expected, so that I can add it to some tests to ensure the application behaves as expected.

buske commented 9 years ago

Haha touché. That said, the VCF files I get are often named things like 2013.08.23.11.07.16_GenomeSub_mcgill_vcf_316_HCSl_Marshfield06.flt.vcf, and, as per custom, I would usually try to add another suffix to the end with each processing step (e.g. file.flt.annotated.subset.vcf).

After much consternation, I settled on a more sensible output filename, and ran the following:

java -Xms2G -Xmx5G -jar /filer/tools/exomiser/exomiser-cli-6.0.0/exomiser-cli-6.0.0.jar \
  --min-qual 30 --max-freq 1.0 --out-format TAB-VARIANT \
  --prioritiser hiphive --keep-off-target false --keep-non-pathogenic false \
  --hpo-ids HP:0000047,HP:0000154,HP:0000219,HP:0000322,HP:0000325,HP:0000369,HP:0000445,HP:0000446,HP:0002474,HP:0002608,HP:0003189,HP:0005274,HP:0006889,HP:0009765 \
  --vcf /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf \
  --out-file /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf

(Technically, I ran it with --out-format TAB-GENE,TAB-VARIANT,VCF, but we'll ignore that for now.) I was hoping it would generate the specified output file:

/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf

Unfortunately, it instead tries to create:

/dupa-filer/buske/phenomecentral/geno/variant.tsv/F0000009/F0000009.variant.tsv

IPS5's solution seems a bit like overkill to me. I'd still suggest the long-term solution be an --out-prefix PREFIX, where you then create PREFIX.vcf, PREFIX.variant.tsv, or any other suffixes that are specified by the out-formats. I dislike --out-file because at the end of the day, it doesn't really set the name of the output file, it's a suggestion that is only the actual output file if the suffix matches and only that out-format is specified. :)

buske commented 9 years ago

I've found the out-prefix pattern to be used pretty extensively, e.g. by samtools and plink.

julesjacobsen commented 9 years ago

OK, so let's formalise this and I'll close this today. Basically, whatever the input name or output prefix is, exomiser will simply append the file extension for each specified output format. The exception is when no out-prefix is specified, in which case -exomiser-results is inserted between the input filename and the output format file extension.
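The naming rule above could be sketched roughly like this. It is only an illustration: the class, the method, and the format-to-extension table are assumptions inferred from the file names quoted in this thread, not the actual Exomiser implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OutputNamingSketch {

    // Extension per output format, inferred from the file names in this thread.
    static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();
    static {
        EXTENSIONS.put("TAB-GENE", "genes.tsv");
        EXTENSIONS.put("TAB-VARIANT", "variants.tsv");
        EXTENSIONS.put("VCF", "vcf");
        EXTENSIONS.put("HTML", "html");
    }

    // The prefix is never parsed or trimmed; an extension is only ever appended.
    static List<String> outputFileNames(String vcfPath, String outPrefix, List<String> formats) {
        String prefix = (outPrefix == null || outPrefix.isEmpty())
                ? vcfPath + "-exomiser-results" // default when no out-prefix is given
                : outPrefix;
        List<String> names = new ArrayList<>();
        for (String format : formats) {
            String extension = EXTENSIONS.get(format);
            if (extension == null) {
                throw new IllegalArgumentException("Unknown output format: " + format);
            }
            names.add(prefix + "." + extension);
        }
        return names;
    }
}
```

Because nothing is stripped from the prefix, dots and the string vcf anywhere in the path are left alone, at the cost of names such as F0000009.vcf.vcf when the prefix already ends in .vcf.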

Given the exomiser settings with a specified out-prefix

--vcf /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf 
--out-prefix /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf
--out-format TAB-GENE,TAB-VARIANT,VCF,HTML

When exomiser writes out the results files Then they will be named:

/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.vcf
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.genes.tsv
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.variants.tsv
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.html

Given the exomiser settings with no specified out-prefix

--vcf /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf 
--out-format TAB-GENE,TAB-VARIANT,VCF,HTML

When exomiser writes out the results files Then they will be named:

/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf-exomiser-results.vcf
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf-exomiser-results.genes.tsv
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf-exomiser-results.variants.tsv
/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/2012.07.05.09.38.07_GenomeSub_mcgill_vcf_KB_174_81272.vcf-exomiser-results.html

buske commented 9 years ago

@julesjacobsen This is great. Thanks, Jules!