AdamaJava / adamajava


tidy up qcoverage #294

Closed ChristinaXu2017 closed 2 years ago

ChristinaXu2017 commented 2 years ago

Description

  1. update the document with the new format
  2. remove unused messages inside the property file
  3. rename the options
  4. delete unused qtesting from the build file
  5. deprecate segment related options
  6. deprecate unused classes: Feature.java Main.java Segment.java Segmenter.java Shard.java
  7. LoadReferencedClasses removed
  8. There were eight output type combinations, and they were confusing: it was hard to know whether the output format would be TXT, XML or VCF without running every option combination. For example:

java -jar qcoverage.jar -t phys --gff3 $gff3 --bam $bam --log $output.log --output $output [options]

  9. The new option "--output-format" specifies the output format explicitly. For the command above, the option combinations and their outputs are:
    • options: none or "--output-format txt"; output name: $output.txt (TXT file size 24K)
    • options: "--output-format xml"; output name: $output.xml (XML file size 41K)
    • options: "--output-format vcf"; throws exception: "Only per-feature mode can produce VCF format output"
    • options: "--output-format vcf --output-format xml"; throws exception: "Only per-feature mode can produce VCF format output"
    • options: "--per-feature" or "--per-feature --output-format txt"; output name: $output.txt (TXT file size 229K)
    • options: "--per-feature --output-format xml"; output name: $output.xml (XML file size 532K)
    • options: "--per-feature --output-format vcf"; output name: $output.vcf (VCF file size 6.8K)
    • options: "--per-feature --output-format vcf --output-format xml"; output names: $output.xml (XML file size 532K), $output.vcf (VCF file size 6.8K)
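
The combinations listed above can be summarised as a small rule: default to TXT, reject VCF outside per-feature mode, and append one extension per requested format. The following is only a sketch of that rule, not the code in this PR; `OutputFormatSketch`, `Format` and `plannedOutputs` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the option rules described above; not the qcoverage implementation.
final class OutputFormatSketch {

    // Declaration order decides the order in which output names are listed.
    enum Format { TXT, XML, VCF }

    // Returns the output file names implied by the requested formats.
    static List<String> plannedOutputs(String outputPrefix, boolean perFeature, Set<Format> requested) {
        // No --output-format given: fall back to the documented default, TXT.
        Set<Format> formats = requested.isEmpty() ? EnumSet.of(Format.TXT) : EnumSet.copyOf(requested);

        // VCF is only available in per-feature mode, even when other formats are also requested.
        if (formats.contains(Format.VCF) && !perFeature) {
            throw new IllegalArgumentException("Only per-feature mode can produce VCF format output");
        }

        List<String> files = new ArrayList<>();
        for (Format format : formats) {
            files.add(outputPrefix + "." + format.name().toLowerCase(Locale.ROOT));
        }
        return files;
    }

    public static void main(String[] args) {
        // Mirrors "--per-feature --output-format vcf --output-format xml".
        System.out.println(plannedOutputs("out", true, EnumSet.of(Format.VCF, Format.XML)));
    }
}
```

Because the check runs before any file name is produced, a VCF request in standard mode fails even when XML is also requested, which matches the fourth combination in the list.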

Overall, the tool's options have been updated, and the new usage is shown below:

 usage: java -jar qcoverage.jar --type <type of coverage>  --input-bam <bam file> --input-gff3 <gff3 file> --output <output prefix> --log <log file> [options]
Option              Description                                                      
------              -----------                                                      
--help              Show usage and help.                                             
--input-bai         Opt, a BAI index file for the BAM file. Def=<input-bam>.bai.     
--input-bam         Req, a BAM input file containing the reads.                      
--input-gff3        Req, a GFF3 input file defining the features.                    
--log               Req, log file.                                                   
--loglevel          Opt, logging level [INFO,DEBUG], Def=INFO.                       
--output            Req, the output file path. Here, filename extension will         
                      automatically added.                                           
--output-format     Opt, specify output file format, multi values are allowed.       
                      Possible values: [VCF, TXT, XML]. Def=TXT.                     
--per-feature       Opt, to run the per-feature coverage mode. Default is to run     
                      standard coverage mode without this option.                    
--query             Opt, the query string for selecting reads for coverage.          
--thread <Integer>  Opt, number of worker threads (yields n+1 total threads).        
--type              Req, the type of coverage to perform. Possible Values: [seq,     
                      sequence, phys, physical].                                     
--validation        Opt, how strict to be when reading a SAM or BAM. Possible values:
                      [STRICT, LENIENT, SILENT].                                     
--version           Show version number.   


How Has This Been Tested?

The unit tests were updated, and the tool was tested on a real data set.

Are WDL Updates Required?

common/qcoverage.wdl and somaticDnaFastqToMaf.wdl will need to be updated to reflect the new option names.


holmeso commented 2 years ago

There is a bug package that has been added. Is this intentional?

ChristinaXu2017 commented 2 years ago

> There is a bug package that has been added. Is this intentional?

what is it?

holmeso commented 2 years ago

> > There is a bug package that has been added. Is this intentional?
>
> what is it?

There are a number of classes in a package called bug, e.g. qcoverage/bug/CoverageJobTest.java, that you have committed as part of this PR.

ChristinaXu2017 commented 2 years ago

Thanks for spotting it. The "bug" package has now been deleted.

holmeso commented 2 years ago

The current Coverage.saveCoverageReport method does the following:

        if (options.hasXmlFlag()) {
            writeXMLCoverageReport(stats);
//          if (options.hasPerFeatureOption())
//              writeVCFReport(stats);
        } else if (options.hasPerFeatureOption()) {
            writePerFeatureTabDelimitedCoverageReport(stats);
        } else {
            writePerTypeTabDelimitedCoverageReport(stats);
        }

whereas the version in this PR does:

        if (options.hasXmlFlag()) {
            writeXMLCoverageReport(stats);
        }

        if (options.hasTxtFlag()) {
            writePerFeatureTabDelimitedCoverageReport(stats);
        }

so it looks like the output behaviour is changing. Is this intentional? If so, could you please update the description in this PR to record what the new behaviour is, and why it is better than the old behaviour? I presume that the documentation has also been updated?
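
To make the difference concrete, the two snippets quoted above can be modelled as pure functions that return the names of the writer methods each version would call. This is a sketch only: `ReportDispatchSketch`, `oldWriters` and `newWriters` are hypothetical names, and it assumes the quoted fragments show the complete branching logic, which may not be the case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two quoted saveCoverageReport variants; not the real classes.
final class ReportDispatchSketch {

    // Current behaviour: exactly one writer is chosen.
    static List<String> oldWriters(boolean xmlFlag, boolean perFeature) {
        if (xmlFlag) {
            return List.of("writeXMLCoverageReport");
        } else if (perFeature) {
            return List.of("writePerFeatureTabDelimitedCoverageReport");
        } else {
            return List.of("writePerTypeTabDelimitedCoverageReport");
        }
    }

    // PR behaviour as quoted: XML and TXT are independent, and per-feature mode is not consulted.
    static List<String> newWriters(boolean xmlFlag, boolean txtFlag) {
        List<String> writers = new ArrayList<>();
        if (xmlFlag) {
            writers.add("writeXMLCoverageReport");
        }
        if (txtFlag) {
            writers.add("writePerFeatureTabDelimitedCoverageReport");
        }
        return writers;
    }

    public static void main(String[] args) {
        // Standard (not per-feature) mode with TXT output only.
        System.out.println("old: " + oldWriters(false, false));
        System.out.println("new: " + newWriters(false, true));
    }
}
```

Under that assumption, a standard-mode TXT run moves from the per-type writer to the per-feature writer, and requesting both XML and TXT now produces two reports instead of one.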

Also, could you please add a check on the output file extension? The WDL currently sets the output name with the xml file extension, and if this is not changed, the generated output will have a double xml suffix, which could impact downstream processes. Thanks

ChristinaXu2017 commented 2 years ago

The pull request description has been updated. The tool now throws an exception if VCF output is requested outside per-feature mode.

holmeso commented 2 years ago

Thanks for the fileNameCorrection method in the Coverage class. This will prevent duplication of file extensions.

A few suggestions - perhaps you could raise an issue for these?

I think that the method would be better placed as a static method in the FileUtils class in the qcommon package. That way it could easily be used by other classes. If it's static, then it is not going to modify any state, which makes it referentially transparent. Null guards should be put in place for the parameters. And finally, a unit test should be added.
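
As a starting point for such an issue, a static helper with null guards might look like the sketch below. The actual signature of fileNameCorrection is not shown in this thread, so `FileNameSketch` and `ensureExtension` are hypothetical names, and the case-insensitive comparison is an assumption rather than the existing behaviour.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical sketch of a static extension helper; not the method in the Coverage class.
final class FileNameSketch {

    private FileNameSketch() {
        // Utility class: no instances.
    }

    // Appends the extension unless the path already ends with it, so "out.xml" never becomes "out.xml.xml".
    static String ensureExtension(String path, String extension) {
        Objects.requireNonNull(path, "path must not be null");
        Objects.requireNonNull(extension, "extension must not be null");

        // Accept both "xml" and ".xml" as the extension argument.
        String suffix = extension.startsWith(".") ? extension : "." + extension;

        // Assumption: the comparison ignores case, so "out.XML" is left unchanged.
        boolean alreadyPresent = path.toLowerCase(Locale.ROOT).endsWith(suffix.toLowerCase(Locale.ROOT));
        return alreadyPresent ? path : path + suffix;
    }

    public static void main(String[] args) {
        System.out.println(ensureExtension("coverage.xml", "xml"));
        System.out.println(ensureExtension("coverage", "xml"));
    }
}
```

The method depends only on its arguments, so a unit test can cover it with plain input/output pairs plus one case per null parameter.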

holmeso commented 2 years ago

Please make sure you make the necessary changes to the WDL files as soon as this is merged into master.