Zymo-Research / miqScoreShotgunPublic

Repo for public-facing shotgun MIQ score
GNU General Public License v3.0
6 stars 5 forks source link

SOLVED: Taxa not appearing in radar/composition plots, but listed in the ReadFateTable #7

Open JulianMohr opened 2 years ago

JulianMohr commented 2 years ago

Hello community,

I recently performed a metagenomic analysis on sequencing data generated from a zymomock community by Nanopore Sequencing (MinION) and wanted to assess the sequencing bias by using this tool.

Unfortunately, the generated HTML report showed radar/composition plots without two of the ten analyzed taxa. In particular, the taxa Lactobacillus fermentum and Cryptococcus neoformans were not plotted, but instead listed in the ReadFateTable at the bottom of the report. L. fermentum makes up about 50 % of the total amount of reads, obviously indicating that the run was biased. Could that be the reason for the plotting behaviour?

The miqScoreShotgunPublic tool was executed using the following command:

docker container run \
            -v /path/to/miq_score_analysis:/data \
            -e SAMPLENAME=zymomock_run_20210818_guppy_5_0_16_sup \
            -e MODE=LONG \
            miqscoreshotgun

After the analysis completed successfully, the following report was generated:

radar_plots composition_plots_read_fate

JulianMohr commented 2 years ago

Since I was able to solve this issue by myself, I want to present my workaround to avoid the reported behaviour regarding the HTML report generation.

Unfortunately, when using it in LONG mode, the tool automatically assumes that you have sequenced the ZymoBIOMICS HMW DNA Standard (Cat. # D6322), which was especially designed for long read sequencing. However, we sequenced a ZymoBIOMICS Microbial Community Standard (Cat. # D6300). The difference between these two standards is the overall composition: The HMW DNA Standard does not contain the two species Lactobacillus fermentum and Cryptococcus neoformans, whereas the Microbial Community Standard does contain these. Therefore, the tool only picks the rermaining 8 species, when running in LONG mode, though it also detects the reads of the two species that would not be in a HMW standard. Furthermore, it calculates the MIQ Score by only considering the compositional properties of a HMW DNA Standard (gDNA abundance of 14 % for each bacterial species and 2 % for each yeast instead of 12 % for each bacterial species and 2 % for each yeast).

This is truly a limiting issue since you would not be able to use the LONG mode for the Microbial Community Standard, even though you had sequenced it using long read sequencing. Not using the LONG mode might be an option, if read mapping using BWA instead of minimap2 would be acceptable. In our case, we decided to modify the source code of the tool in order to enable the LONG mode for the Microbial Community Standard.

This can be achieved by changing the function getApplicationParametersLong() declared in the file analyzeStandardReads.py (located directly in the miqScoreShotgunPublic/ repository) from:

def getApplicationParametersLong():
    parameters = miqScoreShotgunPublicSupport.parameters.environmentParameterParser.EnvParameters()
    parameters.addParameter("sampleName", str, required=True, externalValidation=True)
    parameters.addParameter("maxReadCount", int, default=default.maxReadCount, lowerBound=0)
    parameters.addParameter("workingFolder", str, default=default.workingFolder, expectedDirectory=True)
    parameters.addParameter("reads", str, default = default.forwardReads, expectedFile=True)
    parameters.addParameter("sequenceFolder", str, default.sequenceFolder, expectedDirectory=True)
    parameters.addParameter("outputFolder", str, default=default.outputFolder, createdDirectory=True)
    parameters.addParameter("referenceGenome", str, default=default.referenceGenome, expectedFile=True)
    parameters.addParameter("fileNamingStandard", str, default="zymo", externalValidation=True)
    parameters.addParameter("referenceDataFile", str, default=default.referenceDataFileHMW, expectedFile=True)
    parameters.addParameter("bacteriaOnly", bool, required=False, default=False)
    if parameters.bacteriaOnly.value:
        print("ANALYZING BACTERIA ONLY")
        defaultBadExample = default.badMiqExampleBacteriaOnlyHMW
        defaultGoodExample = default.goodMiqExampleBacteriaOnlyHMW
    else:
        defaultBadExample = default.badMiqExampleHMW
        defaultGoodExample = default.goodMiqExampleHMW
    parameters.addParameter("goodMiqExample", str, default=defaultGoodExample, expectedFile=True)
    parameters.addParameter("badMiqExample", str, default=defaultBadExample, expectedFile=True)
    if not validSampleName(parameters.sampleName.value):
        logger.error("Invalid sample name given: %s" %parameters.sampleName.value)
        raise ValueError("Invalid sample name given: %s" %parameters.sampleName.value)
    parameters.checkCreatedFileStructures()
    return parameters

to

def getApplicationParametersLong():
    parameters = miqScoreShotgunPublicSupport.parameters.environmentParameterParser.EnvParameters()
    parameters.addParameter("sampleName", str, required=True, externalValidation=True)
    parameters.addParameter("maxReadCount", int, default=default.maxReadCount, lowerBound=0)
    parameters.addParameter("workingFolder", str, default=default.workingFolder, expectedDirectory=True)
    parameters.addParameter("reads", str, default = default.forwardReads, expectedFile=True)
    parameters.addParameter("sequenceFolder", str, default.sequenceFolder, expectedDirectory=True)
    parameters.addParameter("outputFolder", str, default=default.outputFolder, createdDirectory=True)
    parameters.addParameter("referenceGenome", str, default=default.referenceGenome, expectedFile=True)
    parameters.addParameter("fileNamingStandard", str, default="zymo", externalValidation=True)
    parameters.addParameter("referenceDataFile", str, default=default.referenceDataFile, expectedFile=True)  # change 1
    parameters.addParameter("bacteriaOnly", bool, required=False, default=False)
    if parameters.bacteriaOnly.value:
        print("ANALYZING BACTERIA ONLY")
        defaultBadExample = default.badMiqExampleBacteriaOnly  # change 2
        defaultGoodExample = default.goodMiqExampleBacteriaOnly  # change 3
    else:
        defaultBadExample = default.badMiqExample  # change 4
        defaultGoodExample = default.goodMiqExample  # change 5
    parameters.addParameter("goodMiqExample", str, default=defaultGoodExample, expectedFile=True)
    parameters.addParameter("badMiqExample", str, default=defaultBadExample, expectedFile=True)
    if not validSampleName(parameters.sampleName.value):
        logger.error("Invalid sample name given: %s" %parameters.sampleName.value)
        raise ValueError("Invalid sample name given: %s" %parameters.sampleName.value)
    parameters.checkCreatedFileStructures()
    return parameters

Note, that I have highlighted the modified code lines by adding # change 1-5 at the end of each line. In fact, you only need to remove the "HMW" suffix in the related commands.

After applying the above changes, the docker container (my prefered method of executing the tool) needs to be re-build:

cd miqScoreShotgunPublic/
docker build -t miqscoreshotgun .

Then, the analysis can be repeated and will result in a complete HTML report.

JulianMohr commented 2 years ago

Even though the above solution worked in my case, I would like to keep this issue open and request a feature:

The implementation of a new command line option in order to choose the type of standard (especially when using the LONG mode) would solve this issue completely.

shannoneagle commented 2 years ago

Thanks @JulianMohr for the fix! I agree it would be great if there was a command line option to select the type of standard.