ChristopherWilks / megadepth

BigWig and BAM utilities

Not generating TSVs for --coverage and --auc #17

Open mayank-singh-sikarwar opened 10 months ago

mayank-singh-sikarwar commented 10 months ago

Hi,

Thank you for creating this wonderful tool. I am using it in one of my Nextflow workflows and have noticed some unexpected behaviours, which are as follows:

Image used from the biocontainer repo, with tags 1.2.0--h6ab5fc9_5 and 1.2.0--hff880f7_4:

  1. Only generates .bw files, whether I use --coverage, --bigwig, or both.
  2. Does not generate an auc.tsv file even when the --auc parameter is provided in the command.

Image used from the broadsword repo:

  1. The process throws an error saying "megadepth command does not exist". I have tested the first two images, and it didn't work for either of them.

Nextflow is running on an Ubuntu 22.04.3 (Jammy Jellyfish) machine, and I am using Singularity to execute the command given below.

    megadepth ${sortedBam} \
    --threads ${task.cpus} \
    --coverage \
    --no-head \
    --require-mdz \
    --min-unique-qual 10 \
    --frag-dist ${tempDir} \
    --annotation ${bedFile} \
    --bigwig \
    --auc \
    --alts \
    --junctions ${tempDir} \
    --prefix ${sampleId}

Thanks in advance for your help; any inputs on this would be greatly appreciated. Lastly, the Docker link given in the documentation does not work properly.

ChristopherWilks commented 10 months ago

Hi @mayank-singh-sikarwar,

Thanks for the detailed bug report.

I've uploaded the latest (fixed) Docker image for 1.2.0 to the public Docker repo. I can't specifically speak to the other repo (biocontainer) since I don't maintain those images.

You should be able to run the megadepth docker command via something like this command line:

    docker run -v <filesystem2mount>:<container_mount_path> quay.io/broadsword/megadepth:1.2.0 \
        <container_mount_path>/<BAM> \
        --threads 4 \
        --coverage \
        --no-head \
        --require-mdz \
        --min-unique-qual 10 \
        --frag-dist \
        --annotation <container_mount_path>/BED \
        --auc \
        --alts \
        --junctions \
        --prefix <container_mount_path>/<output_file_prefix> \
        --no-auc-stdout \
        --no-annotation-stdout \
        --no-coverage-stdout \
        --gzip

Note, you don't need to include any additional arguments for --frag-dist or --junctions; they'll just use the --prefix as the prefix for their output filename/path.
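To make the long command line above easier to adapt, here's a small sketch (not the canonical invocation; file names like `sample.bam` and the `/data` mount path are illustrative) that collects the flag list into one variable so it can be reused:

```shell
# Sketch: gather the megadepth flags from the command above into one
# variable so the same set can be reused across invocations.
# BAM/BED/prefix names below are placeholders, not real files.
BAM=sample.bam
BED=annotation.bed
PREFIX=sample1
MD_ARGS="--threads 4 --coverage --no-head --require-mdz \
--min-unique-qual 10 --frag-dist --annotation /data/$BED --auc --alts \
--junctions --prefix /data/$PREFIX --no-auc-stdout \
--no-annotation-stdout --no-coverage-stdout --gzip"

# Echo the assembled command rather than running it, so this sketch
# works without Docker installed; drop the 'echo' to actually run it.
echo docker run -v "$PWD":/data quay.io/broadsword/megadepth:1.2.0 \
  /data/"$BAM" $MD_ARGS
```

The same `$MD_ARGS` string can then be dropped into a Singularity or Nextflow invocation without retyping every flag.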

Also, what's your end use case for the various megadepth outputs?

I ask because that will affect which arguments you want to use in terms of output formats.

For example, if you pass --bigwig, the full base-level coverage output will always be written only to a BigWig file, no matter what other options are passed. This does not apply to the unique base-level coverage, however, which can be written to a compressed TSV as well as a BigWig if --gzip is passed along with --bigwig and --min-unique-qual 10.

ChristopherWilks commented 10 months ago

(Also, note the --no-auc-stdout --no-annotation-stdout --no-coverage-stdout --gzip in my command. Since there's no --bigwig, those will 1) force the AUC, annotation coverage, and base coverage to be written to files rather than standard out (the default), and 2) gzip the annotation and base-level coverage files.)

ChristopherWilks commented 10 months ago

Actually, I can see there are one or more bugs around the non-BigWig base-level coverage output, at least when --min-unique-qual 10 is specified. Let me look into fixing that.

mayank-singh-sikarwar commented 10 months ago

Thanks for the quick response and for uploading the new image. Sorry to be the bearer of bad news: the latest image still does not recognise "megadepth" as a command. I am not sure if I am doing something wrong or if Nextflow is interpreting it differently. When I used the biocontainer image, though, the process did accept megadepth from the command line. Here is the process from my Nextflow file:

process BAM_COUNT {
    tag "read counts from bam-${sampleId}"
    container 'docker://quay.io/broadsword/megadepth:1.2.0'
    publishDir "$params.staging/$sampleId", pattern: '*', mode: 'copy'

    cpus 4
    memory '16 GB'

    input:
    val(sampleId)
    tuple path(sortedBam), path(sortedBai), path(idxstats)
    path bedFile

    output:
    path("*")

    script:
    """
    megadepth ${sortedBam} \
    --threads ${task.cpus} \
    --coverage \
    --no-head \
    --require-mdz \
    --min-unique-qual 10 \
    --frag-dist \
    --junctions \
    --annotation ${bedFile} \
    --auc \
    --prefix ${sampleId} \
    --no-auc-stdout \
    --no-annotation-stdout \
    --no-coverage-stdout \
    --gzip
    """
}

I checked the Linux binary documentation and found out about the --no-auc-stdout and --no-coverage-stdout flags, which are also mentioned in the repo documentation.

My use case: for now I don't have a particular biological hypothesis. I came across your work in the recount3 paper (which is awesome). I am trying to recapitulate and tinker with the important steps of the Monorail workflow. The Monorail images provided are really useful and easy to use, but the pump step has a few extra steps, and unify is kind of a black box to me. It becomes much easier for me to understand the steps deeply if I try them myself. For now my focus is to get the gene and exon sums produced using megadepth, but I am interested in the junction information as well, which I will use later. Therefore, I was trying to generate TSVs for both the "all" and "unique" analysis types, since they are easy to handle and process and don't need extra tools.

Confirmation: I just want to confirm that I understand this correctly. Megadepth will always generate a BigWig file, but not a TSV file, for full base-level coverage if the --bigwig flag is passed.

Questions:

  1. Does this mean that if only --coverage is passed, megadepth will generate a TSV for full base-level coverage? Also, if --min-unique-qual is passed with it, will it generate TSVs for the "all" as well as the "unique" base-level coverage?
  2. This is related to the image creation. Is it configured to take 'megadepth' as a command? I was looking at the Docker files but couldn't pinpoint which one you are using to build the Docker image.
  3. About the bugs you have found: do they affect the sum-counting part, or only the file-generation behaviour? Please let me know if I can be of any help in fixing this.

ChristopherWilks commented 10 months ago

Hi @mayank-singh-sikarwar,

Thanks for the followup, good to know your goals with using megadepth (and monorail/recount3).

I'll have to look into running megadepth via the docker container question as I'm not sure at the moment why that wouldn't work in nextflow but does from my command line.

As for your questions, I'll answer them in reverse order:

3. The bug only affects the output formats generated, though in certain cases it will generate 0-byte files for the formats it's supposed to be generating but doesn't. I need to fix this, obviously. The trick is supporting and testing the various combinations of arguments the user might choose, which I haven't done a good job of when it comes to the --coverage-related arguments.

2. The megadepth build is regrettably complex, but as far as Docker goes, the build script is here: https://github.com/ChristopherWilks/megadepth/blob/master/create_docker_to_run_megadepth.sh, and it uses https://github.com/ChristopherWilks/megadepth/blob/master/Dockerfile.run as the Dockerfile.

1. I'll have to check again, but from what I saw recently, megadepth will never generate a TSV for the full base-level read coverage in the all-reads case (no min-value cutoff); this is not expected behavior. However, it will for the min-value cutoff argument in certain cases. This is partly because TSV output of base coverage was a later add-on and wasn't a primary goal of the original megadepth (bamcount).
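Until the bug is fixed, a quick way to spot the 0-byte outputs mentioned in point 3 is to scan for empty files under the output prefix. This is only a sketch; the prefix and filenames below are stand-ins created for the demonstration, not real megadepth output:

```shell
# Sketch: detect zero-byte megadepth outputs, which indicate a format
# that should have been written but wasn't. Files below are stand-ins.
PREFIX=sample1
touch "${PREFIX}.auc.tsv"                      # stand-in for an empty (buggy) output
printf '1\t100\n' > "${PREFIX}.annotation.tsv" # stand-in for a real output

# List any empty TSVs for this prefix in the current directory.
empty=$(find . -maxdepth 1 -name "${PREFIX}*.tsv" -size 0)
echo "zero-byte outputs: $empty"
```

Running this kind of check after each megadepth invocation makes it easy to tell "format not generated" apart from "format generated but empty".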

mayank-singh-sikarwar commented 10 months ago

Hi @ChristopherWilks ,

Thank you so much for answering my questions and looking into this issue. Also, thanks for sharing the build file link, because now it all makes sense. I had been looking at the wrong file (Dockerfile.build) instead of Dockerfile.run, which is why I wasn't invoking the command correctly in my Nextflow process. I used your latest image with a command similar to "/megadepth <bam|bw> --options" and it works as expected. Just documenting this for future users, in case they end up here.

I would also like to share my observations, and I would be grateful if you could share your inputs on the behaviour.

Parameters used:

# without passing --bigwig
/megadepth ${sortedBam} \\
    --threads ${task.cpus} \\
    --prefix ${sampleId} \\
    --no-auc-stdout \\
    --no-annotation-stdout \\
    --no-coverage-stdout \\
    --coverage \\
    --no-head \\
    --require-mdz \\
    --annotation ${bedFile} \\
    --min-unique-qual 10 \\
    --auc \\
    --frag-dist \\
    --junctions

Observations:

  1. With the parameters provided, megadepth generates ${sampleId}.coverage.tsv, ${sampleId}.unique.tsv, ${sampleId}.jxs.tsv, ${sampleId}.frags.tsv, ${sampleId}.auc.tsv, and ${sampleId}.annotation.tsv.
  2. As the --bigwig param was not used, no file with a .bw extension was generated.
  3. ${sampleId}.coverage.tsv: you briefly mention in this [section](https://github.com/ChristopherWilks/megadepth#bamcram-processing-subcommands) that a file with a coverage.tsv suffix will be generated in case --bigwig is passed. Is it possible that this could be true for just the --coverage param as well?
  4. Megadepth raises a segmentation fault (core dumped) when the --gzip param is used, but there are workarounds: pipe the results from megadepth to a gzip or zstd command.
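The piping workaround for point 4 looks like the following sketch. Since megadepth itself can't run here, `printf` stands in for megadepth writing its TSV stream to stdout (i.e., without the --no-coverage-stdout flags); the filename is illustrative:

```shell
# Workaround sketch for the --gzip segfault: let the tool write TSV to
# stdout and compress externally. 'printf' stands in for megadepth here.
printf 'chr1\t0\t100\t7\n' | gzip > sample1.annotation.tsv.gz

# Decompress to verify the round trip.
gzip -dc sample1.annotation.tsv.gz
```

The same pattern works with `zstd` in place of `gzip`, and avoids the in-tool compression path entirely.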

I also wanted to confirm that my understanding of the difference between the all.bw and unique.bw files is correct: all.bw contains base-level coverage from reads of any mapping quality, whereas unique.bw contains coverage only from reads with mapping quality greater than or equal to the threshold provided.

I believe the original bug is fixed, and you have clarified the rest of my doubts. I will be happy to close this issue after your final inputs, in case you want to track the other issues you have observed in a separate thread.