broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License
120 stars 34 forks source link

Dropulation execution errors #422

Open drneavin opened 6 months ago

drneavin commented 6 months ago

Hello,

I'm having trouble getting dropulation in Drop-seq v3.0.0 and v3.0.1 to work. For v3.0.0 I receive the errors when I try to run:

AssignCellsToSamples: 40: Syntax error: "(" unexpected

and for v3.0.1 I get the following error:

Illegal argument value: Positional arguments were provided ',barcodes.tsv}' but no positional argument is defined for this tool.

I'm not sure if this is an issue on my end or is an issue that anyone trying to use these functions will come across. I'm using java v21 and building in a singularity image - Ubuntu 20.04. Happy to provide more details to help resolve the issue.

mschilli87 commented 6 months ago

I could assist in testing. I have been using Drop-seq v3.x but I have not used dropulation from that version as I rely on Demuxafy for my droplet detection and demultiplexing needs (which is probably the reason @drneavin is testing out the latest version of Drop-seq/dropulation). 😅

alecw commented 6 months ago

@drneavin , please provide reproduction procedure, i.e. the command line you invoked.

drneavin commented 6 months ago

Thanks @mschilli87! Would be great to have help testing this. @alecw , sorry for not posting this before. The command I'm running for v3.0.1 is:

AssignCellsToSamples --CELL_BC_FILE $BARCODES \
            --INPUT_BAM $OUTDIR/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
            --OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
            --VCF $OUTDIR/../../v2.1.0/output/dropulation/invcf.gz \
            --SAMPLE_FILE $INDS \
            --CELL_BARCODE_TAG 'CB' \
            --MOLECULAR_BARCODE_TAG 'UB' \
            --VCF_OUTPUT $DROPULATION_OUTDIR/assignment.vcf \
            --MAX_ERROR_RATE 0.05 \
            --TMP_DIR $DROPULATION_OUTDIR 

For v3.0.0, the command that causes the positional error is:

AssignCellsToSamples --version

Let me know if there's anything else that could help with troubleshooting.

mschilli87 commented 6 months ago

If I run

AssignCellsToSamples --version

using Drop-seq v. 3.0.0, I get

AssignCellsToSamples: 40: ./bin/drop-seq_tools/AssignCellsToSamples: Syntax error: "(" unexpected

because of the #!/bin/sh shebang that resolved to dash on my Ubuntu system and that version features the

function usage () {
[...]
}

syntax not supported by POSIX shells.

Explicitely calling

bash $(which AssignCellsToSamples) --version

results in the expected

Version:3.0.0

I do remember fixing (or at least: trying to do so) this with a PR. I'll check and report back.

@drneavin: Can you confirm that /bin/sh is not bash on your system and explicitely using bash to call the (wrapper) script fixes the issue?


edit: I was talking about PR #398.

alecw commented 6 months ago

I'm guessing that there is whitespace in one of the shell variables. Try wrapping all you variables in double quotes, i.e.

AssignCellsToSamples --CELL_BC_FILE "$BARCODES" \
            --INPUT_BAM "$OUTDIR"/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
            --OUTPUT "$DROPULATION_OUTDIR"/assignments.tsv.gz \
            --VCF "$OUTDIR"/../../v2.1.0/output/dropulation/invcf.gz \
            --SAMPLE_FILE "$INDS" \
            --CELL_BARCODE_TAG 'CB' \
            --MOLECULAR_BARCODE_TAG 'UB' \
            --VCF_OUTPUT "$DROPULATION_OUTDIR"/assignment.vcf \
            --MAX_ERROR_RATE 0.05 \
            --TMP_DIR "$DROPULATION_OUTDIR"

You'll probably still get an error, but the message should be more informative.

mschilli87 commented 6 months ago

I checked and as expected, my PR #398 fixed the syntax error when using v. 3.0.1:

AssignCellsToSamples --version
USAGE: AssignCellsToSamples [arguments]

For a set of cells, and a VCF File,calculate the most likely sample assignment for each cell.  This is done bylooking at
each SNP/sample genotype, and calculating the likelihood of that resultgiven the pileup of reads at that genomic
location for each cell.  The log likelihoodsacross all genomic locations for each cell are calculated and summed.  Each
cell then has a likelihood across all SNPs for each of the samples in the VCF file.  The most likely sampleis assigned
to the cell, and the likelihood ratio of the sample assignment to the next best sample assignment is calculated to give
us a sense of how confident we are in the assignment.
Version:3.0.1

Required Arguments:

--CELL_BC_FILE <File>         Override NUM_CORE_BARCODES and process reads that have the cell barcodes in this file
                              instead.  The file has 1 column with no header.  Required.  Cannot be used in conjunction
                              with argument(s) NUM_BARCODES

--INPUT_BAM,-I <File>         The input SAM or BAM file to analyze. This argument can accept wildcards, or a file with
                              the suffix .bam_list that contains the locations of multiple BAM files  This argument must
                              be specified at least once. Required.

--NUM_BARCODES <Integer>      Number of cells that you want to extract from the BAM. The program will picks the top
                              <NUM_BARCODES> barcodes by read count.  Required.  Cannot be used in conjunction with
                              argument(s) CELL_BC_FILE

--OUTPUT,-O <File>            Output file of sample assignments. This supports zipped formats like gz and bz2.
                              Required.

--VCF <File>                  The input VCF file to analyze.  Required.

Optional Arguments:

--ADD_MISSING_VALUES <Boolean>on a per-snp basis, generate a mxiture of all the data across known samples to fill in
                              likelihoods for the missing samples.  Default value: true. Possible values: {true, false}

--ALLELE_FREQUENCY_ESTIMATE_FILE <File>
                              A file that contains an estimate of the allele frequency expected for each SNP across
                              donors. The best estimate of this will come from the fraction of reference and alternate
                              alleleUMIs that are observed at each snp site.  This report can be generated via
                              GatherDigitalAlleleCounts.  This is a fractional estimate between 0 and 1.  File is tab
                              seperated, with at least 3 columns:chromosome, position, maf_umi.  When supplied and
                              CELL_CONTAMINATION_ESTIMATE_FILE is provided, this modifies the likelihood error rates to
                              take into account how often the allele observed can be drawn from ambient RNA.  Default
                              value: null.

--ANSWER_KEY_FILE <File>      This file provides an answer key for the input BAM and VCF.  This does not change any of
                              the likelihood analysis results, but appends an additional column to the output that
                              contains the known sample assignment for a cell.This is useful when assessing how well the
                              method works, but is completely optional.  The format of the file is 2 tab seperated
                              columns with a header [cell       sample].  The first column contains the cell barcodes from
                              the BAM, the second the sample assignments from the VCF.  Default value: null.

--arguments_file <File>       read one or more arguments files and add them to the command line  This argument may be
                              specified 0 or more times. Default value: null.

--BAM_OUTPUT <File>           Output a version of the BAM containing only the reads that have coverage in the input BAM
                              and VCF.  Default value: null.

--CELL_BARCODE_TAG <String>   The cell barcode tag.  If there are no reads with this tag, the program will assume that
                              all reads belong to the same cell and process in single sample mode.  Default value: XC.

--CELL_CONTAMINATION_ESTIMATE_FILE <File>
                              A file that contains an estimate of how much ambient RNA is in each cell.  This is a
                              fractional estimate between 0 and 1.  File is tab seperated, with 2 columns:cell_barcode
                              and frac_contamination.  When supplied along side the ALLELE_FREQUENCY_ESTIMATE_FILE, this
                              modifies the likelihood error rates to take into account how oftenthe allele observed can
                              be drawn from ambient RNA. We use cellbender remove background
                              [https://github.com/broadinstitute/CellBender] to estimate the number of transcripts
                              before and after ambient cleanup to define the fraction of transcripts that come from
                              ambient RNA.  Default value: null.

--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF).  Default value: 5.

--CREATE_INDEX <Boolean>      Whether to create an index when writing VCF or coordinate sorted BAM output.  Default
                              value: false. Possible values: {true, false}

--CREATE_MD5_FILE <Boolean>   Whether to create an MD5 digest for any BAM or FASTQ files created.    Default value:
                              false. Possible values: {true, false}

--DNA_MODE <Boolean>          EXPERIMENTAL!!! Run the program in DNA Mode.  In this mode, reads should have a cell
                              barcode, but will be missing gene annotations and UMIs.  All reads will be accepted as
                              passing, and each read (or read pair) will be treated as a single UMI  If the data is PCR
                              Duplicate marked, duplicate reads will be filtered.   Default value: false. Possible
                              values: {true, false}

--EDIT_DISTANCE <Integer>     The edit distance that molecular barcodes should be combined at within a gene/SNP.
                              Default value: 1.

--FIXED_ERROR_RATE <Double>   Instead of useing base qualities to determine error rate, use a fixed error rate instead.
                              This is rounded to the nearest phread score internally.  Default value: null.

--FRACTION_SAMPLES_PASSING <Double>
                              At least <FRACTION_SAMPLES_PASSING> samples must have genotype scores >= GQ_THRESHOLD for
                              the variant in the VCF to be included in the analysis.  Default value: 0.5.

--FUNCTION_TAG <String>       The functional annotation for the read.  If set, extracts the functional annotation(s)
                              [CODING/UTR/etc] at each SNP position and outputs in the verbose output.  Default value:
                              XF.

--FUNCTIONAL_STRATEGY <FunctionalDataProcessorStrategy>
                              A strategy for interpreting functional annotations.  DropSeq is the default strategy.
                              STARSOLO strategy priority is very similar to DropSeq, exceptin cases where a read
                              overlaps both an intron on the sense strand and a coding region on the antisense strand.
                              In these cases, DropSeq favors the intronic interpretation, while STARSolo interprets this
                              as a technical artifact and labels the read as coming from the antisense coding gene, and
                              the read does not contribute to the expression counts matrix.  Default value: DROPSEQ.
                              Possible values: {DROPSEQ, STARSOLO}

--GENE_FUNCTION_TAG <String>  Gene Function tag.  For a given gene name <GENE_NAME_TAG>, this is the function of the
                              gene at this read's position: UTR/CODING/INTRONIC/...  Default value: gf.

--GENE_NAME_TAG <String>      Gene Name tag.  Takes on the gene name this read overlaps (if any)  Default value: gn.

--GENE_STRAND_TAG <String>    Gene Strand tag.  For a given gene name <GENE_NAME_TAG>, this is the strand of the gene.
                              Default value: gs.

--GQ_THRESHOLD <Integer>      The minimum genotype quality for a variant.  Set this value to 0 to not filter by GQ
                              scores if they are present, or to -1 to completely ignore GQ values if they are not set in
                              the genotype info field.  If the GQ field is not set in the VCF header, this will be set
                              to -1 by default.  Default value: 30.

--help,-h <Boolean>           display the help message  Default value: false. Possible values: {true, false}

--IGNORED_CHROMOSOMES <String>A list of chromosomes to omit from the analysis.  The default is to omit the sex
                              chromosomes.  This argument may be specified 0 or more times. Default value: [X, Y, MT].

--LOCUS_FUNCTION_LIST <LocusFunction>
                              A list of functional annotations that reads need to be completely contained by to be
                              considered for analysis.  This argument may be specified 0 or more times. Default value:
                              [CODING, UTR]. Possible values: {INTERGENIC, INTRONIC, UTR, CODING, RIBOSOMAL}

--MAX_ERROR_RATE <Double>     Caps the base error rate at a maximum probability so no SNP can be weighed more than this
                              value.  For example, if this value was 0.01, then a base quality 30 value (normally an
                              erro rate of 0.001) would become 0.01.  With the same threshold, a base with an error rate
                              of 0.1 would be unaffected.  Default value: null.

--MAX_RECORDS_IN_RAM <Integer>When writing files that need to be sorted, this will specify the number of records stored
                              in RAM before spilling to disk. Increasing this number reduces the number of file handles
                              needed to sort the file, and increases the amount of RAM needed.  Default value: 500000.

--MOLECULAR_BARCODE_TAG <String>
                              The molecular barcode tag.  Default value: XM.

--QUIET <Boolean>             Whether to suppress job-summary info on System.err.  Default value: false. Possible
                              values: {true, false}

--READ_MQ <Integer>           The map quality of the read to be included.  Default value: 10.

--REFERENCE_SEQUENCE,-R <File>Reference sequence file.  Default value: null.

--RETAIN_MONOMORPIC_SNPS <Boolean>
                              Should monomorphic SNPs across the population be retained?  Default value: false. Possible
                              values: {true, false}

--SAMPLE_FILE <File>          A file with a list of samples in the VCF to match up to cells in the BAM.  This subsets
                              the VCF into a smaller data set containing only the samples listed. The file has 1 column
                              with no header.  Default value: null.

--SNP_LOG_RATE <Integer>      Logging interval for SNP sequence pileup processing  Default value: 1000.

--STRAND_STRATEGY <StrandStrategy>
                              The strand strategy decides which reads will be used by analysis.  The SENSE strategy
                              requires the read and annotation to have the same strand.  The ANTISENSE strategy requires
                              the read and annotation to be on opposite strands.  The BOTH strategy is permissive, and
                              allows the read to be on either strand.  Default value: SENSE. Possible values: {SENSE,
                              ANTISENSE, BOTH}

--TMP_DIR <File>              One or more directories with space available to be used by this program for temporary
                              storage of working files  This argument may be specified 0 or more times. Default value:
                              null.

--TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD <Integer>
                              (Optional)  If set to a value, this program will quit early with an exception if this
                              number of UMIs are observed without encountering a transcribed SNP - that is, a variant
                              from the VCF file that is will be informative during donor assignment.  A value of -1
                              disables this test.  As some fraction of UMIs may notcontain transcribed SNPs, the value
                              for this should probably be set to a fairly large number, like 10% of the total UMIs in
                              your experiment.  Mostly useful for debugging / or testing new data sets.  Default value:
                              -1.

--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
                              Use the JDK Deflater instead of the Intel Deflater for writing compressed output  Default
                              value: false. Possible values: {true, false}

--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
                              Use the JDK Inflater instead of the Intel Inflater for reading compressed input  Default
                              value: false. Possible values: {true, false}

--VALIDATION_STRINGENCY <ValidationStringency>
                              Validation stringency for all SAM files read by this program.  Setting stringency to
                              SILENT can improve performance when processing a BAM file in which variable-length data
                              (read, qualities, tags) do not otherwise need to be decoded.  Default value: STRICT.
                              Possible values: {STRICT, LENIENT, SILENT}

--VCF_OUTPUT <File>           Output a version of the VCF containing only the variants that have coverage in the input
                              BAM.  All samples are retained.  This file can be used as the VCF input file on subsequent
                              runs of the same data set to greatly speed up the run, if the same BAM is being analyzed
                              with different conditions.  If you plan on calling doublets with DetectDoublets, you'll
                              need this file.  Default value: null.

--VERBOSE_BEST_DONOR_OUTPUT <File>
                              Verbose output of every cell/SNP result for the BEST donor. This supports zipped formats
                              like gz and bz2.  Default value: null.

--VERBOSE_OUTPUT <File>       Verbose output of every cell/SNP result. This supports zipped formats like gz and bz2.
                              Default value: null.

--VERBOSITY <LogLevel>        Control verbosity of logging.  Default value: INFO. Possible values: {ERROR, WARNING,
                              INFO, DEBUG}

--version <Boolean>           display the version number for this tool  Default value: false. Possible values: {true,
                              false}

Advanced Arguments:

--showHidden <Boolean>        display hidden arguments  Default value: false. Possible values: {true, false}

Argument INPUT_BAM was missing: Argument 'INPUT_BAM' is required

So while the version is reported now, the script does not exit early but instead continues which triggers the 'missing argument' error.

mschilli87 commented 6 months ago

@drneavin: Could you just echo your command to see all parameters expanded? Your error looks like there is some failed paramter expansion somehwere.

drneavin commented 6 months ago

Hi both,

For v3.0.0, @mschilli87, I think that makes sense. I don't have the singularity image with that version anymore but I don't see the same behaviour with 3.0.1 so I think your PR did fix that.

For v3.0.1, unfortunately, the double quotes didn't help - I still get the same error. And when I echo the command I don't see anything that's clearly incorrect:

AssignCellsToSamples --CELL_BC_FILE barcodes.tsv --INPUT_BAM ../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam --OUTPUT dropulation/assignments.tsv.gz --VCF ../../v2.1.0/output/dropulation/invcf.gz --SAMPLE_FILE donor_list.txt --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --VCF_OUTPUT dropulation/assignment.vcf --MAX_ERROR_RATE 0.05 --TMP_DIR dropulation

Note that I edited this to remove complete paths from our HPC for the barcode and the sample file.

I also did a test where I set all the variables and ran using Drop-seq v2.5.4 from a previous singularity image which ran successfully and then changed just the singularity image to the one with Drop-seq v3.0.1 and resulted in the same positional argument error. This suggests to me that there isn't an issue with the variables unless the variable handling has been changed and it is more sensitive than before? Definitely open to suggestions and other interpretations.

I also changed the order of the options (ie moved --CELL_BC_FILE barcodes.tsv to later in the command) and found that it always complains about the first parameter set. So once --CELL_BC_FILE barcodes.tsv was moved to later in the command, then it complains about the bam file instead of the barcode file:

Illegal argument value: Positional arguments were provided ',/directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/workflow_testing/v3.0.0/output/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam}' but no positional argument is defined for this tool.

This is the Demuxafy image with Drop-seq 3.0.1 installed if you wanted to test on your end to see if the result is the same or if you notice anything different about the setup or execution that are causing this problem:

wget -O Demuxafy.sif 'https://www.dropbox.com/scl/fi/owqniywk63nknynj6nrez/Demuxafy.sif?rlkey=nt1m99pvcaqjjho2wodjhx80u&st=eco4uozg'
mschilli87 commented 6 months ago

I am still suspecting an unescaped parameter expansion since the argument in the error message starts with a , and ends with a }. Not sure if this is on @drneavin's end or maybe in the wrapper shell script (in which case it might have been caused by my changes to the template). I'll try to find out and report back soon.

drneavin commented 6 months ago

Yes, I noticed that as well (the comma and the brace). But I double checked all my variables and none of them have a comma or a brace in them. Let me know if there's something else I can check? Would be great to get this working so I can include v3.0.1 in the next version of Demuxafy

mschilli87 commented 6 months ago

@drneavin: I would lime that very much so you definitely have me motivated to help. 😉

I had a look at the 3.0.1 shell scripts and I don't see any place where the AssignCellsToSamples wrapper is messing with arguments. AFAICT they are passed straight down to the dropseq wrapper. This one looks good as well AFAICT, but it does do some argument handling via shift/set. If there was anything wrong there, it would be a gradle bug though.

@alecw: Did you update gradle between 2.5.4 and 3.0.1 or can we rule out a gradle regression?

Else, this might be an error on the Java side for this specific command/class only. @drneavin: Can you reproduce that positional argument error with any other command?

Also reproducing it with a direct call to java would be useful to exclude the shell wrappers from suspicion.

Unfortunately I am not familiar enough with the gradle built system to git bisect this down to a specific commit. @alecw: If you give me some pointers I'd be happy to give it a shot as I have saved a lot of time thank to both your, and @drneavin's work.

drneavin commented 6 months ago

Thanks @mschilli87! Glad we're both dedicated to getting this to work!

And I think I've sorted it out 😄 . It seems that passing arguments as --VARIABLE <variable> doesn't work with v3.0.1. The parameters have to be passed as VARIABLE=<variable> instead. It should be noted that v2.5.4 allowed for the --VARIABLE passing and that the help options indicate that's how they should be passed still. We should either plan to revert it back to the previous ability or update the docs and the help menu. Probably the latter because it will match the other Drop-seq commands from my understanding

I'll leave this open so we can discuss and decide on the possible options for the parameter passing

alecw commented 6 months ago

Hi @drneavin, I was wondering if that would solve the problem. We still use the old-school OPTION=value style that I won't go into the history of. A bad decision I made ~15 years ago.

Anyway, I think there is another solution if you want to use the --OPTION style: If you put -- immediately after the AssignCellsToSamples and before any of the -OPTIONs, that will tell the shell getopt command not to get involved. I.e.

AssignCellsToSamples -- --CELL_BC_FILE $BARCODES \
            --INPUT_BAM $OUTDIR/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
            --OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
            --VCF $OUTDIR/../../v2.1.0/output/dropulation/invcf.gz \
            --SAMPLE_FILE $INDS \
            --CELL_BARCODE_TAG 'CB' \
            --MOLECULAR_BARCODE_TAG 'UB' \
            --VCF_OUTPUT $DROPULATION_OUTDIR/assignment.vcf \
            --MAX_ERROR_RATE 0.05 \
            --TMP_DIR $DROPULATION_OUTDIR 
drneavin commented 6 months ago

I don't mind either way honestly - I'm happy to move to using the OPTION=value. I just want it to be easy and clear for other users as well. I'll update the docs on my end to reflect this so anyone using it through Demuxafy shouldn't run into any issues. But maybe in the next release the docs could be updated to use the old-school nomenclature in the help sections of each command?

alecw commented 6 months ago

Unfortunately the argument parser is no longer something I control. I think the best solution would be to remove getopts from the wrapper scripts completely. The only really useful thing is to set Java heap size, and we could just tell users to export JAVA_OPTS=-Xmx<value> instead of the current -m

drneavin commented 6 months ago

Ok, no worries. I think the good thing is if anyone comes across this error, they should find it when they search the issues. So should be an easy resolution for anyone else who makes the same mistake I made. Happy for you to close out this issue when you're satisfied

mschilli87 commented 6 months ago

Personally I like not having to set environment variables. But I get your point and either way is fine. I just agree with @drneavin (isn't it super late where you are?) that the documentation should match and the rest is resolved by this issue exisiting on the internet.

ltalignani commented 1 month ago

I'm experiencing the same issue with ReduceGtf and ConvertToRefFlat, version 3.0.2, on Linux Ubuntu 20.04, but not on My Mac on macos 15 Sequoia. For the first program, I'm using the following command:

ReduceGtf --GTF resources/genomes/mus_musculus_39_112/curated_annotation.gtf --OUTPUT resources/genomes/mus_musculus_39_112/curated_reduced_annotation.gtf --SEQUENCE_DICTIONARY resources/genomes/mus_musculus_39_112/genome.dict --ENHANCE_GTF false --TMP_DIR /tmp

And the encountered error:

Illegal argument value: Positional arguments were provided ',resources/genomes/mus_musculus_39_112/curated_annotation.gtf}' but no positional argument is defined for this tool.

I also performed an echo on this command. No hidden characters appear.

For the second one, I'm using the command:

ConvertToRefFlat --ANNOTATIONS_FILE resources/genomes/mus_musculus_39_112/curated_annotation.gtf --OUTPUT resources/genomes/mus_musculus_39_112/curated_annotation.refFlat --SEQUENCE_DICTIONARY resources/genomes/mus_musculus_39_112/genome.dict --TMP_DIR /tmp

And the encountered error:

Illegal argument value: Positional arguments were provided ',resources/genomes/mus_musculus_39_112/curated_annotation.gtf}' but no positional argument is defined for this tool.

alecw commented 1 month ago

Hi @ltalignani ,

Have you tried the work-around I suggested in https://github.com/broadinstitute/Drop-seq/issues/422#issuecomment-2139518165 ?

-Alec

ltalignani commented 1 month ago

Hi @alecw,

I confirm that the -- solution works.