Open drneavin opened 6 months ago
I could assist in testing. I have been using Drop-seq v3.x but I have not used dropulation from that version as I rely on Demuxafy for my droplet detection and demultiplexing needs (which is probably the reason @drneavin is testing out the latest version of Drop-seq/dropulation). 😅
@drneavin , please provide reproduction procedure, i.e. the command line you invoked.
Thanks @mschilli87! Would be great to have help testing this. @alecw , sorry for not posting this before. The command I'm running for v3.0.1 is:
AssignCellsToSamples --CELL_BC_FILE $BARCODES \
--INPUT_BAM $OUTDIR/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
--OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
--VCF $OUTDIR/../../v2.1.0/output/dropulation/invcf.gz \
--SAMPLE_FILE $INDS \
--CELL_BARCODE_TAG 'CB' \
--MOLECULAR_BARCODE_TAG 'UB' \
--VCF_OUTPUT $DROPULATION_OUTDIR/assignment.vcf \
--MAX_ERROR_RATE 0.05 \
--TMP_DIR $DROPULATION_OUTDIR
For v3.0.0, the command that causes the positional error is:
AssignCellsToSamples --version
Let me know if there's anything else that could help with troubleshooting.
If I run
AssignCellsToSamples --version
using Drop-seq v. 3.0.0, I get
AssignCellsToSamples: 40: ./bin/drop-seq_tools/AssignCellsToSamples: Syntax error: "(" unexpected
because of the #!/bin/sh
shebang that resolved to dash
on my Ubuntu system and that version features the
function usage () {
[...]
}
syntax not supported by POSIX shells.
Explicitely calling
bash $(which AssignCellsToSamples) --version
results in the expected
Version:3.0.0
I do remember fixing (or at least: trying to do so) this with a PR. I'll check and report back.
@drneavin: Can you confirm that /bin/sh
is not bash
on your system and explicitely using bash
to call the (wrapper) script fixes the issue?
edit: I was talking about PR #398.
I'm guessing that there is whitespace in one of the shell variables. Try wrapping all you variables in double quotes, i.e.
AssignCellsToSamples --CELL_BC_FILE "$BARCODES" \
--INPUT_BAM "$OUTDIR"/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
--OUTPUT "$DROPULATION_OUTDIR"/assignments.tsv.gz \
--VCF "$OUTDIR"/../../v2.1.0/output/dropulation/invcf.gz \
--SAMPLE_FILE "$INDS" \
--CELL_BARCODE_TAG 'CB' \
--MOLECULAR_BARCODE_TAG 'UB' \
--VCF_OUTPUT "$DROPULATION_OUTDIR"/assignment.vcf \
--MAX_ERROR_RATE 0.05 \
--TMP_DIR "$DROPULATION_OUTDIR"
You'll probably still get an error, but the message should be more informative.
I checked and as expected, my PR #398 fixed the syntax error when using v. 3.0.1:
AssignCellsToSamples --version
USAGE: AssignCellsToSamples [arguments]
For a set of cells, and a VCF File,calculate the most likely sample assignment for each cell. This is done bylooking at
each SNP/sample genotype, and calculating the likelihood of that resultgiven the pileup of reads at that genomic
location for each cell. The log likelihoodsacross all genomic locations for each cell are calculated and summed. Each
cell then has a likelihood across all SNPs for each of the samples in the VCF file. The most likely sampleis assigned
to the cell, and the likelihood ratio of the sample assignment to the next best sample assignment is calculated to give
us a sense of how confident we are in the assignment.
Version:3.0.1
Required Arguments:
--CELL_BC_FILE <File> Override NUM_CORE_BARCODES and process reads that have the cell barcodes in this file
instead. The file has 1 column with no header. Required. Cannot be used in conjunction
with argument(s) NUM_BARCODES
--INPUT_BAM,-I <File> The input SAM or BAM file to analyze. This argument can accept wildcards, or a file with
the suffix .bam_list that contains the locations of multiple BAM files This argument must
be specified at least once. Required.
--NUM_BARCODES <Integer> Number of cells that you want to extract from the BAM. The program will picks the top
<NUM_BARCODES> barcodes by read count. Required. Cannot be used in conjunction with
argument(s) CELL_BC_FILE
--OUTPUT,-O <File> Output file of sample assignments. This supports zipped formats like gz and bz2.
Required.
--VCF <File> The input VCF file to analyze. Required.
Optional Arguments:
--ADD_MISSING_VALUES <Boolean>on a per-snp basis, generate a mxiture of all the data across known samples to fill in
likelihoods for the missing samples. Default value: true. Possible values: {true, false}
--ALLELE_FREQUENCY_ESTIMATE_FILE <File>
A file that contains an estimate of the allele frequency expected for each SNP across
donors. The best estimate of this will come from the fraction of reference and alternate
alleleUMIs that are observed at each snp site. This report can be generated via
GatherDigitalAlleleCounts. This is a fractional estimate between 0 and 1. File is tab
seperated, with at least 3 columns:chromosome, position, maf_umi. When supplied and
CELL_CONTAMINATION_ESTIMATE_FILE is provided, this modifies the likelihood error rates to
take into account how often the allele observed can be drawn from ambient RNA. Default
value: null.
--ANSWER_KEY_FILE <File> This file provides an answer key for the input BAM and VCF. This does not change any of
the likelihood analysis results, but appends an additional column to the output that
contains the known sample assignment for a cell.This is useful when assessing how well the
method works, but is completely optional. The format of the file is 2 tab seperated
columns with a header [cell sample]. The first column contains the cell barcodes from
the BAM, the second the sample assignments from the VCF. Default value: null.
--arguments_file <File> read one or more arguments files and add them to the command line This argument may be
specified 0 or more times. Default value: null.
--BAM_OUTPUT <File> Output a version of the BAM containing only the reads that have coverage in the input BAM
and VCF. Default value: null.
--CELL_BARCODE_TAG <String> The cell barcode tag. If there are no reads with this tag, the program will assume that
all reads belong to the same cell and process in single sample mode. Default value: XC.
--CELL_CONTAMINATION_ESTIMATE_FILE <File>
A file that contains an estimate of how much ambient RNA is in each cell. This is a
fractional estimate between 0 and 1. File is tab seperated, with 2 columns:cell_barcode
and frac_contamination. When supplied along side the ALLELE_FREQUENCY_ESTIMATE_FILE, this
modifies the likelihood error rates to take into account how oftenthe allele observed can
be drawn from ambient RNA. We use cellbender remove background
[https://github.com/broadinstitute/CellBender] to estimate the number of transcripts
before and after ambient cleanup to define the fraction of transcripts that come from
ambient RNA. Default value: null.
--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF). Default value: 5.
--CREATE_INDEX <Boolean> Whether to create an index when writing VCF or coordinate sorted BAM output. Default
value: false. Possible values: {true, false}
--CREATE_MD5_FILE <Boolean> Whether to create an MD5 digest for any BAM or FASTQ files created. Default value:
false. Possible values: {true, false}
--DNA_MODE <Boolean> EXPERIMENTAL!!! Run the program in DNA Mode. In this mode, reads should have a cell
barcode, but will be missing gene annotations and UMIs. All reads will be accepted as
passing, and each read (or read pair) will be treated as a single UMI If the data is PCR
Duplicate marked, duplicate reads will be filtered. Default value: false. Possible
values: {true, false}
--EDIT_DISTANCE <Integer> The edit distance that molecular barcodes should be combined at within a gene/SNP.
Default value: 1.
--FIXED_ERROR_RATE <Double> Instead of useing base qualities to determine error rate, use a fixed error rate instead.
This is rounded to the nearest phread score internally. Default value: null.
--FRACTION_SAMPLES_PASSING <Double>
At least <FRACTION_SAMPLES_PASSING> samples must have genotype scores >= GQ_THRESHOLD for
the variant in the VCF to be included in the analysis. Default value: 0.5.
--FUNCTION_TAG <String> The functional annotation for the read. If set, extracts the functional annotation(s)
[CODING/UTR/etc] at each SNP position and outputs in the verbose output. Default value:
XF.
--FUNCTIONAL_STRATEGY <FunctionalDataProcessorStrategy>
A strategy for interpreting functional annotations. DropSeq is the default strategy.
STARSOLO strategy priority is very similar to DropSeq, exceptin cases where a read
overlaps both an intron on the sense strand and a coding region on the antisense strand.
In these cases, DropSeq favors the intronic interpretation, while STARSolo interprets this
as a technical artifact and labels the read as coming from the antisense coding gene, and
the read does not contribute to the expression counts matrix. Default value: DROPSEQ.
Possible values: {DROPSEQ, STARSOLO}
--GENE_FUNCTION_TAG <String> Gene Function tag. For a given gene name <GENE_NAME_TAG>, this is the function of the
gene at this read's position: UTR/CODING/INTRONIC/... Default value: gf.
--GENE_NAME_TAG <String> Gene Name tag. Takes on the gene name this read overlaps (if any) Default value: gn.
--GENE_STRAND_TAG <String> Gene Strand tag. For a given gene name <GENE_NAME_TAG>, this is the strand of the gene.
Default value: gs.
--GQ_THRESHOLD <Integer> The minimum genotype quality for a variant. Set this value to 0 to not filter by GQ
scores if they are present, or to -1 to completely ignore GQ values if they are not set in
the genotype info field. If the GQ field is not set in the VCF header, this will be set
to -1 by default. Default value: 30.
--help,-h <Boolean> display the help message Default value: false. Possible values: {true, false}
--IGNORED_CHROMOSOMES <String>A list of chromosomes to omit from the analysis. The default is to omit the sex
chromosomes. This argument may be specified 0 or more times. Default value: [X, Y, MT].
--LOCUS_FUNCTION_LIST <LocusFunction>
A list of functional annotations that reads need to be completely contained by to be
considered for analysis. This argument may be specified 0 or more times. Default value:
[CODING, UTR]. Possible values: {INTERGENIC, INTRONIC, UTR, CODING, RIBOSOMAL}
--MAX_ERROR_RATE <Double> Caps the base error rate at a maximum probability so no SNP can be weighed more than this
value. For example, if this value was 0.01, then a base quality 30 value (normally an
erro rate of 0.001) would become 0.01. With the same threshold, a base with an error rate
of 0.1 would be unaffected. Default value: null.
--MAX_RECORDS_IN_RAM <Integer>When writing files that need to be sorted, this will specify the number of records stored
in RAM before spilling to disk. Increasing this number reduces the number of file handles
needed to sort the file, and increases the amount of RAM needed. Default value: 500000.
--MOLECULAR_BARCODE_TAG <String>
The molecular barcode tag. Default value: XM.
--QUIET <Boolean> Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}
--READ_MQ <Integer> The map quality of the read to be included. Default value: 10.
--REFERENCE_SEQUENCE,-R <File>Reference sequence file. Default value: null.
--RETAIN_MONOMORPIC_SNPS <Boolean>
Should monomorphic SNPs across the population be retained? Default value: false. Possible
values: {true, false}
--SAMPLE_FILE <File> A file with a list of samples in the VCF to match up to cells in the BAM. This subsets
the VCF into a smaller data set containing only the samples listed. The file has 1 column
with no header. Default value: null.
--SNP_LOG_RATE <Integer> Logging interval for SNP sequence pileup processing Default value: 1000.
--STRAND_STRATEGY <StrandStrategy>
The strand strategy decides which reads will be used by analysis. The SENSE strategy
requires the read and annotation to have the same strand. The ANTISENSE strategy requires
the read and annotation to be on opposite strands. The BOTH strategy is permissive, and
allows the read to be on either strand. Default value: SENSE. Possible values: {SENSE,
ANTISENSE, BOTH}
--TMP_DIR <File> One or more directories with space available to be used by this program for temporary
storage of working files This argument may be specified 0 or more times. Default value:
null.
--TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD <Integer>
(Optional) If set to a value, this program will quit early with an exception if this
number of UMIs are observed without encountering a transcribed SNP - that is, a variant
from the VCF file that is will be informative during donor assignment. A value of -1
disables this test. As some fraction of UMIs may notcontain transcribed SNPs, the value
for this should probably be set to a fairly large number, like 10% of the total UMIs in
your experiment. Mostly useful for debugging / or testing new data sets. Default value:
-1.
--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
Use the JDK Deflater instead of the Intel Deflater for writing compressed output Default
value: false. Possible values: {true, false}
--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
Use the JDK Inflater instead of the Intel Inflater for reading compressed input Default
value: false. Possible values: {true, false}
--VALIDATION_STRINGENCY <ValidationStringency>
Validation stringency for all SAM files read by this program. Setting stringency to
SILENT can improve performance when processing a BAM file in which variable-length data
(read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT.
Possible values: {STRICT, LENIENT, SILENT}
--VCF_OUTPUT <File> Output a version of the VCF containing only the variants that have coverage in the input
BAM. All samples are retained. This file can be used as the VCF input file on subsequent
runs of the same data set to greatly speed up the run, if the same BAM is being analyzed
with different conditions. If you plan on calling doublets with DetectDoublets, you'll
need this file. Default value: null.
--VERBOSE_BEST_DONOR_OUTPUT <File>
Verbose output of every cell/SNP result for the BEST donor. This supports zipped formats
like gz and bz2. Default value: null.
--VERBOSE_OUTPUT <File> Verbose output of every cell/SNP result. This supports zipped formats like gz and bz2.
Default value: null.
--VERBOSITY <LogLevel> Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}
--version <Boolean> display the version number for this tool Default value: false. Possible values: {true,
false}
Advanced Arguments:
--showHidden <Boolean> display hidden arguments Default value: false. Possible values: {true, false}
Argument INPUT_BAM was missing: Argument 'INPUT_BAM' is required
So while the version is reported now, the script does not exit early but instead continues which triggers the 'missing argument' error.
@drneavin: Could you just echo
your command to see all parameters expanded? Your error looks like there is some failed paramter expansion somehwere.
Hi both,
For v3.0.0, @mschilli87, I think that makes sense. I don't have the singularity image with that version anymore but I don't see the same behaviour with 3.0.1 so I think your PR did fix that.
For v3.0.1, unfortunately, the double quotes didn't help - I still get the same error. And when I echo the command I don't see anything that's clearly incorrect:
AssignCellsToSamples --CELL_BC_FILE barcodes.tsv --INPUT_BAM ../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam --OUTPUT dropulation/assignments.tsv.gz --VCF ../../v2.1.0/output/dropulation/invcf.gz --SAMPLE_FILE donor_list.txt --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --VCF_OUTPUT dropulation/assignment.vcf --MAX_ERROR_RATE 0.05 --TMP_DIR dropulation
Note that I edited this to remove complete paths from our HPC for the barcode and the sample file.
I also did a test where I set all the variables and ran using Drop-seq v2.5.4 from a previous singularity image which ran successfully and then changed just the singularity image to the one with Drop-seq v3.0.1 and resulted in the same positional argument error. This suggests to me that there isn't an issue with the variables unless the variable handling has been changed and it is more sensitive than before? Definitely open to suggestions and other interpretations.
I also changed the order of the options (ie moved --CELL_BC_FILE barcodes.tsv
to later in the command) and found that it always complains about the first parameter set. So once --CELL_BC_FILE barcodes.tsv
was moved to later in the command, then it complains about the bam file instead of the barcode file:
Illegal argument value: Positional arguments were provided ',/directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/workflow_testing/v3.0.0/output/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam}' but no positional argument is defined for this tool.
This is the Demuxafy image with Drop-seq 3.0.1 installed if you wanted to test on your end to see if the result is the same or if you notice anything different about the setup or execution that are causing this problem:
wget -O Demuxafy.sif 'https://www.dropbox.com/scl/fi/owqniywk63nknynj6nrez/Demuxafy.sif?rlkey=nt1m99pvcaqjjho2wodjhx80u&st=eco4uozg'
I am still suspecting an unescaped parameter expansion since the argument in the error message starts with a ,
and ends with a }
. Not sure if this is on @drneavin's end or maybe in the wrapper shell script (in which case it might have been caused by my changes to the template). I'll try to find out and report back soon.
Yes, I noticed that as well (the comma and the brace). But I double checked all my variables and none of them have a comma or a brace in them. Let me know if there's something else I can check? Would be great to get this working so I can include v3.0.1 in the next version of Demuxafy
@drneavin: I would lime that very much so you definitely have me motivated to help. 😉
I had a look at the 3.0.1 shell scripts and I don't see any place where the AssignCellsToSamples
wrapper is messing with arguments. AFAICT they are passed straight down to the dropseq
wrapper. This one looks good as well AFAICT, but it does do some argument handling via shift
/set
. If there was anything wrong there, it would be a gradle bug though.
@alecw: Did you update gradle between 2.5.4 and 3.0.1 or can we rule out a gradle regression?
Else, this might be an error on the Java side for this specific command/class only. @drneavin: Can you reproduce that positional argument error with any other command?
Also reproducing it with a direct call to java
would be useful to exclude the shell wrappers from suspicion.
Unfortunately I am not familiar enough with the gradle built system to git bisect
this down to a specific commit. @alecw: If you give me some pointers I'd be happy to give it a shot as I have saved a lot of time thank to both your, and @drneavin's work.
Thanks @mschilli87! Glad we're both dedicated to getting this to work!
And I think I've sorted it out 😄 . It seems that passing arguments as --VARIABLE <variable>
doesn't work with v3.0.1. The parameters have to be passed as VARIABLE=<variable>
instead. It should be noted that v2.5.4 allowed for the --VARIABLE
passing and that the help options indicate that's how they should be passed still. We should either plan to revert it back to the previous ability or update the docs and the help menu. Probably the latter because it will match the other Drop-seq commands from my understanding
I'll leave this open so we can discuss and decide on the possible options for the parameter passing
Hi @drneavin, I was wondering if that would solve the problem. We still use the old-school OPTION=value style that I won't go into the history of. A bad decision I made ~15 years ago.
Anyway, I think there is another solution if you want to use the --OPTION style: If you put --
immediately after the AssignCellsToSamples and before any of the -OPTIONs, that will tell the shell getopt command not to get involved. I.e.
AssignCellsToSamples -- --CELL_BC_FILE $BARCODES \
--INPUT_BAM $OUTDIR/../../v2.1.0/output/dropulation/possorted_genome_bam_dropulation_tag.bam \
--OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
--VCF $OUTDIR/../../v2.1.0/output/dropulation/invcf.gz \
--SAMPLE_FILE $INDS \
--CELL_BARCODE_TAG 'CB' \
--MOLECULAR_BARCODE_TAG 'UB' \
--VCF_OUTPUT $DROPULATION_OUTDIR/assignment.vcf \
--MAX_ERROR_RATE 0.05 \
--TMP_DIR $DROPULATION_OUTDIR
I don't mind either way honestly - I'm happy to move to using the OPTION=value. I just want it to be easy and clear for other users as well. I'll update the docs on my end to reflect this so anyone using it through Demuxafy shouldn't run into any issues. But maybe in the next release the docs could be updated to use the old-school nomenclature in the help sections of each command?
Unfortunately the argument parser is no longer something I control. I think the best solution would be to remove getopts
from the wrapper scripts completely. The only really useful thing is to set Java heap size, and we could just tell users to export JAVA_OPTS=-Xmx<value>
instead of the current -m
Ok, no worries. I think the good thing is if anyone comes across this error, they should find it when they search the issues. So should be an easy resolution for anyone else who makes the same mistake I made. Happy for you to close out this issue when you're satisfied
Personally I like not having to set environment variables. But I get your point and either way is fine. I just agree with @drneavin (isn't it super late where you are?) that the documentation should match and the rest is resolved by this issue exisiting on the internet.
I'm experiencing the same issue with ReduceGtf
and ConvertToRefFlat
, version 3.0.2, on Linux Ubuntu 20.04, but not on My Mac on macos 15 Sequoia. For the first program, I'm using the following command:
ReduceGtf --GTF resources/genomes/mus_musculus_39_112/curated_annotation.gtf --OUTPUT resources/genomes/mus_musculus_39_112/curated_reduced_annotation.gtf --SEQUENCE_DICTIONARY resources/genomes/mus_musculus_39_112/genome.dict --ENHANCE_GTF false --TMP_DIR /tmp
And the encountered error:
Illegal argument value: Positional arguments were provided ',resources/genomes/mus_musculus_39_112/curated_annotation.gtf}' but no positional argument is defined for this tool.
I also performed an echo
on this command. No hidden characters appear.
For the second one, I'm using the command:
ConvertToRefFlat --ANNOTATIONS_FILE resources/genomes/mus_musculus_39_112/curated_annotation.gtf --OUTPUT resources/genomes/mus_musculus_39_112/curated_annotation.refFlat --SEQUENCE_DICTIONARY resources/genomes/mus_musculus_39_112/genome.dict --TMP_DIR /tmp
And the encountered error:
Illegal argument value: Positional arguments were provided ',resources/genomes/mus_musculus_39_112/curated_annotation.gtf}' but no positional argument is defined for this tool.
Hi @ltalignani ,
Have you tried the work-around I suggested in https://github.com/broadinstitute/Drop-seq/issues/422#issuecomment-2139518165 ?
-Alec
Hi @alecw,
I confirm that the --
solution works.
Hello,
I'm having trouble getting dropulation in Drop-seq v3.0.0 and v3.0.1 to work. For v3.0.0 I receive the errors when I try to run:
and for v3.0.1 I get the following error:
I'm not sure if this is an issue on my end or is an issue that anyone trying to use these functions will come across. I'm using java v21 and building in a singularity image - Ubuntu 20.04. Happy to provide more details to help resolve the issue.