Raise maximum barcode length

kaukrise commented 4 years ago

Hi,

first of all congratulations on a great tool, the expansion to scRNA-seq analysis is especially appreciated!

I was wondering what the reason for setting an upper limit on the barcode length in alevin is - would longer barcodes affect the computation in some manner? We are working with barcodes of length 27, which are incompatible with the hardcoded upper barcode length limit here.

I manually raised the limit on a modified alevin version, and the final output looks as expected, so if there is no risk that I am unaware of, would you consider raising or removing the barcode length limit altogether?

Thank you for you help!

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? alevin

Describe the bug Using the manual barcode and UMI specification with --end, --barcodeLength, and --umiLength fails for barcodes longer than 20 with the error message:

Barcode length (27) was not in the required length range [1, 20].

The barcode length upper limit is hardcoded here.

To Reproduce In Salmon 1.0.0, run salmon alevin [...] --end 5 --barcodeLength 27 --umiLength 8 (or any barcodeLength value above 20).

Which version of salmon was used? 1.0.0
How was salmon installed (compiled, downloaded executable, through bioconda)? bioconda
Which reference (e.g. transcriptome) was used? not relevant
Which read files were used? not relevant
Which which program options were used? --end 5 --barcodeLength 27 --umiLength 8

Expected behavior Ideally, barcode longer than 20 would be processed as normal.

Desktop (please complete the following information):

OS: [e.g. Ubuntu Linux, OSX] Mac OS X
Version [ If you are on OSX, the output of sw_vers. If you are on linux the output of uname -a and lsb_release -a] 10.14.6 (18G103)

Additional context See top of post

k3yavi commented 4 years ago

Hi @kaukrise ,

Thanks for the very interesting question. I don't think there is any theoretical limit wrt the alevin's method, however, it would be interesting to check how does alevin performs when we increase the CB length wrt the running time. The 20 length bound was just for sanity checking and can be increased, like you already did. I'd be very interested, if possible, in hearing back about your experience with alevin using longer length CB both wrt running time and gene expression estimates generated. Also if I may ask what's the reason behind using this long CB ? Are you expecting tons of real cells, if there is we can think about improving alevin even more, in my experience, we have generally seen individual 10x experiment with ~20k cells max. Even the 1.3M dataset is 164 separate experiments.

kaukrise commented 4 years ago

Thank you for the swift answer!

We are working with BD Rhapsody, which uses a complex barcode structure (you can read about this in their bioinformatics handbook on page 14). The extracted, combined CB is 27bp long, which is why the default sanity check was too low for our purposes.

In terms of cell numbers, BD Rhapsody appears to generate a lot of "false-positive cells", actually (we are seeing up to 90% of false positives). This is expected, and also mentioned in their bioinformatics handbook (pages 23-25), but appears to be an issue for the alevin cell detection: with standard settings this is approximately two orders of magnitude lower than expected, --expectCells improves matters drastically, however. We have opted for removing the false positives in post-processing ourselves - the low count depth population is very easily identifiable.

In terms of performance, a complete alevin run on 150M reads (25k expected cells) takes around 1.5 hours using 10 threads, which is perfectly reasonable for us.

k3yavi commented 4 years ago

My apologies for the late reply, somehow I missed the reply. I am glad to hear that and thanks for testing alevin with BD Rhapsody. Let us know if you need help with anything else, we'd be happy to help.

Closing this issue but feel free to reopen.

gringer commented 3 years ago

Could you please increase the maximum barcode length, as per this issue [i.e. here]? Salmon is still complaining when I use a converted file with the linker sequences spliced out, having a concatenated cell barcode length of 27, and there doesn't seem to be any user-definable option to modify or ignore the limit:

gringer@elegans:/mnt/ufds/jmayer$ salmon alevin -l ISR -1 normalised_H2GYLDRXY_1_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R1.fastq.gz normalised_H2GYLDRXY_2_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R1.fastq.gz -2 H2GYLDRXY_1_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R2.fastq.gz -2 H2GYLDRXY_2_210203_FD09251586_Other_CGAGGCTG_R_210203_DAVGAL_INDEXLIBNOVASEQ_M001_R2.fastq.gz -i /mnt/ufds/salmon/gencode_M23/salmon_1.4.0_decoy_M23 -p 10 -o salmon_1.4_5_27_8_JM_2021-02-12 --tgMap txp2gene.txt --end 5 --barcodeLength 27 --umiLength 8
[2021-02-12 11:01:37.654] [alevinLog] [warning] Note: the use of --end, --barcodeLength and --umiLength to describe the barcode and umi geometry is deprecated. Please start using the `--barcode-geometry` and `--umi-geometry` options instead.
[2021-02-12 11:01:37.655] [alevinLog] [error] Barcode length (27) was not in the required length range [1, 20].
Exiting now.

gringer commented 3 years ago

As an alternative, documenting the geometry format for specifying custom barcodes would be helpful. This seems to avoid the barcode length issue.

From what I can tell, the format is <readNum>[start-end], i.e. for my case:

--umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'

rob-p commented 3 years ago

Hi @gringer,

Yes, we can add a section for this in the docs. It will replace the old way for specifying geometry soon, as its just easier and more flexible. We talk about it in the 1.4.0 release notes. I copy the relevant info below (@k3yavi pulled for the 1-based indexing and won out ... this time):

generic barcode / umi / read geometry syntax : Alevin learned to support a generic syntax to specify the read sequence that should be used for barcodes, UMIs and the read sequence. The syntax allows one to specify how the pattern corresponding to the barcode, UMI, and read sequence should be pieced together, and the syntax is meant to be intuitive and general. For example, one can specify the 10Xv2 geometry in the following manner using the generic syntax:

--read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

This specifies that the "sequence" read (the biological sequence to be aligned) comes from read 2, and it spans from the first index 1 (this syntax used 1-based indexing) until the end of the read. Likewise, the barcode derives from read 1 and occupies positions 1-16, and the UMI comes from read 1 and occupies positions 17-26. The syntax can specify multiple ranges, and they will simply be concatenated together to produce the string. For example, one could specify --bc-geometry 1[1-8,16-23] to designate that the barcode should be taken from the substring in positions 1-8 of read 1 followed by the substring in positions 16-23 of read 1. It is even possible to have the string pieced together across both reads, but that functionality is only available if you are running with --rad or --sketch and preparing a RAD file for alevin-fry. If you are running classic alevin, the barcode must reside on a single read. The robust parsing of the flexible geometry syntax is made possible by the cpp-peglib project.

gringer commented 3 years ago

Oh, excellent, thanks. The multi-range format will be useful for Rhapsody reads.

COMBINE-lab / salmon

Raise maximum barcode length #445