Open mdeloger opened 4 years ago
Dear developers, I am experiencing the same problem (with 200+ GB of RAM). Do you have any recommendations for us? Thank you!
Hello again, I’m adding some details since I am using data produced with BD Rhapsody and I think it would be great for all users if kallisto | bustools could support that format too. BD Rhapsody has quite long cell barcodes (52bp), composed of predefined sequences: [9nt CLS1] + [12nt linker1] + [9nt CLS2] + [13nt linker2] + [9nt CLS3]. The linker sequences are constant, the CLS sequences can vary in a defined manner (97 possible sequences for each CLS, for a total of ~900k different combinations). The Teichmann Lab explained it here nicely: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html
I have about 80GB of fastqs, and wanted to use kallisto + bustools to obtain the expression count matrix. I first tried with the kb_python wrapper. I built the reference without issues (kb ref'), then I run the ‘kb count’ step, providing the whitelist of 900k+ barcodes and specifying as chemistry '-x 0,0,52:0,52,60:1,0,0' (cell barcodes from 1st to 52th base of R1, UMIs from 53th to 60th base of R1, transcriptomic sequences in whole R2). Apparently it worked (last line in the log was “INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus”), but it detected only ~9000 barcodes (we expect more than 50k). I inspected the barcodes produced in “counts_unfiltered/cells_x_genes.barcodes.txt”, and none of these actually are from the whitelist I provided. They were somehow rearranged in this fashion: [last 9nt from linker1] + [CLS2] + GG + [CLS1] + [linker1] + [CLS2] + GG
I tried to re-execute only the barcode correction step with 'bustools correct'. I had used the '--keep-tmp' flag before, and used as input the sorted BUS file 'output.s.bus' from the tmp folder. The command immediately exited with the 'core dumped' error specified above. Also, the number of barcodes found in my whitelist is somehow miscalculated (”Found 9409 barcodes in the whitelist”, when there are 912673).
I tried repeating this with only 10 barcodes and 10 reads from the fastq files, and had the same issues both with 'kb count' (I obtain 10 inexistent barcodes in counts_unfiltered/cells_x_genes.barcodes.txt) and 'bustools correct' (wrong number of barcodes counted from the whitelist + core dumped error).
Versions of tools: kb_python 0.24.4 kallisto 0.46.2 bustools 0.40.0
Sorry for this lengthy message, I hope this can be somehow useful to address the issue. Thank you for these tools and for the time you take to adapt them to the existing technologies!
Hi, I just added an option to kallisto bus
(in the development branch of the kallisto repository, )https://github.com/pachterlab/kallisto/tree/devel which enables custom extraction of barcodes and UMI for the BD Rhapsody WTA assay based on the read structure linked here: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html. You can run it with kallisto bus -x BDWTA ... R1.fastq.gz R2.fastq.gz
where the barcodes and umi are in R1 and the cDNA is in R2.
What I've done is extracted just the relevant parts of R1, ie only the barcode sequences (ignoring the linkers). This results in a 9+9+9=27 base barcode which bustools can handle.
Please try that out but keep in mind this is still in development and will need to be tested on real data. Note, the whitelist will need to be made in a similar way (removing the linkers).
(edit: This is not an issue with RAM. kallisto bus
and bustools
encode each base (A, T, G, C
) with 2 bits therefore the maximum length barcode, on a 62 bit processor, is 31 bases.)
Dear Sina, amazing, thank you very much! I somehow missed this reply notification, I will test this as soon as possible and get back to you. @mdeloger have you by any chance tested Sina's improvement?
Best, Simo
Hi again @sbooeshaghi,
I tried starting from scratch, cloning the repository and checking out to the devel branch, and then I followed the installation steps (from INSTALL.md). When I run make
, I get the following error:
[ 33%] Building CXX object src/CMakeFiles/kallisto_core.dir/main.cpp.o
/resources/software/kallisto/src/main.cpp: In function ‘bool CheckOptionsBus(ProgramOptions&)’:
/resources/software/kallisto/src/main.cpp:1213:20: error: ‘struct BUSOptionSubstr’ has no member named ‘push_back’
busopt.umi.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13 + 9, 9 + 12 + 9 + 13 + 9 + 8)); // umi
^
make[2]: *** [src/CMakeFiles/kallisto_core.dir/main.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/kallisto_core.dir/all] Error 2
make: *** [all] Error 2
This does not happen if I compile it in the master branch. The error seems a common one (http://cplusplus.com/forum/beginner/203789/, https://stackoverflow.com/questions/57138792/why-does-it-queue-has-no-member-named-push-back), but as a C++ illiterate, I'd be very grateful if you could take care of it (or if you can point out what I'm doing wrong).
Thanks ahead and congrats on the publication!!
Ah I know what the issue is. I should have been a bit more thorough in compiling kallisto to make sure that UMIs can come from multiple files, my mistake. The bustools UMI option currently does not support UMIs from multiple files, however this should not be a hard improvement to make. The reason you are getting this error is because the BUS option for the UMI is not a vector (with the push_back operation) unlike the Barcode and Read.
This snippet in the kallisto repo needs to be fixed along with the instances of reading the UMI from the bustools opt. This shouldn't be too hard to do, I'll get back to you.
struct BUSOptions {
int nfiles;
BUSOptionSubstr umi; // <-- This line here
std::vector<BUSOptionSubstr> bc;
std::vector<BUSOptionSubstr> seq;
int getBCLength() const {
int r =0 ;
if (!bc.empty()) {
for (auto& b : bc) {
if (b.start < 0) {
return 0;
} else {
r += b.stop - b.start;
}
}
}
return r;
}
int getUMILength() const {
if (umi.start >= 0) {
return umi.stop - umi.start;
} else {
return 0;
}
}
};
Hi @sbaghai and @mdeloger,
@yenaled has added the ability for the UMI to come from multiple files in a pull-request to the kallisto repo: https://github.com/pachterlab/kallisto/pull/301
You can install kallisto from his fork (https://github.com/Yenaled/kallisto/tree/smart-seq3) and test it out. The only modification to his code you'll need to make is to add the BDWTA
option in the src/main.cpp
as follows (code taken from https://github.com/pachterlab/kallisto/blob/b47fe2d65185eb5388eb819b26404f23eb671890/src/main.cpp#L1205)
else if (opt.technology == "BDWTA")
{
busopt.nfiles = 2;
busopt.bc.push_back(BUSOptionSubstr(0, 0, 9)); // bc1 CLS1
// busopt.bc.push_back(BUSOptionSubstr(0,9,9+12)); // linker
busopt.bc.push_back(BUSOptionSubstr(0, 9 + 12, 9 + 12 + 9)); // bc2 CLS2
// busopt.bc.push_back(BUSOptionSubstr(0,9+12+9,9+12+9+13)); // linker
busopt.bc.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13, 9 + 12 + 9 + 13 + 9)); // bc3 CLS3
busopt.umi.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13 + 9, 9 + 12 + 9 + 13 + 9 + 8)); // umi
busopt.seq.push_back(BUSOptionSubstr(1, 0, 0));
}
I'm also going to transfer this issue to the kallisto repo as it belongs there.
(Edit) Issues cannot be transferred between organizations.
In theory, you don't even need to modify the code, you could just use the following option when running kallisto bus with my kallisto fork linked to above:
-x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0
Edit: Just added another commit -- using -x BDWTA should work now!
Quick question can we have multiple fragment files with a shared pool of UMIs with different read pairs with this modification?
@cnk113
Not sure I understand your question. Can you give an example of what you mean?
So I have custom library where I actually have internal reads that have a CB + UMI that match the 3' CB UMI from the same molecule. It's just the read pairs are split and I was wondering if kallisto can theoretically map multiple different reads with the same UMI.
I see, I think I understand. For each file (the internal reads file and the 3' reads file), does the UMI sequence and the CB sequence occur at the same position (e.g. first 16 bp's)? If so, I'd recommend combining the files together and running kallisto bus. If the files don't have the UMI+CB sequences occurring at the same position, I'd recommend modifying the fastq files so that they do occur at the same position. Then it should work.
kallisto handles multiple files where a single barcode and a single UMI sequence are distributed on the same line across the multiple files. It does not process multiple files in the manner you have described (i.e. the lines aren't paired up) -- in the case you described, those separate files are considered separate experiments and you need to combine those files together (so kallisto treats it as a single experiment).
I hope that makes sense!
Yeah, just to clarify, kallisto can take/leverage multiple different reads across the transcript for the same UMI to quantify the most likely transcript instead of using a singe read? This is all from a single experiment and sample.
^Correct.
Hi,
I've been having a similar problem with rhapsody data where bustools only detects ~9000 barcodes from my whitelist, at which point the programme always throws up:
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc
during the bustools correct step (I used the kb count wrapper to execute the whole lot of bustools commands in one go).
Unfortunately, I can't get the fork from @Yenaled installed as I am working on a cluster and was wondering whether there were any updates in terms of support for rhapsody datasets?
^can you show us what command you're running and the full output?
Have you tried it with -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0
?
I tried with -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0
and it still didn't work (I'm using kb_python, not the fork that was suggested because I haven't yet figured out how to install it onto my cluster yet (not sure if I have permission to).
The full command is
kb count --loom \
-w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt \
-i /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx \
-o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ \
-x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0 \
-t 11 \
--workflow lamanno \
--filter \
--verbose \
--overwrite \
--keep-tmp \
-g /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/t2g.txt \
-c1 /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/spliced_t2c.txt \
-c2 /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/unspliced_t2c.txt \
/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
And my output is:
[2021-05-10 19:02:07,781] DEBUG Printing verbose output
[2021-05-10 19:02:07,781] DEBUG kallisto binary located at /gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/kall
isto/kallisto
[2021-05-10 19:02:07,781] DEBUG bustools binary located at /gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bust
ools/bustools
[2021-05-10 19:02:07,781] DEBUG Creating /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp directory
[2021-05-10 19:02:07,790] DEBUG Namespace(bustools='/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bustools/bu
stools', c1='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/spliced_t2c.txt', c2='/well/jknight/Andrew/projects/SI_pilot_rhapsody_1712202
0/kb_python_GRCh38_gencodev37/unspliced_t2c.txt', cellranger=False, command='count', dry_run=False, fastqs=['/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/
WTCHG_821954_AN010_1.fastq.gz', '/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz'], filter='bustools', g='/well/jknight/Andrew/
projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/t2g.txt', h5ad=False, i='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/i
ndex.idx', kallisto='/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/kallisto/kallisto', keep_tmp=True, lamanno=F
alse, list=False, loom=True, m='4G', mm=False, no_inspect=False, no_validate=False, nucleus=False, o='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_
bus_velocity/', overwrite=True, report=False, t=11, tcc=False, tmp=None, verbose=True, w='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/
rhapsodyv1_whitelist.txt', workflow='lamanno', x='0,0,9,0,21,30,0,43,52:0,52,60:1,0,0')
[2021-05-10 19:02:07,872] INFO Using index /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx to generate BUS file to /well/jkni
ght/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ from
[2021-05-10 19:02:07,872] INFO /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz
[2021-05-10 19:02:07,872] INFO /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:02:07,872] DEBUG kallisto bus -i /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx -o /well/jknight/Andrew/projec
ts/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0 -t 11 /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCH
G_821954_AN010_1.fastq.gz /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:47:36,002] DEBUG
[2021-05-10 19:47:36,002] DEBUG [index] k-mer length: 31
[2021-05-10 19:47:36,002] DEBUG [index] number of targets: 1,369,325
[2021-05-10 19:47:36,002] DEBUG [index] number of k-mers: 1,566,249,881
[2021-05-10 19:47:36,002] DEBUG [index] number of equivalence classes: 12,659,999
[2021-05-10 19:47:36,002] DEBUG [quant] will process sample 1: /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz
[2021-05-10 19:47:36,002] DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:47:36,002] DEBUG [quant] finding pseudoalignments for the reads ... done
[2021-05-10 19:47:36,002] DEBUG [quant] processed 533,754,959 reads, 401,309,352 reads pseudoaligned
[2021-05-10 19:48:13,947] DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/output.bus passed validation
[2021-05-10 19:48:13,948] INFO Sorting BUS file /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/output.bus to /well/jknight/Andrew/pro
jects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:48:13,948] DEBUG bustools sort -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus -T /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp -t 11 -m 4G /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/outpu
t.bus
[2021-05-10 19:50:01,391] DEBUG Read in 401309352 BUS records
[2021-05-10 19:50:06,076] DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus passed validation
[2021-05-10 19:50:06,077] INFO Inspecting BUS file /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:50:06,077] DEBUG bustools inspect -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/inspect.json -w /well/jknight/Andre
w/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt -e /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velo
city/matrix.ec /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:52:28,983] INFO Correcting BUS records in /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus to /well/jkn
ight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus with whitelist /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_pyth
on_GRCh38_gencodev37/rhapsodyv1_whitelist.txt
[2021-05-10 19:52:28,983] DEBUG bustools correct -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight
/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_v
elocity/tmp/output.s.bus
[2021-05-10 19:52:29,267] DEBUG Found 9216 barcodes in the whitelist
[2021-05-10 19:52:29,268] DEBUG terminate called after throwing an instance of 'std::bad_alloc'
[2021-05-10 19:52:29,268] DEBUG what(): std::bad_alloc
[2021-05-10 19:52:29,275] ERROR Found 9216 barcodes in the whitelist
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[2021-05-10 19:52:29,275] ERROR An exception occurred
Traceback (most recent call last):
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/main.py", line 866, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/main.py", line 208, in parse_count
count_velocity(
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/count.py", line 1541, in count_velocity
correct_result = bustools_correct(
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/validate.py", line 112, in inner
results = func(*args, **kwargs)
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/count.py", line 439, in bustools_correct
run_executable(command)
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/dry/__init__.py", line 24, in inner
return func(*args, **kwargs)
File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/utils.py", line 237, in run_executable
raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bustools/bustools correct
-o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_pyt
hon_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus' died with <Signals.SIGAB
RT: 6>.
Hmm, this is a mystery to me.
Can you try running the "bustools correct" command on its own like:
bustools correct -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
(P.S. Hopefully the tmp/ directory is still there; if not, you might need to use the --keep-tmp argument when you first run kb count)
When I try running the bustools correct command as you've suggested I get:
Found 9216 barcodes in the whitelist
Number of hamming dist 1 barcodes = 340992
Error: barcode length and whitelist length differ, barcodes = 27, whitelist = 52
check that your whitelist matches the technology used
The 9216 barcodes found in the whitelist is consistent with the error that I see when I run kb count.
When I try to run kb count with -x BDWTA
it says the technology is not found. Is it an issue of versions that I'm having? The version of kb_python I have is >>> print(kb.__version__) 0.26.0
Ah, I think I see the problem now! The BD Rhapsody has 52 bp cell barcodes: [9nt CLS1] + [12nt linker1] + [9nt CLS2] + [13nt linker2] + [9nt CLS3].
However, kallisto can only handle barcodes of up to 32 bp in length, which is why we remove the linker sequences (i.e. we only keep the 9nt CLS1, 9nt CLS2, and 9nt CLS3 sequences), making the barcodes 27 bp in length.
However, your whitelist still contains the full 52 bp barcodes -- you need to remove the linker sequences from your rhapsodyv1_whitelist.txt as well.
You can do that via:
cut -c1-9,22-30,44-52 rhapsodyv1_whitelist.txt > updated_barcodes.txt
And then supply the updated_barcodes.txt file to kb. Hopefully, in a future release, we'll have kb automate this entire step.
Ah, very sorry, that was very silly of me! It works now after supplying the shortened barcodes, thanks :)
Just to check: -x BDWTA
still doesn't work - I'm guessing this is because it hasn't been formally rolled out yet and not that I have a version issue?
Correct, -x BDWTA has not been formally rolled out yet.
Any updates here? I would love to be able to benchmark against the BD-provided pipeline.
Also, will this support WTA + AbSeq?
For BD WTA, you can use the pipeline we provided in the comments above. For Abseq, this would require the kallisto feature barcoding workflow (called kite): I'll defer to @sbooeshaghi for further info on this.
All the best with benchmarking -- would be great if you share some of the benchmarking results with us! :)
kb
can quantify feature barcoding data with a list of targeted sequences. Are the WTA + AbSeq FASTQ reads mixed or separated? Is the barcode / UMI structure the same?
I think they would be mixed. The cell label / umi architecture is the same.
Is there some known sequence in the biological read that allows one to distinguish between WTA and AbSeq reads?
Each AbSeq read is associated with a sequence the corresponds to a cognate protein.
The vendor pipeline requires the user to specify what those sequences are. Here is an example from one of their publicly available datasets: AbSeqDemo_AbOligoReference.fasta.txt
I think this is a discussion that is very helpful, with a lot of good information.
I apologize that this is a bit off topic, but would you mind sharing the source of the whitelist.txt file that you used for the BD Rhapsody dataset that you used?
This is the barcode whitelist that I used.
Like @jeremyadamsfisher, my WTA + AbSeq fastq reads are mixed. I'd asked for support in pachterlab/kb_python#123 but am not sure there has been an update yet. One thing to note would be that I've discovered that the abseq amplicon structure isn't entirely the same as the RNA/WTA one where there is an extra 12bp UMI before abseq sequence in read 2, and the polydT is 25 instead of 18bp. I haven't managed to get things working with bustools or kb_python but would be great if bustools/kb_python could support the BD WTA + Abseq platform.
Thank you very much, @andrewjkwok !
I am trying to re-analyze data from SRR14144057. Locally, I think I mostly see 10x scRNA-Seq.
However, otherwise, I am trying to do some testing with a journal club presentation for the STARsolo paper. So, in principle, I think that could be an advantage to the method (if you can define a white list in the right format). I am also trying to run Kallisto and Alevin with 1 set of parameters each.
Thanks again! I apologize again for getting a little off topic.
This is the barcode whitelist that I used.
Like @jeremyadamsfisher, my WTA + AbSeq fastq reads are mixed. I'd asked for support in pachterlab/kb_python#123 but am not sure there has been an update yet. One thing to note would be that I've discovered that the abseq amplicon structure isn't entirely the same as the RNA/WTA one where there is an extra 12bp UMI before abseq sequence in read 2, and the polydT is 25 instead of 18bp. I haven't managed to get things working with bustools or kb_python but would be great if bustools/kb_python could support the BD WTA + Abseq platform.
Hi @cwarden45 @andrewjkwok did you manage to use kb to align BD WTA + Abseq ? Thanks
@cathalgking From my end, I didn't continue to troubleshoot much longer, since I don't think I have encountered a local lab that produces that type of data.
Sorry that I can't be more help!
@cwarden45 ok. So from my understanding of the thread, kb can align BD data, but not with Abseq present?
@cathalgking I don't believe that I had any secondary library with the public data, but I didn't do much testing at all after the earlier communication.
If you have the whitelist that was provided, then I am not sure if that might help. I think that might not have been sufficient to solve the problem at my end. However, if you change additional configurations, then I am not sure if that might be sufficient to be BD data to work (without Abseq).
However, again, I hope that somebody else can provide a more helpful answer.
It was a while ago. However, if it somehow helps, I think this was the error message that I countered:
Error: technology string must contain two colons (:), none found: "BDWTA"
Unable to create technology: BDWTA
I am not going to continue troubleshooting for this public sample. However, I don't know if this helps find the solution for others.
@sbooeshaghi Is it possible to process BD Rhapsody + Abseq reads using kb?
Any updates here? I would love to be able to benchmark against the BD-provided pipeline.
Also, will this support WTA + AbSeq?
@jeremyadamsfisher Did you manage to process the BD + Abseq data to a count matrix using kb?
Hi,
I have an experiment wih a whitelist composed of 35 barcodes (60bp long, non standard technology in development) and when I try to run bustools correct I immediately obtain :
(/bioinfo/local/build/Centos/envs_conda/kallisto_0.46.1) -bash-4.2$ bustools correct -w whitelist.txt -o output.sorted.correct.bus output.sort.bus Found 35 barcodes in the whitelist terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)
I tried to increase RAM to 200Gb but seems not to solve the problem.
May you help me to understand where I am wrong, please ?
Thank you in advance