BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License
90 stars 23 forks source link

bustools correct : 'std::bad_alloc' #49

Open mdeloger opened 4 years ago

mdeloger commented 4 years ago

Hi,

I have an experiment wih a whitelist composed of 35 barcodes (60bp long, non standard technology in development) and when I try to run bustools correct I immediately obtain :

(/bioinfo/local/build/Centos/envs_conda/kallisto_0.46.1) -bash-4.2$ bustools correct -w whitelist.txt -o output.sorted.correct.bus output.sort.bus Found 35 barcodes in the whitelist terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)

I tried to increase RAM to 200Gb but seems not to solve the problem.

May you help me to understand where I am wrong, please ?

Thank you in advance

sbaghai commented 4 years ago

Dear developers, I am experiencing the same problem (with 200+ GB of RAM). Do you have any recommendations for us? Thank you!

sbaghai commented 4 years ago

Hello again, I’m adding some details since I am using data produced with BD Rhapsody and I think it would be great for all users if kallisto | bustools could support that format too. BD Rhapsody has quite long cell barcodes (52bp), composed of predefined sequences: [9nt CLS1] + [12nt linker1] + [9nt CLS2] + [13nt linker2] + [9nt CLS3]. The linker sequences are constant, the CLS sequences can vary in a defined manner (97 possible sequences for each CLS, for a total of ~900k different combinations). The Teichmann Lab explained it here nicely: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html

I have about 80GB of fastqs, and wanted to use kallisto + bustools to obtain the expression count matrix. I first tried with the kb_python wrapper. I built the reference without issues (kb ref'), then I run the ‘kb count’ step, providing the whitelist of 900k+ barcodes and specifying as chemistry '-x 0,0,52:0,52,60:1,0,0' (cell barcodes from 1st to 52th base of R1, UMIs from 53th to 60th base of R1, transcriptomic sequences in whole R2). Apparently it worked (last line in the log was “INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus”), but it detected only ~9000 barcodes (we expect more than 50k). I inspected the barcodes produced in “counts_unfiltered/cells_x_genes.barcodes.txt”, and none of these actually are from the whitelist I provided. They were somehow rearranged in this fashion: [last 9nt from linker1] + [CLS2] + GG + [CLS1] + [linker1] + [CLS2] + GG

I tried to re-execute only the barcode correction step with 'bustools correct'. I had used the '--keep-tmp' flag before, and used as input the sorted BUS file 'output.s.bus' from the tmp folder. The command immediately exited with the 'core dumped' error specified above. Also, the number of barcodes found in my whitelist is somehow miscalculated (”Found 9409 barcodes in the whitelist”, when there are 912673).

I tried repeating this with only 10 barcodes and 10 reads from the fastq files, and had the same issues both with 'kb count' (I obtain 10 inexistent barcodes in counts_unfiltered/cells_x_genes.barcodes.txt) and 'bustools correct' (wrong number of barcodes counted from the whitelist + core dumped error).

Versions of tools: kb_python 0.24.4 kallisto 0.46.2 bustools 0.40.0

Sorry for this lengthy message, I hope this can be somehow useful to address the issue. Thank you for these tools and for the time you take to adapt them to the existing technologies!

sbooeshaghi commented 3 years ago

Hi, I just added an option to kallisto bus (in the development branch of the kallisto repository, )https://github.com/pachterlab/kallisto/tree/devel which enables custom extraction of barcodes and UMI for the BD Rhapsody WTA assay based on the read structure linked here: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html. You can run it with kallisto bus -x BDWTA ... R1.fastq.gz R2.fastq.gz where the barcodes and umi are in R1 and the cDNA is in R2.

What I've done is extracted just the relevant parts of R1, ie only the barcode sequences (ignoring the linkers). This results in a 9+9+9=27 base barcode which bustools can handle.

Please try that out but keep in mind this is still in development and will need to be tested on real data. Note, the whitelist will need to be made in a similar way (removing the linkers).

(edit: This is not an issue with RAM. kallisto bus and bustools encode each base (A, T, G, C) with 2 bits therefore the maximum length barcode, on a 62 bit processor, is 31 bases.)

sbaghai commented 3 years ago

Dear Sina, amazing, thank you very much! I somehow missed this reply notification, I will test this as soon as possible and get back to you. @mdeloger have you by any chance tested Sina's improvement?

Best, Simo

sbaghai commented 3 years ago

Hi again @sbooeshaghi,

I tried starting from scratch, cloning the repository and checking out to the devel branch, and then I followed the installation steps (from INSTALL.md). When I run make, I get the following error:


[ 33%] Building CXX object src/CMakeFiles/kallisto_core.dir/main.cpp.o
/resources/software/kallisto/src/main.cpp: In function ‘bool CheckOptionsBus(ProgramOptions&)’:
/resources/software/kallisto/src/main.cpp:1213:20: error: ‘struct BUSOptionSubstr’ has no member named ‘push_back’
         busopt.umi.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13 + 9, 9 + 12 + 9 + 13 + 9 + 8)); // umi
                    ^
make[2]: *** [src/CMakeFiles/kallisto_core.dir/main.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/kallisto_core.dir/all] Error 2
make: *** [all] Error 2

This does not happen if I compile it in the master branch. The error seems a common one (http://cplusplus.com/forum/beginner/203789/, https://stackoverflow.com/questions/57138792/why-does-it-queue-has-no-member-named-push-back), but as a C++ illiterate, I'd be very grateful if you could take care of it (or if you can point out what I'm doing wrong).

Thanks ahead and congrats on the publication!!

sbooeshaghi commented 3 years ago

Ah I know what the issue is. I should have been a bit more thorough in compiling kallisto to make sure that UMIs can come from multiple files, my mistake. The bustools UMI option currently does not support UMIs from multiple files, however this should not be a hard improvement to make. The reason you are getting this error is because the BUS option for the UMI is not a vector (with the push_back operation) unlike the Barcode and Read.

This snippet in the kallisto repo needs to be fixed along with the instances of reading the UMI from the bustools opt. This shouldn't be too hard to do, I'll get back to you.

https://github.com/pachterlab/kallisto/blob/ae81a8620c5cc247eba2128866faefa6e3c6e03e/src/common.h#L25

struct BUSOptions {
  int nfiles;

  BUSOptionSubstr umi;            // <-- This line here
  std::vector<BUSOptionSubstr> bc;
  std::vector<BUSOptionSubstr> seq;

  int getBCLength() const {
    int r =0 ;
    if (!bc.empty()) {
      for (auto& b : bc) {
        if (b.start < 0) {
          return 0;
        } else {
          r += b.stop - b.start;
        }
      }
    }
    return r;
  }

  int getUMILength() const {
    if (umi.start >= 0) {
      return umi.stop - umi.start;
    } else {
      return 0;
    }
  }
};
sbooeshaghi commented 3 years ago

Hi @sbaghai and @mdeloger,

@yenaled has added the ability for the UMI to come from multiple files in a pull-request to the kallisto repo: https://github.com/pachterlab/kallisto/pull/301

You can install kallisto from his fork (https://github.com/Yenaled/kallisto/tree/smart-seq3) and test it out. The only modification to his code you'll need to make is to add the BDWTA option in the src/main.cpp as follows (code taken from https://github.com/pachterlab/kallisto/blob/b47fe2d65185eb5388eb819b26404f23eb671890/src/main.cpp#L1205)

  else if (opt.technology == "BDWTA")
  {
    busopt.nfiles = 2;
    busopt.bc.push_back(BUSOptionSubstr(0, 0, 9)); // bc1 CLS1
    // busopt.bc.push_back(BUSOptionSubstr(0,9,9+12)); // linker
    busopt.bc.push_back(BUSOptionSubstr(0, 9 + 12, 9 + 12 + 9)); // bc2 CLS2
    // busopt.bc.push_back(BUSOptionSubstr(0,9+12+9,9+12+9+13)); // linker
    busopt.bc.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13, 9 + 12 + 9 + 13 + 9));          // bc3 CLS3
    busopt.umi.push_back(BUSOptionSubstr(0, 9 + 12 + 9 + 13 + 9, 9 + 12 + 9 + 13 + 9 + 8)); // umi
    busopt.seq.push_back(BUSOptionSubstr(1, 0, 0));
  }

I'm also going to transfer this issue to the kallisto repo as it belongs there.

(Edit) Issues cannot be transferred between organizations.

Yenaled commented 3 years ago

In theory, you don't even need to modify the code, you could just use the following option when running kallisto bus with my kallisto fork linked to above:

-x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0

Edit: Just added another commit -- using -x BDWTA should work now!

cnk113 commented 3 years ago

Quick question can we have multiple fragment files with a shared pool of UMIs with different read pairs with this modification?

Yenaled commented 3 years ago

@cnk113

Not sure I understand your question. Can you give an example of what you mean?

cnk113 commented 3 years ago

So I have custom library where I actually have internal reads that have a CB + UMI that match the 3' CB UMI from the same molecule. It's just the read pairs are split and I was wondering if kallisto can theoretically map multiple different reads with the same UMI.

Yenaled commented 3 years ago

I see, I think I understand. For each file (the internal reads file and the 3' reads file), does the UMI sequence and the CB sequence occur at the same position (e.g. first 16 bp's)? If so, I'd recommend combining the files together and running kallisto bus. If the files don't have the UMI+CB sequences occurring at the same position, I'd recommend modifying the fastq files so that they do occur at the same position. Then it should work.

kallisto handles multiple files where a single barcode and a single UMI sequence are distributed on the same line across the multiple files. It does not process multiple files in the manner you have described (i.e. the lines aren't paired up) -- in the case you described, those separate files are considered separate experiments and you need to combine those files together (so kallisto treats it as a single experiment).

I hope that makes sense!

cnk113 commented 3 years ago

Yeah, just to clarify, kallisto can take/leverage multiple different reads across the transcript for the same UMI to quantify the most likely transcript instead of using a singe read? This is all from a single experiment and sample.

Yenaled commented 3 years ago

^Correct.

andrewjkwok commented 3 years ago

Hi,

I've been having a similar problem with rhapsody data where bustools only detects ~9000 barcodes from my whitelist, at which point the programme always throws up:

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc during the bustools correct step (I used the kb count wrapper to execute the whole lot of bustools commands in one go).

Unfortunately, I can't get the fork from @Yenaled installed as I am working on a cluster and was wondering whether there were any updates in terms of support for rhapsody datasets?

Yenaled commented 3 years ago

^can you show us what command you're running and the full output?

Have you tried it with -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0?

andrewjkwok commented 3 years ago

I tried with -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0 and it still didn't work (I'm using kb_python, not the fork that was suggested because I haven't yet figured out how to install it onto my cluster yet (not sure if I have permission to).

The full command is

kb count --loom \
-w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt \
-i /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx \
-o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ \
-x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0 \
-t 11 \
--workflow lamanno \
--filter \
--verbose \
--overwrite \
--keep-tmp \
-g /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/t2g.txt \
-c1 /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/spliced_t2c.txt \
-c2  /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/unspliced_t2c.txt \
/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz

And my output is:

[2021-05-10 19:02:07,781]   DEBUG Printing verbose output
[2021-05-10 19:02:07,781]   DEBUG kallisto binary located at /gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/kall
isto/kallisto
[2021-05-10 19:02:07,781]   DEBUG bustools binary located at /gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bust
ools/bustools
[2021-05-10 19:02:07,781]   DEBUG Creating /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp directory
[2021-05-10 19:02:07,790]   DEBUG Namespace(bustools='/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bustools/bu
stools', c1='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/spliced_t2c.txt', c2='/well/jknight/Andrew/projects/SI_pilot_rhapsody_1712202
0/kb_python_GRCh38_gencodev37/unspliced_t2c.txt', cellranger=False, command='count', dry_run=False, fastqs=['/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/
WTCHG_821954_AN010_1.fastq.gz', '/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz'], filter='bustools', g='/well/jknight/Andrew/
projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/t2g.txt', h5ad=False, i='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/i
ndex.idx', kallisto='/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/kallisto/kallisto', keep_tmp=True, lamanno=F
alse, list=False, loom=True, m='4G', mm=False, no_inspect=False, no_validate=False, nucleus=False, o='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_
bus_velocity/', overwrite=True, report=False, t=11, tcc=False, tmp=None, verbose=True, w='/well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/
rhapsodyv1_whitelist.txt', workflow='lamanno', x='0,0,9,0,21,30,0,43,52:0,52,60:1,0,0')
[2021-05-10 19:02:07,872]    INFO Using index /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx to generate BUS file to /well/jkni
ght/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ from
[2021-05-10 19:02:07,872]    INFO         /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz
[2021-05-10 19:02:07,872]    INFO         /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:02:07,872]   DEBUG kallisto bus -i /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/index.idx -o /well/jknight/Andrew/projec
ts/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/ -x 0,0,9,0,21,30,0,43,52:0,52,60:1,0,0 -t 11 /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCH
G_821954_AN010_1.fastq.gz /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:47:36,002]   DEBUG 
[2021-05-10 19:47:36,002]   DEBUG [index] k-mer length: 31
[2021-05-10 19:47:36,002]   DEBUG [index] number of targets: 1,369,325
[2021-05-10 19:47:36,002]   DEBUG [index] number of k-mers: 1,566,249,881
[2021-05-10 19:47:36,002]   DEBUG [index] number of equivalence classes: 12,659,999
[2021-05-10 19:47:36,002]   DEBUG [quant] will process sample 1: /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_1.fastq.gz
[2021-05-10 19:47:36,002]   DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/data/raw/WTCHG_821954_AN010_2.fastq.gz
[2021-05-10 19:47:36,002]   DEBUG [quant] finding pseudoalignments for the reads ... done
[2021-05-10 19:47:36,002]   DEBUG [quant] processed 533,754,959 reads, 401,309,352 reads pseudoaligned
[2021-05-10 19:48:13,947]   DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/output.bus passed validation
[2021-05-10 19:48:13,948]    INFO Sorting BUS file /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/output.bus to /well/jknight/Andrew/pro
jects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:48:13,948]   DEBUG bustools sort -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus -T /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp -t 11 -m 4G /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/outpu
t.bus
[2021-05-10 19:50:01,391]   DEBUG Read in 401309352 BUS records
[2021-05-10 19:50:06,076]   DEBUG /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus passed validation
[2021-05-10 19:50:06,077]    INFO Inspecting BUS file /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:50:06,077]   DEBUG bustools inspect -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/inspect.json -w /well/jknight/Andre
w/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt -e /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velo
city/matrix.ec /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus
[2021-05-10 19:52:28,983]    INFO Correcting BUS records in /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus to /well/jkn
ight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus with whitelist /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_pyth
on_GRCh38_gencodev37/rhapsodyv1_whitelist.txt
[2021-05-10 19:52:28,983]   DEBUG bustools correct -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight
/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_v
elocity/tmp/output.s.bus
[2021-05-10 19:52:29,267]   DEBUG Found 9216 barcodes in the whitelist
[2021-05-10 19:52:29,268]   DEBUG terminate called after throwing an instance of 'std::bad_alloc'
[2021-05-10 19:52:29,268]   DEBUG what():  std::bad_alloc
[2021-05-10 19:52:29,275]   ERROR Found 9216 barcodes in the whitelist
terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc
[2021-05-10 19:52:29,275]   ERROR An exception occurred
Traceback (most recent call last):
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/main.py", line 866, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/main.py", line 208, in parse_count
    count_velocity(
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/count.py", line 1541, in count_velocity
    correct_result = bustools_correct(
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/validate.py", line 112, in inner
    results = func(*args, **kwargs)
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/count.py", line 439, in bustools_correct
    run_executable(command)
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/dry/__init__.py", line 24, in inner
    return func(*args, **kwargs)
  File "/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/utils.py", line 237, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/gpfs2/well/jknight/Andrew/python/rhapsody_May2021_ivybridge/lib/python3.8/site-packages/kb_python/bins/linux/bustools/bustools correct
 -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_pyt
hon_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus' died with <Signals.SIGAB
RT: 6>.
Yenaled commented 3 years ago

Hmm, this is a mystery to me.

Can you try running the "bustools correct" command on its own like:

bustools correct -o /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.c.bus -w /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/kb_python_GRCh38_gencodev37/rhapsodyv1_whitelist.txt /well/jknight/Andrew/projects/SI_pilot_rhapsody_17122020/output/kallisto_bus_velocity/tmp/output.s.bus

(P.S. Hopefully the tmp/ directory is still there; if not, you might need to use the --keep-tmp argument when you first run kb count)

andrewjkwok commented 3 years ago

When I try running the bustools correct command as you've suggested I get:

Found 9216 barcodes in the whitelist
Number of hamming dist 1 barcodes = 340992
Error: barcode length and whitelist length differ, barcodes = 27, whitelist = 52
       check that your whitelist matches the technology used

The 9216 barcodes found in the whitelist is consistent with the error that I see when I run kb count.

When I try to run kb count with -x BDWTA it says the technology is not found. Is it an issue of versions that I'm having? The version of kb_python I have is >>> print(kb.__version__) 0.26.0

Yenaled commented 3 years ago

Ah, I think I see the problem now! The BD Rhapsody has 52 bp cell barcodes: [9nt CLS1] + [12nt linker1] + [9nt CLS2] + [13nt linker2] + [9nt CLS3].

However, kallisto can only handle barcodes of up to 32 bp in length, which is why we remove the linker sequences (i.e. we only keep the 9nt CLS1, 9nt CLS2, and 9nt CLS3 sequences), making the barcodes 27 bp in length.

However, your whitelist still contains the full 52 bp barcodes -- you need to remove the linker sequences from your rhapsodyv1_whitelist.txt as well.

You can do that via:

cut -c1-9,22-30,44-52 rhapsodyv1_whitelist.txt > updated_barcodes.txt

And then supply the updated_barcodes.txt file to kb. Hopefully, in a future release, we'll have kb automate this entire step.

andrewjkwok commented 3 years ago

Ah, very sorry, that was very silly of me! It works now after supplying the shortened barcodes, thanks :)

Just to check: -x BDWTA still doesn't work - I'm guessing this is because it hasn't been formally rolled out yet and not that I have a version issue?

Yenaled commented 3 years ago

Correct, -x BDWTA has not been formally rolled out yet.

jeremyadamsfisher commented 3 years ago

Any updates here? I would love to be able to benchmark against the BD-provided pipeline.

Also, will this support WTA + AbSeq?

Yenaled commented 3 years ago

For BD WTA, you can use the pipeline we provided in the comments above. For Abseq, this would require the kallisto feature barcoding workflow (called kite): I'll defer to @sbooeshaghi for further info on this.

All the best with benchmarking -- would be great if you share some of the benchmarking results with us! :)

sbooeshaghi commented 3 years ago

kb can quantify feature barcoding data with a list of targeted sequences. Are the WTA + AbSeq FASTQ reads mixed or separated? Is the barcode / UMI structure the same?

jeremyadamsfisher commented 3 years ago

I think they would be mixed. The cell label / umi architecture is the same.

sbooeshaghi commented 3 years ago

Is there some known sequence in the biological read that allows one to distinguish between WTA and AbSeq reads?

jeremyadamsfisher commented 3 years ago

Each AbSeq read is associated with a sequence the corresponds to a cognate protein.

The vendor pipeline requires the user to specify what those sequences are. Here is an example from one of their publicly available datasets: AbSeqDemo_AbOligoReference.fasta.txt

cwarden45 commented 3 years ago

I think this is a discussion that is very helpful, with a lot of good information.

I apologize that this is a bit off topic, but would you mind sharing the source of the whitelist.txt file that you used for the BD Rhapsody dataset that you used?

andrewjkwok commented 3 years ago

cell_labels_2Comm.fasta.gz

This is the barcode whitelist that I used.

Like @jeremyadamsfisher, my WTA + AbSeq fastq reads are mixed. I'd asked for support in pachterlab/kb_python#123 but am not sure there has been an update yet. One thing to note would be that I've discovered that the abseq amplicon structure isn't entirely the same as the RNA/WTA one where there is an extra 12bp UMI before abseq sequence in read 2, and the polydT is 25 instead of 18bp. I haven't managed to get things working with bustools or kb_python but would be great if bustools/kb_python could support the BD WTA + Abseq platform.

cwarden45 commented 3 years ago

Thank you very much, @andrewjkwok !

I am trying to re-analyze data from SRR14144057. Locally, I think I mostly see 10x scRNA-Seq.

However, otherwise, I am trying to do some testing with a journal club presentation for the STARsolo paper. So, in principle, I think that could be an advantage to the method (if you can define a white list in the right format). I am also trying to run Kallisto and Alevin with 1 set of parameters each.

Thanks again! I apologize again for getting a little off topic.

cathalgking commented 2 years ago

cell_labels_2Comm.fasta.gz

This is the barcode whitelist that I used.

Like @jeremyadamsfisher, my WTA + AbSeq fastq reads are mixed. I'd asked for support in pachterlab/kb_python#123 but am not sure there has been an update yet. One thing to note would be that I've discovered that the abseq amplicon structure isn't entirely the same as the RNA/WTA one where there is an extra 12bp UMI before abseq sequence in read 2, and the polydT is 25 instead of 18bp. I haven't managed to get things working with bustools or kb_python but would be great if bustools/kb_python could support the BD WTA + Abseq platform.

Hi @cwarden45 @andrewjkwok did you manage to use kb to align BD WTA + Abseq ? Thanks

cwarden45 commented 2 years ago

@cathalgking From my end, I didn't continue to troubleshoot much longer, since I don't think I have encountered a local lab that produces that type of data.

Sorry that I can't be more help!

cathalgking commented 2 years ago

@cwarden45 ok. So from my understanding of the thread, kb can align BD data, but not with Abseq present?

cwarden45 commented 2 years ago

@cathalgking I don't believe that I had any secondary library with the public data, but I didn't do much testing at all after the earlier communication.

If you have the whitelist that was provided, then I am not sure if that might help. I think that might not have been sufficient to solve the problem at my end. However, if you change additional configurations, then I am not sure if that might be sufficient to be BD data to work (without Abseq).

However, again, I hope that somebody else can provide a more helpful answer.

cwarden45 commented 2 years ago

It was a while ago. However, if it somehow helps, I think this was the error message that I countered:

Error: technology string must contain two colons (:), none found: "BDWTA"
Unable to create technology: BDWTA

I am not going to continue troubleshooting for this public sample. However, I don't know if this helps find the solution for others.

cathalgking commented 2 years ago

@sbooeshaghi Is it possible to process BD Rhapsody + Abseq reads using kb?

cathalgking commented 2 years ago

Any updates here? I would love to be able to benchmark against the BD-provided pipeline.

Also, will this support WTA + AbSeq?

@jeremyadamsfisher Did you manage to process the BD + Abseq data to a count matrix using kb?