camilafarias112 closed this issue 5 years ago
Also, I made a list of 147,456 barcodes by writing all possible combinations of the "half"-barcodes to a new .txt file: barcodes_inDrops.txt
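For context, a combined whitelist like this can be generated along the following lines (the half-barcode lists below are tiny stand-ins for the real 384-entry files, so this is just a sketch):

```python
from itertools import product

def combine_halves(first_halves, second_halves):
    """Concatenate every first-half barcode with every second-half barcode."""
    return [a + b for a, b in product(first_halves, second_halves)]

# Tiny stand-in lists; the real inDrops lists each have 384 entries,
# giving 384 * 384 = 147,456 combined barcodes.
first = ["AAAA", "CCCC"]
second = ["GGG", "TTT"]
combined = combine_halves(first, second)
```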
By doing this:
bustools correct -w ../barcodes_inDrops.txt -o output.correct.bus output.bus
I got the following error message:
Found 147456 barcodes in the whitelist
Number of hamming dist 1 barcodes = 7895424
Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 18
check that your whitelist matches the technology used
P.S.: I used the option -x inDrops when pseudoaligning the reads.
Hi! Glad you like the tools.
Currently the "correct" command doesn't work with variable-length barcodes, so your best bet is to skip the correct step. It is something we will work on in the future. All other steps should work just fine.
I will close this issue. If something else arises, feel free to open it back up.
Ok after discussing with Páll, we came up with a slight trick that may work:
Append an "A" to the beginning of each barcode until the length of the barcode is the length of the longest barcode, then do barcode error correction.
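The padding step can be sketched like this (the barcodes below are made up; a real run would read the inDrops whitelist from its .txt file):

```python
def pad_barcodes(barcodes, pad_char="A"):
    """Left-pad each barcode with pad_char up to the length of the longest barcode."""
    longest = max(len(bc) for bc in barcodes)
    return [pad_char * (longest - len(bc)) + bc for bc in barcodes]

# Made-up variable-length barcodes, standing in for the inDrops whitelist
barcodes = ["ACGTACGT", "ACGTACGTACG"]
padded = pad_barcodes(barcodes)
```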
Here is the "whitelist" with the above modification: inDrops_modified_whitelist.txt
Try out
bustools correct -w ../inDrops_modified_whitelist.txt -o output.correct.bus output.bus
I'm curious to know if this works!
Hi Sina!
I really appreciate the attention in this thread. So! I tried both ways: using the "A"-appending trick and skipping the correction step.
1) Appending the "A"s and applying the correction, bustools gave me:
Found 147456 barcodes in the whitelist
Number of hamming dist 1 barcodes = 8292096
Processed 18141042 bus records
In whitelist = 2
Corrected = 21
Uncorrected = 18141019
My output for this was a dataset with very few barcodes passing correction (~20), and further analysis with BUSpaRse wasn't possible.
2) I skipped the correction step and, not good but interestingly, I got an output with an enormous list of barcodes. I then compared the results with the other preprocessing pipeline that I was using, and further analysis with BUSpaRse (looking at the knee plot, for example) showed me more than double the barcodes, with half the UMIs per barcode. I suspect that the same cell is being counted as two, because the barcodes were not corrected and the UMIs are split between them. To see this result, I had to omit the whitelist argument when calling make_sparse_matrix, since the actual barcodes are uncorrected...
Thank you! I really appreciate all the tools developed in your lab!
Ok, so I went through the kallisto bus code, and currently kallisto bus will interpret the four fastq files associated with v3 inDrops as two separate sample files, as is the case for v1/v2. This will give you a BUS file with a huge number of barcodes and incorrect results; this is not the desired behavior! So currently kallisto bus cannot be used for inDrops v3.
We will get to work on v3 and will update the binaries when everything is functional.
Hi Sina, sorry for the late response! I was trying to figure out library issues.
So, actually I was using a v2 of inDrops library preparation, with 2 runs per sample.
What was interesting, and I didn't know, was that the full inDrops barcode has an adaptor sequence (a constant sequence, GAGTGATTGCTTGTGACGCCTT) between the two "half"-barcode sequences (see image).
The barcode lengths vary between 38 and 41.
I added this adaptor sequence between the two half-barcodes as follows: barcodes_inDrops.txt
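A whitelist with the adaptor inserted between the halves can be built along these lines (the half-barcode lists here are tiny stand-ins for the real 384-entry lists):

```python
from itertools import product

W1 = "GAGTGATTGCTTGTGACGCCTT"  # constant adaptor between the two half-barcodes

def build_whitelist(first_halves, second_halves, adaptor=W1):
    """Build full barcodes as first_half + adaptor + second_half."""
    return [a + adaptor + b for a, b in product(first_halves, second_halves)]

# Made-up half-barcodes for illustration
first = ["ACGT", "TTTT"]
second = ["GGCCAACC"]
whitelist = build_whitelist(first, second)
```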
And when I applied the correction step by:
kallisto index -i MouseIndex Mus_musculus.GRCm38.cdna.all.fa
kallisto bus -i MouseIndex -o bus_output_ear/ -x inDrops -t 20 ear_R.fastq.gz ear_F.fastq.gz
cd bus_output_ear/
bustools correct -w ../barcodes_inDrops.txt -o output.correct.bus output.bus
I got the following error:
Found 672 barcodes in the whitelist
Number of hamming dist 1 barcodes = 48828
Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 40
check that your whitelist matches the technology used
Does the whitelist necessarily have to have 19 nucleotides for the correction step? Thank you!
Hi Camila, I would probably run kallisto bus as follows:
kallisto bus -i MouseIndex -o bus_output_ear/ -x 0,0,40:0,40,46:1,0,0 -t 20 ear_R.fastq.gz ear_F.fastq.gz
cd bus_output_ear/
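If it helps to read the custom -x string: it is three file,start,stop triplets in the order barcode:UMI:sequence, so 0,0,40:0,40,46:1,0,0 takes the barcode from bases 0-40 of file 0, the UMI from bases 40-46 of file 0, and the biological sequence from file 1 (start and stop of 0 meaning the whole read). A minimal sketch of slicing a record that way (the read below is made up):

```python
def split_record(read, bc_span=(0, 40), umi_span=(40, 46)):
    """Slice a raw barcode read into (barcode, UMI) using -x style start/stop coordinates."""
    barcode = read[bc_span[0]:bc_span[1]]
    umi = read[umi_span[0]:umi_span[1]]
    return barcode, umi

# Made-up 46 bp read: a 40 bp barcode region followed by a 6 bp UMI
read = "A" * 40 + "CGTACG"
barcode, umi = split_record(read)
```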
I think the problem is that maybe the authors used fastq files where the bridging adaptor was already removed (I am probably wrong though), so they would expect barcodes of at most 19 bp.
Let me know if this works
Hi! Actually, these are fastqs from my lab, and I am working on them as-is from the sequencing, so the adaptors must still be there.
The thing is that we recently acquired an inDrops encapsulator, we had the sequencer, and we are using the pipeline developed by the company, which is all in Python, and I am not a pro in it (I am an R person). kallisto/bustools caught my attention because it is in R, and I have used kallisto for a long time now to work with bulk RNA-seq files and to teach transcriptomics. Just recently I started to work with single-cell RNAseq.
I appreciate the reply! Thank you!
Hi, I just wanted to come back and see if the team was still actively developing an option for inDrops v3? I am unfortunately stuck at the moment. I have tried to do the following:
Join the R2 and R4 files:
join <(zcat < indrop_R2.fastq.gz | nl ) <(zcat < indrop_R4.fastq.gz | nl) | awk -F ' ' '{if ($2 ~ /^@SeqID/ ) {print $4" "$5;} else {print $2" "$3;}}' | gzip > indrop_R5.fastq.gz
Then custom demultiplex with barcode_splitter:
barcode_splitter --bcfile ../../../../Barcodes_indrop.txt indrop_R1.fastq.gz indrop_R3.fastq.gz indrop_R5.fastq.gz --idxread 2 --suffix .fastq
And then ran kallisto bus:
kallisto bus -o output -x 0,0,16:0,16,22:1,0,0 -t 4 -i Mus_musculus.GRCm38.cdna.all.*idx demux-indrop_read-3.fastq demux-indrop_read-1.fastq
but I end up having very low counts per barcode :(, so I am not sure what is going on (I have tried the conventional inDrops pipeline, and it works fine for my files in terms of UMIs). Would you perhaps be able to share some information or help me out?
Thanks !!
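As an aside, the line-by-line join of R2 and R4 described above amounts to merging each pair of FASTQ records by concatenating their sequences and qualities. A rough sketch of that per-record merge (the records below are made up):

```python
def concat_records(rec_a, rec_b):
    """Merge two 4-line FASTQ records for the same read (e.g. from R2 and R4):
    keep the header of the first record, concatenate sequence and quality."""
    header_a, seq_a, _, qual_a = rec_a
    _, seq_b, _, qual_b = rec_b
    return [header_a, seq_a + seq_b, "+", qual_a + qual_b]

# Made-up R2/R4 records for the same read; real files would be streamed
# in parallel, four lines at a time, assuming identical read order.
r2 = ["@read1 2:N:0", "ACGTACGT", "+", "IIIIIIII"]
r4 = ["@read1 4:N:0", "TTTTGG", "+", "FFFFFF"]
merged = concat_records(r2, r4)
```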
I'm having the exact same issue as Camila with our inDrops v2 chemistry after running the pseudoalignment step with kallisto bus using the "inDrops" argument for the technology.
Putting R1 first, as described by the tutorial, led to a super low 100,000 reads aligned. So I swapped the order of R1 and R2 in the terminal command, since R2 is the barcode and R1 is the transcript in the case of inDrops.
This resulted in a very impressive 100 million aligned reads out of 150 million reads processed (on an i7-4790 with 16GB RAM in about 5 minutes); further tests proceeded with these 100 million alignments.
As I understand it, inDrops v2 barcode regions are structured as follows:
[variable length barcode1][22-bp W1 sequence][8-bp barcode2][6-bp UMI]
barcode 1 is taken from the pool found here and barcode 2 is taken from the pool found here.
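Given that structure, one way to split a v2 barcode read is to anchor on the constant W1 sequence, since barcode 1 is variable length. A rough sketch (the example read is made up, and a real implementation would also tolerate mismatches in W1):

```python
W1 = "GAGTGATTGCTTGTGACGCCTT"  # 22-bp W1 adaptor between barcode1 and barcode2

def parse_indrops_v2(read, w1=W1):
    """Split an inDrops v2 barcode read into (barcode1, barcode2, umi).
    barcode1 is variable length, so W1 is located by an exact search."""
    pos = read.find(w1)
    if pos == -1:
        return None  # W1 not found (e.g. sequencing errors in the adaptor)
    barcode1 = read[:pos]
    after_w1 = read[pos + len(w1):]
    barcode2 = after_w1[:8]   # 8-bp barcode2
    umi = after_w1[8:14]      # 6-bp UMI
    return barcode1, barcode2, umi

# Made-up read: 9 bp barcode1 + W1 + 8 bp barcode2 + 6 bp UMI
read = "ACGTACGTA" + W1 + "GGGGCCCC" + "TTTTTT"
bc1, bc2, umi = parse_indrops_v2(read)
```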
Attempting to correct with just the barcode2 whitelist leads to:
Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 15
Attempting to correct with the modified inDrops whitelist that Sina linked leads to very few corrections, just like Camila.
Which inDrops barcode chemistries are directly supported by kallisto bus as well as the downstream correction steps?
We have now finalized the workflow for processing inDrops v3, and have wrapped it into kb.
Could I please ask if you might be able to help me to understand how to process inDrops v3? Specifically, I'm wondering if kb can process inDrops v3 with 4 fastq files.
Thank you for any advice!
I have 4 fastq files, as described on the indrops page:
# For a V3 run with several parts, with several libraries that are not already demultiplexed
fastq_path : "{library_prefix}_{split_affix}_{read}.fastq.gz" # Read will be replaced by R1, R2, R3, R4 as appropriate.
split_affixes : ["L001", "L002", "L003", "L004"]
libraries : # The library index is what the expected index read sequence (on a NextSeq, this is the reverse complement of the index sequence)
- {library_name: "test_lib3", library_index: "ATAGAG"}
- {library_name: "test_lib4", library_index: "AGAGGA"}
From the indrops page, I can see:
v3 : summer 2016 redesign requiring manual demultiplexing. R1 is the biological read. R2 carries the first half of the gel barcode, R3 carries the library index and R4 the second half of the gel barcode, the UMI and a fraction of the polyA tail.
Here is another description of the 4 files from the Harvard Bioinformatics Core:
When I search for a relevant tutorial in the bus tools repo, this is what I find:
This tutorial expects 2 files (R1 R2), not 4 (R1 R2 R3 R4).
When I run kb --list, I see:
$ kb --list
List of supported single-cell technologies
name whitelist provided barcode (file #, start, stop) umi (file #, start, stop) read file #
--------- ------------------ --------------------------------- ------------------------- -----------
10XV1 yes (2, 0, 0) (1, 0, 0) 0
10XV2 yes (0, 0, 16) (0, 16, 26) 1
10XV3 yes (0, 0, 16) (0, 16, 28) 1
CELSEQ (0, 0, 8) (0, 8, 12) 1
CELSEQ2 (0, 6, 12) (0, 0, 6) 1
DROPSEQ (0, 0, 12) (0, 12, 20) 1
INDROPSV1 (0, 0, 11) (0, 30, 38) (0, 42, 48) 1
INDROPSV2 (1, 0, 11) (1, 30, 38) (1, 42, 48) 0
INDROPSV3 yes (0, 0, 8) (1, 0, 8) (1, 8, 14) 2
SCRUBSEQ (0, 0, 6) (0, 6, 16) 1
SURECELL (0, 0, 6) (0, 21, 27) (0, 42, 48) (0, 51, 59) 1
It looks to me like the INDROPSV3 technology is expecting 2 files, but I have 4 files, and they're split across many lanes.
@slowkow If I understand the setup correctly, bustools expects 3 files for inDrops version 3. You should have demultiplexed by the library index (i5) with bcl2fastq using --create-fastq-for-index-reads. This will give you separate FASTQ files for each "sample" index in I1.
File 0: this is the second index (i7), given by bcl2fastq as I2
File 1: this must contain the barcode and UMI and should be R1 (14 bp)
File 2: this contains the read from the transcript and should be R2 (61 bp)
So your "R3" is not needed, as it identifies which sample was sequenced, not the cell barcodes.
Can bustools now correct with variable barcode lengths for inDrops v2?
Just to echo what many people have said before, I am still having issues using the kallisto bus pipeline for inDrops v2 data. I am using kallisto 0.46.2, which specifies an inDropsv2 option. This allowed input files R1 and R2 in that order, with R1 being cDNA and R2 being CB+UMI. However, the barcodes generated by the pipeline are still all 19 bp, which doesn't account for the variable barcode length. Therefore, I am still getting many-fold more cells than expected.
Thank you all for developing this amazing tool! Hope this issue can be resolved soon.
Hey there,
I was just wondering if the issue of kallisto bus producing BUS files with a higher number of barcodes has been fixed for inDrops v3!
Thanks
Hi!
I have an inDrops dataset to preprocess, and I am trying to run it using the recent bustools for scRNA-seq. Unlike 10x Genomics, inDrops library v3 provides two lists of barcodes (the inDrops barcode lists), where the first one is the beginning nucleotide sequence and the second one is the end part of the nucleotide sequence. Each list has 384 "half"-barcodes, and the way other preprocessing tools manage this is by making all possible combinations between the two lists, so we would have 384*384 = 147,456 barcodes, which is exactly the number of barcodes that the inDrops platform claims to have.
I couldn't succeed in using the two .txt lists (instead of one, like the _10xv2whitelist.txt from the 10x platform) as input in the step of correcting the barcodes in the BUS file.
bustools correct -w ../gel_barcode1_list.txt gel_barcode2_list.txt -o output.correct.bus output.bus
How can I manage that? Finally, thank you Pachter Lab for this amazing tool that was recently released! I really appreciate it!