BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License
92 stars 23 forks source link

inDrops barcode list #4

Closed camilafarias112 closed 5 years ago

camilafarias112 commented 5 years ago

Hi!

I have an inDrops dataset to preprocess and I am trying to run it using the recent bustools for scRNAseq. Unlike 10xGenomics, InDrops library V3 provides two list of barcodes - inDrops barcode lists where the first one is the beginning nucleotide sequence and the second one is the end part of the nucleotide sequence. Each list has 384 "half"-barcodes and the way that other preprocessing tools manage this is by remaking all possible combinations between the two list so we would have 384*384 = 147,456 barcodes, which is exact the number of barcodes that the inDrops platform claims to have.

I couldn't succeed in use the two .txt lists (instead one like the _10xv2whitelist.txt from 10x platform) as input in the step of correcting the barcodes in the busfile.

bustools correct -w ../gel_barcode1_list.txt gel_barcode2_list.txt -o output.correct.bus output.bus How can me manage that?

Finally, thank you Pachter Lab for this amazing tool that was recently released! I really appreciate it!

camilafarias112 commented 5 years ago

Also, when I made the list of 147,456 barcodes, by making a new .txt file with all possible combinations of "half"-barcodes barcodes_inDrops.txt

By doing this: bustools correct -w ../barcodes_inDrops.txt -o output.correct.bus output.bus

I got the following error message:

Found 147456 barcodes in the whitelist Number of hamming dist 1 barcodes = 7895424 Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 18 check that your whitelist matches the technology used

Ps.: I used the term -x inDrops when pseudoaligning the reads

sbooeshaghi commented 5 years ago

Hi! Glad you like the tools.

Currently the "correct" command doesn't work with variable length barcodes so your best bet is to skip the correct step. It is something we will work on in the future. All other steps should work just fine.

I will close this issue. If something else arises, feel free to open it back up.

sbooeshaghi commented 5 years ago

Ok after discussing with Páll, we came up with a slight trick that may work:

Append an "A" to the beginning of each barcode until the length of the barcode is the length of the longest barcode, then do barcode error correction.

Here is the "whitelist" with the above modification: inDrops_modified_whitelist.txt

Try out bustools correct -w ../inDrops_modified_whitelist.txt -o output.correct.bus output.bus

I'm curious to know if this works!

camilafarias112 commented 5 years ago

Hi Sina!

I really appreciate the attention in this thread. So! I tried the two ways: using the "A" appended trick and skipping the correction step.

1) Appending the "A"s and applying the correction, bustools gave me:

Found 147456 barcodes in the whitelist
Number of hamming dist 1 barcodes = 8292096
Processed 18141042 bus records
In whitelist = 2
Corrected = 21
Uncorrected = 18141019

My output for this was a dataset with very few barcodes passing correction (~20), and further analysis with BUSpaRse weren't possible.

2) I skipped the correction step, and not good but interestingly, I got a output with a enormous list of barcodes. I then compared the results with the other preprocessing pipeline that I was using, and further analysis with BUSpaRse (looking at the knee plot for example) showed me more then the double of barcodes simultaneously with half of the UMIs per barcode. I suspect that I have the same cell being counted as two, because the barcodes were not corrected, and the UMIs are split between them. To see this result, I had to mute the whitelist argument when make_sparse_matrix since the actual barcodes are uncorrected...

Thank you! I really appreciate all the tools developed in your lab!

sbooeshaghi commented 5 years ago

Ok so I went through the kallisto bus code and currently kallisto bus will interpret the four fastq files associated with v3 inDrops as two separate sample files, as is the case for v1/v2. This will give you a BUS file with a huge number of barcodes and incorrect results.. this is not the desired behavior! So currently kallisto bus cannot be used for inDrops v3.

We will get working on v3 and will update the binaries when everything is functional.

camilafarias112 commented 5 years ago

Hi Sina, sorry for the late response! I was trying to figure out library issues.

So, actually I was using a v2 of inDrops library preparation, with 2 runs per sample.

What was interesting and I didn't know was that the inDrops total barcode has an adaptor sequence (that is constant GAGTGATTGCTTGTGACGCCTT) between the 2 "half"-barcode sequences (see image).

InDrops_shapeFinal

The length of barcodes vary between 38-41.

I added this adaptor sequence between the two half-barcodes as the following: barcodes_inDrops.txt

And when I applied the correction step by:

kallisto index -i MouseIndex Mus_musculus.GRCm38.cdna.all.fa
kallisto bus -i MouseIndex -o bus_output_ear/ -x inDrops -t 20 ear_R.fastq.gz ear_F.fastq.gz
cd bus_output_ear/
bustools correct -w ../barcodes_inDrops.txt -o output.correct.bus output.bus

I got the following error:

Found 672 barcodes in the whitelist
Number of hamming dist 1 barcodes = 48828
Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 40
       check that your whitelist matches the technology used

Does the whitelist has to necessarily have 19 nucleotides for the correction step? Thank you!

droplet-lab commented 5 years ago

Hi Camila, I would probably run kallisto bus as follows: kallisto bus -i MouseIndex -o bus_output_ear/ -x 0,0,40:0,40,46:1,0,0 -t 20 ear_R.fastq.gz ear_F.fastq.gz cd bus_output_ear/

I think the problem is that maybe the authors used fastq files where the bridging adaptor was already removed (I am probably wrong though) so they would expect the format of barcodes to be max 19 bp.

Let me know if this works

camilafarias112 commented 5 years ago

Hi! Actually these are fastq from my lab, and I am working on them as is from the sequencing. So the adaptors must be there still.

The thing is that we recently acquired a InDrops encapsulator, we had the sequencer, and we are using the pipeline developed by the company all in python, which I am not pro in (I am a R person). kallisto/bustools called my attention because it is in R and I use kalisto for a long time now to work in bulk RNAseq files and teach transcriptomics. Just recently I started to work with single-cell RNAseq.

I appreciate the reply! Thank you!

droplet-lab commented 5 years ago

Hi, I just wanted to come back and see if the team was still actively developing an option for indrops v3 ? I am stuck unfortunately at the moment, I have tried to do the following: Join R2 and R4 files: join <(zcat < indrop_R2.fastq.gz | nl ) <(zcat < indrop_R4.fastq.gz | nl) | awk -F ' ' '{if ($2 ~ /^@SeqID/ ) {print $4" "$5;} else {print $2" "$3;}}' | gzip > indrop_R5.fastq.gz

Then custom demultiplex with barcode_splitter: barcode_splitter --bcfile ../../../../Barcodes_indrop.txt indrop_R1.fastq.gz indrop_R3.fastq.gz indrop_R5.fastq.gz --idxread 2 --suffix .fastq

And then ran kallisto bus: kallisto bus -o output -x 0,0,16:0,16,22:1,0,0 -t 4 -i Mus_musculus.GRCm38.cdna.all.*idx demux-indrop_read-3.fastq demux-indrop_read-1.fastq

but end up having very low counts per barcode :(, so i am not sure what is going on (i have tried the conventional indrops pipeline, and it works fine for my files in terms of UMIs). Would you perhaps be able to share some information or help me out ??

Thanks !!

bobchen1701 commented 5 years ago

I'm having the exact same issue as Camila with our inDrops v2 chemistry after running the pseudoalignment step with kallisto bus using the "inDrops" argument for the technology.

Putting R1 first, as described by the tutorial, lead to a super low 100,000 reads aligned. So I swapped the order of R1 and R2 in the terminal command, due to R2 being the barcode and R1 being the transcript in the case of inDrops.

This resulted in a very impressive 100 million aligned reads of 150 million reads processed (on an i7-4790 with 16GB RAM in about 5 minutes), further tests proceeded with these 100 million alignments.

As I understand it, inDrops v2 barcode regions are structured as follows:

[variable length barcode1][22-bp W1 sequence][8-bp barcode2][6-bp UMI]

barcode 1 is taken from the pool found here and barcode 2 is taken from the pool found here.

Attempting to correct with just the barcode2 whitelist leads to: Error: barcode length and whitelist length differ, barcodes = 19, whitelist = 15

Attempting to correct with the modified inDrops whitelist that Sina linked leads to very few corrections, just like Camila.

Which inDrops barcode chemistries are directly supported by kallisto bus as well as the downstream correction steps?

lakigigar commented 5 years ago

We have now finalized the workflow for processing inDrops v3, and have wrapped it into kb.

slowkow commented 4 years ago

Could I please ask if you might be able to help me to understand how to process inDrops v3? Specifically, I'm wondering if kb can process inDrops v3 with 4 fastq files.

Thank you for any advice!


I have 4 fastq files, as described on the indrops page:

   # For a V3 run with several parts, with several libraries that are not already demultiplexed
    fastq_path : "{library_prefix}_{split_affix}_{read}.fastq.gz" # Read with be replaced by R1, R2, R3, R4 as appropriate.
    split_affixes : ["L001", "L002", "L003", "L004"]
    libraries :  # The library index is what the expected index read sequence (on a NextSeq, this is the reverse complement of the index sequence)
      - {library_name: "test_lib3", library_index: "ATAGAG"}
      - {library_name: "test_lib4", library_index: "AGAGGA"}

From the indrops page, I can see:

v3 : summer 2016 redesign requiring manual demultiplexing. R1 is the biological read. R2 carries the first half of the gel barcode, R3 carries the library index and R4 the second half of the gel barcode, the UMI and a fraction of the polyA tail.

Here is another description of the 4 files from the Harvard Bioinformatics Core:

image

When I search for a relevant tutorial in the bus tools repo, this is what I find:

https://github.com/BUStools/BUS_notebooks_python/blob/master/dataset-notebooks/indrops_GSM2746895_python/indrops_brain_activity.ipynb

This tutorial expects 2 files (R1 R2), not 4 (R1 R2 R3 R4).

When I run kb --list, I see:

$ kb --list
List of supported single-cell technologies

name         whitelist provided    barcode (file #, start, stop)        umi (file #, start, stop)    read file #
---------    ------------------    ---------------------------------    -------------------------    -----------
10XV1        yes                   (2, 0, 0)                            (1, 0, 0)                    0
10XV2        yes                   (0, 0, 16)                           (0, 16, 26)                  1
10XV3        yes                   (0, 0, 16)                           (0, 16, 28)                  1
CELSEQ                             (0, 0, 8)                            (0, 8, 12)                   1
CELSEQ2                            (0, 6, 12)                           (0, 0, 6)                    1
DROPSEQ                            (0, 0, 12)                           (0, 12, 20)                  1
INDROPSV1                          (0, 0, 11) (0, 30, 38)               (0, 42, 48)                  1
INDROPSV2                          (1, 0, 11) (1, 30, 38)               (1, 42, 48)                  0
INDROPSV3    yes                   (0, 0, 8) (1, 0, 8)                  (1, 8, 14)                   2
SCRUBSEQ                           (0, 0, 6)                            (0, 6, 16)                   1
SURECELL                           (0, 0, 6) (0, 21, 27) (0, 42, 48)    (0, 51, 59)                  1

It looks to me like the INDROPSV3 technology is expecting 2 files, but I have 4 files, and they're split across many lanes.

TomKellyGenetics commented 4 years ago

@slowkow If I understand the setup correctly, bustools expects 3 files for inDrops version 3. You should have demultiplexed by the library index (i5) with bcl2fastq using --create-fastq-for-index-reads. This will give up separate FASTQ files for each "sample" index in I1.

File 0: this is the second index (i7) given by bcl2fastq as I2 File 1: this must contain the barcode and UMI and should be R1 (14 bp) File 2: this contains the read from the transcript and should be R2 (61 bp)

So your "R3" is not needed as it is for which sample was sequenced, not the cell barcodes.

kushtimusPrime commented 4 years ago

Can bustools now correct with variable barcode lengths for inDrops v2?

sbdaxia commented 4 years ago

Just to echo what many people have talked about before, I am still having issue using the Kallisto Bus pipeline for inDrops v2 data. I am using Kallisto 0.46.2, which specifies a inDropsv2 option. This allowed input files R1 and R2 in that order with R1 being cDNA and R2 being CB+UMI. However, the barcode list generated from the pipeline is still only 19bp for all, which doesn't adjust for the variable length of barcode. Therefore, I am still getting many fold more cells than expected.

Thank you all for developing this amazing tool! Hope this issue can be resolved soon.

Artur-man commented 3 years ago

Ok so I went through the kallisto bus code and currently kallisto bus will interpret the four fastq files associated with v3 inDrops as two separate sample files, as is the case for v1/v2. This will give you a BUS file with a huge number of barcodes and incorrect results.. this is not the desired behavior! So currently kallisto bus cannot be used for inDrops v3.

We will get working on v3 and will update the binaries when everything is functional.

Hey there,

I was just wondering if kallisto bus producing bus files with higher number of barcodes were fixed for IndropsV3!

Thanks