COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
746 stars 159 forks source link

Alevin didn't find barcode in frequency table #253

Closed habilzare closed 5 years ago

habilzare commented 5 years ago

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? alevin

Describe the bug I am trying to use Alevin to quantify a single cell RNA-Seq 10x Genomics CHROMIUM dataset. I am using Salmon 0.10.2, and it does not produce the count matrix. The output folder contains nothing but the log files. Specifically, I get the following error: [2018-07-19 16:27:46.916] [alevinLog] [error] Barcode not found in frequency table The full messages are bellow.

To Reproduce Steps and data to reproduce the behavior: I explain how to download the data and reproduce the issue in the following.

Specifically, please provide at least the following information:

Expected behavior A clear and concise description of what you expected to happen. The count matrix be saved when Alevin is done.

Screenshots The full output messages are: Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains new features, improvements, and bug fixes; please upgrade at your earliest convenience.

Logs will be written to alevin_output/logs

salmon (single-cell-based) v0.10.2

[ program ] => salmon

[ command ] => alevin

[ libType ] => { ISR }

[ mates1 ] => { SRR6327122_1.fastq.gz }

[ mates2 ] => { SRR6327122_2.fastq.gz }

[ chromium ] => { }

[ index ] => { index }

[ threads ] => { 2 }

[ output ] => { alevin_output }

[ tgMap ] => { transposon_sequence_set.fa.tsv }

[ whitelist ] => { cell_barcode_seq.txt }

[ dumpCsvCounts ] => { }

[2018-07-19 18:24:03.053] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored. [2018-07-19 18:24:03.059] [alevinLog] [info] Processing barcodes files (if Present)

processed 87 Million barcodes

[2018-07-19 18:26:13.307] [alevinLog] [info] Done barcode density calculation. [2018-07-19 18:26:13.307] [alevinLog] [info] # Barcodes Used: 86885223 / 87959276. [2018-07-19 18:26:13.334] [alevinLog] [info] Done importing white-list Barcodes [2018-07-19 18:26:13.334] [alevinLog] [info] Total 54879 white-listed Barcodes [2018-07-19 18:26:18.285] [alevinLog] [info] Done populating Z matrix [2018-07-19 18:26:18.300] [alevinLog] [info] Done indexing Barcodes [2018-07-19 18:26:18.301] [alevinLog] [info] Total Unique barcodes found: 978816 [2018-07-19 18:26:18.301] [alevinLog] [info] Used Barcodes except Whitelist: 26208 [2018-07-19 18:26:18.504] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-07-19 18:26:18.505] [alevinLog] [info] parsing read library format [2018-07-19 18:26:18.632] [stderrLog] [info] Loading Suffix Array [2018-07-19 18:26:18.641] [stderrLog] [info] Loading Transcript Info [2018-07-19 18:26:18.647] [stderrLog] [info] Loading Rank-Select Bit Array [2018-07-19 18:26:18.648] [stderrLog] [info] There were 179 set bits in the bit array [2018-07-19 18:26:18.648] [stderrLog] [info] Computing transcript lengths [2018-07-19 18:26:18.648] [stderrLog] [info] Waiting to finish loading hash [2018-07-19 18:26:18.720] [stderrLog] [info] Done loading index [2018-07-19 18:26:18.506] [jointLog] [info] There is 1 library. [2018-07-19 18:26:18.629] [jointLog] [info] Loading Quasi index [2018-07-19 18:26:18.631] [jointLog] [info] Loading 32-bit quasi index [2018-07-19 18:26:18.720] [jointLog] [info] done [2018-07-19 18:26:18.720] [jointLog] [info] Index contained 179 targets [2018-07-19 18:26:18.728] [alevinLog] [error] Barcode not found in frequency table

Desktop (please complete the following information):

Additional context I included a 10K subset of reads in the tarball, which leads to the same behavior by Alevin.

k3yavi commented 5 years ago

Hi @habilzare , Thanks a lot for using alevin and sharing a subset of the dataset which can reproduce the error effortlessly, it really saves us a lot of time !

If I may ask a quick question, how did we get the file cell_barcode_seq.txt for whitelist ? It looks like the file contains ~50k whitelisted CB, and only a very tiny fraction of them were actually found in the read/seq Fastq file.

I agree that we can work on putting a more informative error so that user can resolve it by themselves. The current assumption for alevin for whitelist CB is -- if provided with external whitelist then user is sure of the presence of the CB w/ good enough frequency.

k3yavi commented 5 years ago

Thinking more about it, we can actually throw away a whitelisted CB with 0 frequency w/ a warning . Once we discuss this w/ the alevin team, I'd happy to add this filter to the alevin pipeline .

Thanks again @habilzare for pointing this out .

habilzare commented 5 years ago

Hi @k3yavi , Thanks for looking at this isssue. The CB whitelist corresponds to a larger study, which includes the cells in these fastq files and many more. I'll rerun with only the 2K-4K cells that are related to this specific sample, and then I send you and update.

habilzare commented 5 years ago

Hi @k3yavi , Using the 5238 barcodes that are specific to this experiment, and also removing the quotes from the transcript to gene map file bcNotFound-2018-07-19b.tar.gz, this time Alevin finished with no error. However, I did not get a count matrix in csv format. Also, the quants_mat_cols.txt file is missing, and I do not know how to read the binary quants.mat file.

salmon alevin -l ISR -1 SRR6327122_1.fastq.gz -2 SRR6327122_2.fastq.gz --chromium -i index -p 2 -o alevin_output --tgMap transposon_sequence_set.fa.tsv --whitelist barcode_seq_5K.txt --dumpCsvCounts Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains new features, improvements, and bug fixes; please upgrade at your earliest convenience.

Logs will be written to alevin_output/logs [2018-07-19 22:53:27.709] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.

salmon (single-cell-based) v0.10.2

[ program ] => salmon

[ command ] => alevin

[ libType ] => { ISR }

[ mates1 ] => { SRR6327122_1.fastq.gz }

[ mates2 ] => { SRR6327122_2.fastq.gz }

[ chromium ] => { }

[ index ] => { index }

[ threads ] => { 2 }

[ output ] => { alevin_output }

[ tgMap ] => { transposon_sequence_set.fa.tsv }

[ whitelist ] => { barcode_seq_5K.txt }

[ dumpCsvCounts ] => { }

[2018-07-19 22:53:27.714] [alevinLog] [info] Processing barcodes files (if Present)

processed 87 Million barcodes

[2018-07-19 22:55:37.299] [alevinLog] [info] Done barcode density calculation. [2018-07-19 22:55:37.299] [alevinLog] [info] # Barcodes Used: 86885223 / 87959276. [2018-07-19 22:55:37.303] [alevinLog] [info] Done importing white-list Barcodes [2018-07-19 22:55:37.303] [alevinLog] [info] Total 5238 white-listed Barcodes [2018-07-19 22:55:37.675] [alevinLog] [info] Done populating Z matrix [2018-07-19 22:55:37.683] [alevinLog] [info] Done indexing Barcodes [2018-07-19 22:55:37.683] [alevinLog] [info] Total Unique barcodes found: 978816 [2018-07-19 22:55:37.683] [alevinLog] [info] Used Barcodes except Whitelist: 20705 [2018-07-19 22:55:38.386] [jointLog] [info] There is 1 library. [2018-07-19 22:55:38.493] [jointLog] [info] Loading Quasi index [2018-07-19 22:55:38.494] [jointLog] [info] Loading 32-bit quasi index [2018-07-19 22:55:38.549] [jointLog] [info] done [2018-07-19 22:55:38.549] [jointLog] [info] Index contained 179 targets

[2018-07-19 22:55:38.385] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-07-19 22:55:38.385] [alevinLog] [info] parsing read library format [2018-07-19 22:55:38.495] [stderrLog] [info] Loading Suffix Array [2018-07-19 22:55:38.498] [stderrLog] [info] Loading Transcript Info [2018-07-19 22:55:38.499] [stderrLog] [info] Loading Rank-Select Bit Array [2018-07-19 22:55:38.500] [stderrLog] [info] There were 179 set bits in the bit array [2018-07-19 22:55:38.501] [stderrLog] [info] Computing transcript lengths [2018-07-19 22:55:38.501] [stderrLog] [info] Waiting to finish loading hash processed 87 Million fragmentserrLog] [info] Done loading index hits: 468892, hits per frag: 0.00535907

[2018-07-19 23:03:35.740] [jointLog] [info] Computed 150 rich equivalence classes for further processing [2018-07-19 23:03:35.740] [jointLog] [info] Counted 412868 total reads in the equivalence classes [2018-07-19 23:03:35.741] [jointLog] [warning] Only 412868 fragments were mapped, but the number of burn-in fragments was set to 5000000. The effective lengths have been computed using the observed mappings.

[2018-07-19 23:03:35.741] [jointLog] [info] Mapping rate = 0.469385%

[2018-07-19 23:03:35.741] [jointLog] [info] finished quantifyLibrary() [2018-07-19 23:03:35.755] [alevinLog] [info] Starting optimizer

Analyzed 5238 cells (100% of all). Skipped Barcodes are from High Confidence Region $ls -ltrha alevin_output/alevin/ total 256K drwxrwx--- 6 zare G-816158 4.0K Jul 19 22:36 .. -rw-rw---- 1 zare G-816158 960 Jul 19 23:03 alevin.log drwxrwx--- 2 zare G-816158 4.0K Jul 19 23:03 . -rw-rw---- 1 zare G-816158 81K Jul 19 23:03 quants_mat_rows.txt -rw-rw---- 1 zare G-816158 160K Jul 19 23:03 quants_mat.gz

k3yavi commented 5 years ago

Hi @habilzare , re: importing binary format: this tutorial might help more of how to read the binary output. Specifically, if you are familiar with python then this function can be used for importing the binary.

However, I reckon that alevin hasn't successfully completed. It looks like one of the whitelisted CB ended up having no read at all after deduplicating. We usually assume it should not happen since a whitelisted CB should have at least some count after deduplicating UMIs, which in your case can happen since the mapping rate looks like is surprisingly low 0.469385% (is this expected ?).

Skipped Barcodes are from High Confidence Region is actually an error, again we need more informative and elegant exit from the pipeline once Alevin fails.

k3yavi commented 5 years ago

Slight Correction on the above statement "It looks like one of the whitelisted CB ended up having no read at all after ~deduplicating~ mapping." We might have to tweak a bit in the current version of Alevin for use cases like yours where we don't wan't pipeline to fail if the whitelisted CB are either w/ less frequency / mapping rate / deduplication rate.

habilzare commented 5 years ago

The low rate of mapping is expected because in this study I am interested only in the transposons. Thanks @k3yavi for requesting the feature. I need this analysis to be done very urgently. Until you add the feature, I am going to implement the following workaround: I will create a pair of dummy fastq files that have exactly 1 read for each barcode in my whitelist. I will add these to my current fastq file when running Alevin. After I get the count matrix, I will subtract 1 from each entry.

k3yavi commented 5 years ago

Yep that should work too, although a couple of minor things, you might wanna use 10 reads since that's the lower bound. Secondly, those reads should be mapped too, I guess copy the 98 length sequence from the transposons's region. Lastly the UMI, if the UMI sequence overlap w/ already present UMI sequence then it can potentially effect the deduplication of cell having more than 10 counts, you might wanna chose a disjoint UMI sequence not in your dataset and reduce the count by 1 in the count matrix since if all newly added UMI are same and get mapped to same gene then it will be deduplicated to 1.

habilzare commented 5 years ago

I wrote the 'make.dummy.fastq() function in R, which addressed the issue as planned in above.

For some of the samples, I got the "umi indexing of jellyfish failing" error, which I realized from this report must be related to shorter than 26 bps reads. I used trimommatic to exclude those short reads, and finally Alevin produced the count matrix in CSV format for me. Thank you Salmon team!

Minor: The output CSV file has an extra comma at the end of each line, which I ignored in my post-processing step.

k3yavi commented 5 years ago

@habilzare Glad to hear that it worked out and thanks for fast and useful moderation from default alevin pipeline. We will work on correcting the suggestions made in the next release of Salmon.

patrickvdb commented 5 years ago

Hi, I have the exact same issue, in that it is quite likely that some barcodes will be empty and I don't obtain any results now. It seems that you (@k3yavi) closed the feature request when you closed this issue. was that your intention? I'll try to get the work around working, but it would be nice if there is a flag we could set in the future to ignore this error when running salmon.

k3yavi commented 5 years ago

Hi @patrickvdb , Thanks for pointing this out. My impression was there were two issues here:

  1. The presence of whitelist CB w/ 0 frequency. (Subject of this issue).
  2. The presence of whitelist CB w/ 0 mapping reads.

The solution of the first problem is pretty straight forward and since it was the subject of the issue I thought closing this one and opening a new w/ the second problem would be better thing to do. We are in the process of fixing these two issues and my plan was to open and close a new issue w/ the fix but you are right, either this issue should not have been closed or a new should have been created. We'll update this soon, thanks for bringing this to our attention.

k3yavi commented 5 years ago

Just a heads up, issue #266 has been added and the solution is currently available in the source build from the develop branch. We will include this to master with the next planned release of Salmon v0.11.3. Thanks again for the useful feedbacks and comments.