COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
777 stars 165 forks source link

Alevin "incorrect call for umi extract" with --dropseq option #258

Closed kh49 closed 6 years ago

kh49 commented 6 years ago

Hi,

I'm having issues getting alevin to work on dropseq data after following the tutorial for setting it up.

I am using the following command to run it:

salmon alevin -l ISR -1 SRR6054189.sra_1.fastq -2 SRR6054189.sra_2.fastq --dropseq -i ~/Documents/CordBlood/data/index_15 -p 10 -o ~/Documents/CordBlood/data/alevin_out --tgMap ~/Documents/CordBlood/data/txp2gene.tsv --dumpCsvCounts

and eventually get "Incorrect call for umi extract"

Here's the full output:

salmon alevin -l ISR -1 SRR6054189.sra_1.fastq -2 SRR6054189.sra_2.fastq --dropseq -i ~/Documents/CordBlood/data/index_15 -p 4 -o ~/Documents/CordBlood/data/alevin4p_out_combined --tgMap ~/Documents/CordBlood/data/txp2gene.tsv --dumpCsvCounts
Version Info: This is the most recent version of Salmon.
Logs will be written to ~/Documents/CordBlood/data/alevin4p_out_combined/logs
### salmon (single-cell-based) v0.11.0
### [ program ] => salmon 
### [ command ] => alevin 
### [ libType ] => { ISR }
### [ mates1 ] => { SRR6054189.sra_1.fastq }
### [ mates2 ] => { SRR6054189.sra_2.fastq }
### [ dropseq ] => { }
### [ index ] => {~/Documents/CordBlood/data/index_15 }
### [ threads ] => { 4 }
### [ output ] => {~/Documents/CordBlood/data/alevin4p_out_combined }
### [ tgMap ] => {~/Documents/CordBlood/data/txp2gene.tsv }
### [ dumpCsvCounts ] => { }

[2018-07-26 11:15:08.510] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2018-07-26 11:15:08.524] [alevinLog] [info] Processing barcodes files (if Present) 

processed 471 Million barcodes

[2018-07-26 11:25:20.231] [alevinLog] [info] Done barcode density calculation.
[2018-07-26 11:25:20.231] [alevinLog] [info] # Barcodes Used: 470701906 / 471465434.
[2018-07-26 11:25:30.228] [alevinLog] [info] Knee found left boundary at  202 
[2018-07-26 11:25:31.135] [alevinLog] [info] Gauss Corrected Boundary at  22 
[2018-07-26 11:25:31.135] [alevinLog] [info] Learned InvCov: 1044.2 normfactor: 295.235
[2018-07-26 11:25:31.135] [alevinLog] [info] Total 222(has 200 low confidence) barcodes
[2018-07-26 11:25:31.440] [alevinLog] [info] Done True Barcode Sampling
[2018-07-26 11:25:31.789] [alevinLog] [info] Done populating Z matrix
[2018-07-26 11:25:31.793] [alevinLog] [info] Done indexing Barcodes
[2018-07-26 11:25:31.793] [alevinLog] [info] Total Unique barcodes found: 10630133
[2018-07-26 11:25:31.793] [alevinLog] [info] Used Barcodes except Whitelist: 10603
[2018-07-26 11:25:31.938] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2018-07-26 11:25:31.939] [alevinLog] [info] parsing read library format
[2018-07-26 11:25:31.949] [jointLog] [info] There is 1 library.
[2018-07-26 11:25:32.331] [jointLog] [info] Loading Quasi index
[2018-07-26 11:25:32.331] [jointLog] [info] Loading 32-bit quasi index
[2018-07-26 11:25:32.357] [stderrLog] [info] Loading Suffix Array 
[2018-07-26 11:26:09.413] [stderrLog] [info] Loading Transcript Info 
[2018-07-26 11:26:10.896] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-07-26 11:26:11.159] [stderrLog] [info] There were 203027 set bits in the bit array
[2018-07-26 11:26:11.225] [stderrLog] [info] Computing transcript lengths
[2018-07-26 11:26:11.226] [stderrLog] [info] Waiting to finish loading hash
[2018-07-26 11:26:14.654] [stderrLog] [info] Done loading index
[2018-07-26 11:26:14.654] [jointLog] [info] done
[2018-07-26 11:26:14.654] [jointLog] [info] Index contained 203027 targets

Incorrect call for umi extractIncorrect call for umi extract

I traced it back to AlevinUtils.cpp in the source files but could not make sense of it from there.

The program will run completely on the same data and library if I change --dropseq to --Chromium, eventually outputting the following after processing the reads:

[2018-07-24 10:56:20.712] [jointLog] [info] Computed 9968 rich equivalence classes for further processing
[2018-07-24 10:56:20.712] [jointLog] [info] Counted 2785976 total reads in the equivalence classes
[2018-07-24 10:56:20.729] [jointLog] [warning] Only 2785976 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2018-07-24 10:56:20.729] [jointLog] [warning] Found 82730645 reads with CB+UMI length smaller than expected.Please report on github if this number is too large
[2018-07-24 10:56:20.729] [jointLog] [info] Mapping rate = 0.590918%

and then this after processing the cells:

[2018-07-24 10:56:23.180] [alevinLog] [info] Total 21135 UMI after deduplicating.
[2018-07-24 10:56:23.180] [alevinLog] [warning] Skipped 4 barcodes due to No mapped read
[2018-07-24 10:56:23.213] [alevinLog] [info] Clearing EqMap; Might take some time.
[2018-07-24 10:56:23.230] [alevinLog] [info] Starting Import of the gene count matrix.
[2018-07-24 10:56:23.743] [alevinLog] [info] Done Importing gene count matrix for dimension 290x57964
[2018-07-24 10:56:23.743] [alevinLog] [info] Starting dumping cell v gene counts in csv format
[2018-07-24 10:56:29.089] [alevinLog] [info] Finished dumping csv counts
[2018-07-24 10:56:29.089] [alevinLog] [info] Starting white listing
[2018-07-24 10:56:29.090] [alevinLog] [info] Done importing order of barcodes "quants_mat_rows.txt" file.
[2018-07-24 10:56:29.090] [alevinLog] [info] Total 290 barcodes found
[2018-07-24 10:56:29.090] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2018-07-24 10:56:29.090] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2018-07-24 10:56:29.090] [alevinLog] [info] Starting to make feature Matrix
[2018-07-24 10:56:29.354] [alevinLog] [info] Done making regular featues
[2018-07-24 10:56:29.354] [alevinLog] [info] Done making feature Matrix
[2018-07-24 10:56:29.359] [alevinLog] [info] Finished white listing
[2018-07-24 10:56:29.371] [alevinLog] [info] Finished optimizer

Other info: Salmon v0.11.0 - downloaded binary from Github I used Gencode 28 for the transcriptome read files: https://www.ncbi.nlm.nih.gov/sra/SRX2676721[accn]

OS: CentOS version: 2.6.32-696.23.1.el6.centos.plus.x86_64 LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch Distributor ID: CentOS Description: CentOS release 6.9 (Final) Release: 6.9 Codename: Final

k3yavi commented 6 years ago

Hi @pophipi , Thanks for reporting the issue. It looks like a bug has creeped in while merging the drop-seq pipeline to alevin here. If it's an urgent requirement you can swap the above two lines by:

      umi = read.substr(pt.barcodeLength, pt.umiLength);
      return true;

If it's too much trouble to compile from source, we will release a version w/ the hot-fix by today/tomorrow and would update you soon. Thanks again !

P.S: I was curious how did the chromium pipeline went through, since the length requirement of 10x based pipeline is longer and it should break much earlier. The experiment you forwarded above seems to have 25 length bases for CB+UMI sequences. I wonder has any of the Drop-seq guideline changed? I was in the impression it was 12 base CB and 8 base UMI if not, then --dropseq flag would not be ideal thing to use since it will just use 20 bases out of 25 present in the fastq files.

kh49 commented 6 years ago

hi @k3yavi,

Thanks for your help! I'm glad it's a quick fix. As for the dataset, I am not sure why the read length is 25bp. The paper I pulled it from stated that they used the standard DropSeq protocol and did not seem to mention and changes in CB and UMI length. In the case that they did change those lengths, what options can I use to set the pipeline?

k3yavi commented 6 years ago

We might have to go through the paper and the dropseq guidelines to check what really changed. You might wanna check https://github.com/COMBINE-lab/salmon/issues/247, we actually have a hidden option to do customized umi/CB length options, however this goes into a little more unexplored territory and requires a bit more testing. We'd appreciate your feedback if you happen to run this mode.

k3yavi commented 6 years ago

hey @pophipi , we have released v0.11.1 with the fix. Thanks for reporting this issue.

Also, I tried running the dropseq mode for Alevin w/ the data you forwarded but it looks like the mapping rate is too low. I am not sure how to interpret the data, but just for sanity Alevin mapping rate is ~70% in the original Macosko et. al. paper. We will keep looking for the updates in DropSeq pipeline feel free to reopen this issue or create a new one regarding the low mapping rate if you find out the right location of UMI and CB in the dataset or trouble using #247 .

kh49 commented 6 years ago

@k3yavi To follow up on this dataset: The reads were generated using a modified protocol with a 9bp barcode followed by an 8bp UMI. I used the custom length mode to align this data and alignment rate went up to about 45%.

I tried the alignment using DropSeq Tools and STAR and got similar alignment rates, so I think the custom length alignment is working properly. I may try using some other reference databases instead of GRCh38.p12 to see if alignment improves. Otherwise it may just be an issue regarding the dataset itself.

k3yavi commented 6 years ago

Glad to hear that, let us know if you need any other help or have suggestions / feedbacks to improve Alevin.