COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Alevin whitelist question #428

Closed annajbott closed 5 years ago

annajbott commented 5 years ago

Hi, I've run Alevin and generated an alevinQC report. The initial whitelist contains 5261 cells and the final whitelist contains 4340 cells. Correspondingly, filtered_cb_frequency.txt contains 5261 cells and whitelist.txt contains 4340 cells. AlevinQC states that "Once the initial set of whitelisted cell barcodes is defined, Alevin goes through the remaining cell barcodes. If a cell barcode is similar enough to a whitelisted cell barcode, it will be corrected and the reads will be added to those of the whitelisted one." However, my final counts matrix contains 5261 cells, the number from the initial whitelist. Shouldn't my final counts matrix contain 4340 cells after the correction has taken place? I'm running:

salmon alevin -l ISR -1 test.fastq.1.gz -2 test.fastq.2.gz --chromium -i geneset.dir/geneset_all.salmon.index -p 16 -o salmon.dir/test --tgMap t2gmap.tsv --dumpFeatures --dumpUmiGraph

Thanks, Anna

k3yavi commented 5 years ago

Hi @annajbott ,

Thanks for your question. This is expected behavior. Alevin also writes out some low-confidence cell barcodes (CBs), because they are useful for certain kinds of downstream processing. The output alevin folder also contains a file, whitelist.txt, which lists the whitelisted CB names (4340 in your case). After loading the full matrix, you may need to subset it so that only the cells that pass the whitelisting filter remain. To import the matrix into R, please check out tximport, which loads it very efficiently. If you need statistics on resource usage, see EDS.

annajbott commented 5 years ago

Okay cool cheers, I've subset the matrix using whitelist.txt. Thanks!
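
For readers who want to do the same subsetting outside R, here is a minimal Python sketch of the idea discussed above: keep only the rows of the count matrix whose cell barcode appears in whitelist.txt. It is an illustration under stated assumptions, not the method used in this thread. The helper names `read_barcodes` and `subset_to_whitelist` are hypothetical, the matrix is assumed to be already loaded as one row per cell, and the toy barcodes and counts are made up.

```python
from typing import List, Sequence, Tuple


def read_barcodes(path: str) -> List[str]:
    """Read one barcode per line from a text file, skipping blank lines.

    Hypothetical helper: assumes a plain-text file such as whitelist.txt
    with a single cell barcode on each line.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def subset_to_whitelist(
    row_barcodes: Sequence[str],
    matrix: Sequence[Sequence[float]],
    whitelist: Sequence[str],
) -> Tuple[List[str], List[Sequence[float]]]:
    """Keep only the rows (cells) whose barcode is in the whitelist.

    Assumes matrix[i] holds the counts for the cell named row_barcodes[i].
    The original row order is preserved.
    """
    keep = set(whitelist)  # set membership makes each lookup O(1)
    kept_barcodes: List[str] = []
    kept_rows: List[Sequence[float]] = []
    for barcode, row in zip(row_barcodes, matrix):
        if barcode in keep:
            kept_barcodes.append(barcode)
            kept_rows.append(row)
    return kept_barcodes, kept_rows


# Toy data (made up): four cells by two genes, two cells whitelisted.
row_barcodes = ["AAAC", "AAAG", "AAAT", "AACA"]
matrix = [[3, 0], [1, 2], [0, 0], [5, 1]]
whitelist = ["AAAC", "AACA"]

kept_barcodes, kept_rows = subset_to_whitelist(row_barcodes, matrix, whitelist)
print(kept_barcodes)
print(kept_rows)
```

With real output, `whitelist` could come from `read_barcodes` pointed at whitelist.txt, and `row_barcodes` from whatever file or object holds the matrix's cell names in your setup. Those paths are assumptions and depend on the alevin version and options used.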