Identifying correct library type for SPLiT-seq/ParseBio reads

jeremymsimon commented 3 years ago

Hey @rob-p and @k3yavi! Avi and I have chatted a bit about this offline, and I've commented about this in the past, but I'm currently attempting to extend alevin's use to the newer combinatorial indexing experiments like SPLiT-seq or the commercial solution, ParseBio (the biotech formerly known as SplitBio). Now that you support different barcode geometries and allow the user to specify where these barcodes should be expected, this should be doable.

I've downloaded the raw data from the original SPLiT-seq paper, and we have some ParseBio data of our own; my goal is to help you all write a vignette on this once we get the details worked out.

In short, the R1 file contains cDNA sequence, and R2 contains a 10bp UMI followed by 3 different barcodes in the following arrangements (note: ParseBio seems to shift BC1 up a few bp, to 79-86):

UMI: 1-10
BC3: 11-18
BC2: 49-56
BC1: 87-94

The intervening bases are a combination of common and variable sequences.

The other wrinkle with these approaches is that the cDNA can be amplified using either oligo-dT or random hexamer barcodes. The net effect of this is that BC1 can be one of two options, and some barcode combinations need to be pooled to represent the same cell. For example:

AACGTGAT-CTGTAGCC-ACACAGAA GATAGACA-CTGTAGCC-ACACAGAA

may be the same cell, amplified by two different means.

We know how the BC1s are paired, so Avi suggested pre-processing the R2 FASTQ file such that we locate random hexamer BC1 sequences and modify them to the matched oligo-dT sequence. In the above example, we'd correct some R2 reads that contain "GATAGACA" in the 87-94 position to "AACGTGAT". I wrote a script (in perl because I'm old-school) that does this and allows for mismatches within 1 hamming distance. I'm sure someone could port this to python, rust, or C/C++ and speed this step up substantially, but it seems to work fine.

The next step is to run alevin. I've done so using salmon v1.5.2 with the following parameters (for a ParseBio run), reversing R1/R2 as per Avi's suggestion:

salmon alevin -l ISR \
--expectCells 9000 \
--read-geometry 2[1-end] \
--umi-geometry 1[1-10] \
--bc-geometry 1[11-18,49-56,79-86] \
-2 R1.fastq.gz \
-1 R2_corrected.fastq.gz \
-i gencode_v36_transcripts_idx \
-p 8 \
-o SplitBio_SL2_alevin \
--tgMap gencode.v36.txt

I know that these libraries are stranded, but I doubt ISR is the correct architecture, especially given it's not true PE sequencing. Notably, running alevin this way gets me only a 4% mapping rate, and most of my cells get filtered out downstream due to having really low depth.

Below is my salmon_quant.log:

[2021-08-20 12:29:32.343] [jointLog] [info] setting maxHashResizeThreads to 8
[2021-08-20 12:29:32.343] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2021-08-20 12:29:32.343] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2021-08-20 12:29:32.343] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2021-08-20 12:29:32.343] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes. 
[2021-08-20 12:29:32.343] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2021-08-20 12:29:32.343] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2021-08-20 13:19:51.055] [jointLog] [info] There is 1 library.
[2021-08-20 13:19:51.124] [jointLog] [info] Loading pufferfish index
[2021-08-20 13:19:51.125] [jointLog] [info] Loading dense pufferfish index.
[2021-08-20 13:19:53.706] [jointLog] [info] done
[2021-08-20 13:19:53.706] [jointLog] [info] Index contained 231,288 targets
[2021-08-20 13:19:53.777] [jointLog] [info] Number of decoys : 0
[2021-08-20 16:03:00.061] [jointLog] [info] Computed 307,354 rich equivalence classes for further processing
[2021-08-20 16:03:00.061] [jointLog] [info] Counted 46,363,434 total reads in the equivalence classes 
[2021-08-20 16:03:00.062] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2021-08-20 16:03:00.062] [jointLog] [warning] Found 149640 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2021-08-20 16:03:00.062] [jointLog] [info] Mapping rate = 4.01974%

[2021-08-20 16:03:00.062] [jointLog] [info] finished quantifyLibrary()

And here is my lib_format_counts.json, which has 0s for the different potential libTypes and shows a very high rate of inconsistent/orphan mappings:

    "read_files": "[ R2_corrected.fastq.gz, R1.fastq.gz]",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 46363434,
    "num_assigned_fragments": 46363434,
    "num_frags_with_concordant_consistent_mappings": 0,
    "num_frags_with_inconsistent_or_orphan_mappings": 135602410,
    "strand_mapping_bias": 0.0,
    "MSF": 0,
    "OSF": 0,
    "ISF": 0,
    "MSR": 0,
    "OSR": 0,
    "ISR": 0,
    "SF": 0,
    "SR": 0,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0

And here is the alevin_meta_info.json, which shows the huge attrition in recognized CBs and poor mapping rate:

    "total_reads": 1153394214,
    "reads_with_N": 17295,
    "noisy_cb_reads": 708796684,
    "noisy_umi_reads": 149640,
    "used_reads": 444430595,
    "mapping_rate": 4.019738736091838,
    "reads_in_eqclasses": 46363434,
    "total_cbs": 132768705,
    "used_cbs": 398369,
    "initial_whitelist": 5383,
    "low_conf_cbs": 1000,
    "num_features": 5,
    "no_read_mapping_cbs": 148,
    "final_num_cbs": 3931,
    "deduplicated_umis": 24178832,
    "mean_umis_per_cell": 6150,
    "mean_genes_per_cell": 2810

This, in conjunction with the fact that running with -l A suggests incorrectly the data are unstranded (IU), led @mikelove to suspect alevin may be skipping the libType guesswork that salmon typically performs.

I know these data are of reasonably good quality, because I have instead processed them using zUMIs, which supports SPLiT-seq data. A side-by-side of the same FASTQs processed this way using alevin or using zUMIs gave me ~270 cells that pass filters (alevin) vs ~50k cells that pass filters (zUMIs), so something is definitely up here.

I'd like to confirm a few details and ask for some guidance on how to move forward:

Is the reversal of R1/R2 like I did here necessary/recommended/correct?
Is alevin truly skipping the libType identification bits such that I need to know which architecture to use? If so, how do I know?
Why are so many CBs called as "noisy" here?
Why is the mapping rate so low?
Is there anything else I'm missing that can explain the unexpectedly poor outcome, or some other reason why this approach will not work for these data?

I've also attached here the top 1000 lines of each R1/R2-corrected FASTQ (ParseBio) in the hopes that was somehow helpful in diagnosis, but I can share more reads or the full files some other way if needed

head_R1.fastq.txt head_R2_corrected.fastq.txt

Thanks so much!

Jeremy

rob-p commented 3 years ago

Hi @jeremymsimon,

Somehow, the notification for this in my e-mail got classified as SPAM. Anyway, thank you for the detailed description! I'm going to ping @Gaura here. @Gaura — this is the alternative protocol I was discussing with you yesterday. As you can see, the main issue here is the "noisy" barcodes. Let me know what you think would be necessary to add support for this, and I'm happy to schedule a technical discussion if you want to discuss some options.

jeremymsimon commented 2 years ago

Thanks @rob-p and @Gaura - let me know if I can provide any other information or data!

jeremymsimon commented 2 years ago

@rob-p and @Gaura and update on this? Would be good to know if this is feasible

Gaura commented 2 years ago

Hey @jeremymsimon! I checked the protocol and the pipeline code. The protocol you described is v1 and the Parsebio is v2. I have implemented v2 in salmon and would be testing it this week. v1 can be similarly implemented.

I read the paper and other available resources but I am not clear about the random hexamer usage and it's effects on the barcode. Can you please explain what you meant by BC1s being paired and what's the use of random hexamer, please? Thanks.

jeremymsimon commented 2 years ago

Thanks @Gaura! Sounds promising.

Can you clarify what differences you saw between v1/v2 protocols? My understanding was that the only changes were slight differences in barcode positions within the barcode read, ie something that could be handled with different bc-geometry, but maybe that's all you meant in terms of differing implementations.

Regarding the BC1 and how it could be two sequences for the same sample - this is confusing to explain in text format, but all comes down to the sequential nature of how the cells acquire barcodes in this protocol. We start with a 96-well plate, where each of the top 48 wells contain BOTH an oligo-dT barcode and random hexamer barcode. The samples then get added to each well. Biochemistry happens. Then you pool all the cells together, split them back out into 48 wells again, and each well gets its own BC2. Then repeat for BC3.

So a given transcript may get amplified via one of two amplification primers (oligo-dT or random hexamer), but after that, will get a single BC2 sequence and BC3 sequence added after that. In Fig 1A of the Rosenberg paper, it's as though there isn't just an orange sequence carrying out reverse transcription, there are actually two different (known) sequences associated with different routes of amplification per cell.

The net effect is that a given cell can contain transcripts that have a sequence like this: AACGTGAT-CTGTAGCC-ACACAGAA

or like this: GATAGACA-CTGTAGCC-ACACAGAA

where maybe the first sequence was amplified by oligo-dT and the second was amplified via a random hexamer.

Because they have the same BC2 and BC3 sequence, and the BC1 sequences match a known pairing, we know they come from the same cell and therefore the data should be merged.

Any lab running these experiments will have a table of known pairings (ie the two barcodes added to each of first 48 wells), so that they can be merged and treated as though they came from the same cell

This can either be handled upstream of salmon/alevin as a preprocessing step, like what my slow perl script can do, or it can be handled internally. Having alevin do the collapsing would likely be a lot faster and means the FASTQs don't need any editing, which would be preferable, but I would understand if that is out of scope for you all.

Hopefully that explanation is clear, but if you have any other questions on this I'd be happy to meet and discuss

Gaura commented 2 years ago

Thanks Jeremy! Yes, that's what I was hinting at with different v1/v2 protocols. From their code, you can see differences in amplicon sequences:

For v1: 'NNNNNNNNNNIIIIIIIIGTGGCCGATGTTTCGCATCGGCGTACGACTIIIIIIIIATCCACGTGCTTGAGAGGCCAGAGCATTCGIIIIIIII'
For v2: 'NNNNNNNNNNIIIIIIIIGTGGCCGATGTTTCGCATCGGCGTACGACTIIIIIIIIATCCACGTGCTTGAGACTGTGGIIIIIIII' where the IIIIIIII sequence corresponds to barcode. This is from the pipeline code I mentioned earlier used for this paper.

Do you have a the pairing file for the BC1 barcodes? Is it the Supp Table S12 in the Rosenberg paper? It is needed for development and testing.

jeremymsimon commented 2 years ago

Yup this geometry is correct! As for a pairing file for the BC1 barcodes, this may depend on the version and/or implementation in the lab. For ParseBio ("v2"), it seems to be consistent thus far, and I've linked a pairing table for that below. For homemade SPLiT-seq ("v1"), which barcodes end up in which wells, and which wells actually get utilized may vary. Additionally, some other labs may not use random hexamers at all, meaning we should have some flexibility such that users provide their own table and also an option for whether this pairing table is strictly required.

In this barcode sharing file, the first column represents the oligo-dT BC1 sequences, and the second column are the paired random hexamer BC1 sequences. Note this file is the same as this one here on my github.

rob-p commented 2 years ago

Hi @jeremymsimon,

I've discussed the support for SPLiT-seq/ParseBio with @Gaura in some depth. Honestly, I think the cleanest solution right now is just to have a more streamlined (and streaming) way to match / replace the random hexamers upstream of alevin-fry. By my understanding, if we can simply replace barcode 1 appropriately (as your Perl script currently does), everything should work downstream in alevin/alevin-fry.

To that end, I've thrown together a small rust program based on your Perl script. Currently that lives here. It reads the same basic parameters as the Perl script, and writes its output to stdout so that it can be used with named pipes. For example, something like:

  <normal salmon command> -1 read_file_1.fq -2 <(splitp --read-file read_file_2.fq --bc-map bcSharing_example.txt --start 79 --end 86 --one-hamming)

which will transform the second fastq file and stream the transformed reads out which can then be read by alevin-fry. One important thing to note is that while alevin requires the input reads to be a real file (i.e. you can't stream reads in because it does 2 passes), if you are mapping these reads for processing with alevin-fry you can use the process substitution trick above. As you hinted, this program works considerably faster than the Perl script. For example, for the first 10,000,000 reads in SRR6750042, the Perl script took 2m 48s to transform the reads and splitp took ~6s (if the output wasn't being written to a file on disk it took <4s). This should generally be fast enough to not be a speed bottleneck.

So, perhaps the next step is to try to help you walk through this approach with a test dataset (and ideally using alevin-fry) to see if things are turning out as expected?

jeremymsimon commented 2 years ago

Hey @rob-p and @Gaura - thanks for this! My understanding is that your splitp will replace my slow perl script, which is great, and then alevin-fry should work pretty much like any other run, yes? If so, can you comment on the low alignment rate and other oddities I encountered running regular alevin following editing of my R2 FASTQ in this way (documented above), and whether there's something inherently different about alevin-fry that should address those issues? Because I currently detect only a tiny fraction of the cells expected.

I'm more than happy/eager to give splitp+alevin-fry a try, but I suspect there's some secondary issue at hand that we'll need to address downstream

Gaura commented 2 years ago

Hi @jeremymsimon! In order to test and validate the implementation I would need a count matrix generated on samples. Do you have a sample and count matrix from that? The Rosenberg submission of the data has an unclear way of specifying barcodes and I have emailed him about it. If you have count matrix and matching fastqs that we can use to validate, we can wrap it up soon.

jeremymsimon commented 2 years ago

Hey @Gaura - can you clarify what you'd need here? You need a count matrix generated via alevin/alevin-fry on these fastqs? Or you need a count matrix generated by a more "gold-standard" method to compare against?

Gaura commented 2 years ago

Something by a "gold-standard" method to compare against. Sorry for being unclear.

Gaura commented 2 years ago

Yeah, that should quantify all cell barcodes. However, if you use an unfiltered list of barcodes (whitelist) or knee-distance method you would filter some, which I usually do in alevin-fry. Ideally, you wouldn't use alevin to generate the comparison matrix.

jeremymsimon commented 2 years ago

Apologies- I deleted my earlier comment because I realized you don't really need any alevin output from me.

I also can't seem to find actual barcode sequences on the Rosenberg data, the cells seem to just be indexed from 0-163068, which is not going to be helpful here.

I assume you haven't heard back from the authors yet about the actual barcode sequences for these matrices? If not, I think I know of a few other published papers that we may be able to use as test cases instead

Gaura commented 2 years ago

Yeah, I haven't heard back yet. Any test case is fine where the data is publicly accessible. Thanks.

jeremymsimon commented 2 years ago

What about if I use zUMIs on the Rosenberg raw data? We could compare those count matrices to those from alevin. But if you'd feel more comfortable using count matrices directly from a published study, I will look for one to use as an example

Gaura commented 2 years ago

I think a published study would be preferable, but I haven't tried zUMIs. I guess it could work too.

jeremymsimon commented 2 years ago

Hey @Gaura - I have a test dataset for us to use. I'm about to set up an alevin run of my own, but wanted to pass it on to you in the meantime. I haven't yet done any testing or exploration of my own yet, though the data comes from a collaborator of ours.

The raw (FASTQ) and processed data (UMI counts matrix) is accessible from GEO at GSE137941 and SRA (fastq-dump --split-files SRR10174292).

These data were generated with the original SPLiT-seq method (your "v1"). The caveat is that they did NOT combine the oligo-dT and random hexamer barcodes, meaning they are separate cells/columns in their processed data matrix. This means we should be able to run alevin/alevin-fry directly on these FASTQs, bypassing splitp for now, and get something that hopefully matches their processed data matrix.

According to the methods section of their paper, they used GENCODE v28 annotations, and ultimately kept 6,888 nuclei after filtering. Their processed data matrix seems to have 25,000 columns, so I suppose this is pre-filtering.

Let me know what you think!

Gaura commented 2 years ago

This is great find! Thanks @jeremymsimon! I have started the download, will run it today.

jeremymsimon commented 2 years ago

Okay my alevin run finished, and I got a mapping rate of just 6.1% and 2254 cells detected. My lib_format_counts.json contains the following:

{
    "read_files": "[ SRR10174292_2.fastq.gz, SRR10174292_1.fastq.gz]",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 15259749,
    "num_assigned_fragments": 15259749,
    "num_frags_with_concordant_consistent_mappings": 0,
    "num_frags_with_inconsistent_or_orphan_mappings": 61866895,
    "strand_mapping_bias": 0.0,
    "MSF": 0,
    "OSF": 0,
    "ISF": 0,
    "MSR": 0,
    "OSR": 0,
    "ISR": 0,
    "SF": 0,
    "SR": 0,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}

so lots of fragments are discarded for one reason or another, and it's not clear whether the library type assignment is working properly, sort of like my initial example above.

Separately, I'm running zUMIs on the same files and will report back with those data when the run is complete

rob-p commented 2 years ago

Hi @jeremymsimon --- I am noticing this as jumping out:

"num_frags_with_inconsistent_or_orphan_mappings": 61866895,

Is ISR the right protocol for this data, in the manner in which the reads are provided to alevin? @Gaura is doing a run with alevin-fry and we'll discuss those results here when we have them.

--Rob

jeremymsimon commented 2 years ago

@rob-p I have no clue! That's actually the main reason I opened this thread in the first place...

Given only one mate of the pair contains cDNA sequence, I'm not sure what is appropriate here or how to know

rob-p commented 2 years ago

Hi @jeremymsimon,

So I am sure @Gaura will have some more insight tomorrow, but I did a quick alevin --sketch mode mapping of the reads against just the spliced gencode v35 transcriptome, and I get the following:

    "num_processed": 248333566,
    "num_mapped": 55954902,

which works out to ~23% mapping rate. This still seems very low, but if new cells are discovered at the same rate it could definitely get to the ~7,000 mark.

The command I used for processing was:

salmon alevin -lIU -i idx -1 SRR10174292_2.fastq -2 SRR10174292_1.fastq \
--read-geometry 2[1-end] \
--umi-geometry 1[1-10] \
--bc-geometry 1[11-18,49-56,79-86] \
--sketch -o SRR10174292_map -p 16

as you used in the first post in this thread, though I'm not sure if that's right for this data or not.

--Rob

jeremymsimon commented 2 years ago

I think this run should have

--bc-geometry 1[11-18,49-56,87-94]

Also you skipped the splitp step for this run, correct? I saw "split" in your filename so just wanted to check

rob-p commented 2 years ago

Yes, no splitp was used here. I didn’t know the appropriate barcode geometry for this sample, so I just guessed. I’ll retry with the new geometry. However, fir rad/sketch mode, this should not affect the mapping rate at all.

Gaura commented 2 years ago

Yesterday, I got a mapping rate of about 16% with --split-seqV1 flag and alevin-fry using salmon index based on GENCODE v31 reference and salmon index without decoy. The number of nuclei identified with knee-distance filter is 6928. However, the number of barcodes that match b/w the submission and run is 305 which I am trying to figure out why. Earlier, we spotted a position error in about 10% of the split-seq reads in a subset of 2.5M reads. The fixed sequence ATCCACGTGCTTGAGAGGCCAGAGCATTCG doesn't always occur at its position and sometimes it's a base or 2 off, which changes the expected barcodes. For example, the sequence in red is part fixed sequence and part 8-bp barcode and it should be in the end of read but sometimes it is not.

jeremymsimon commented 2 years ago

@Gaura this sort of frameshift in the barcodes is a known issue, and can be computationally challenging (at least for existing methods). zUMIs, for example, does an automatic barcode detection based on fixed barcode positions like we're doing here with alevin, so it would mis-detect cells like the shifted ones you pasted above.

For SPLiT-seq, we do know exactly which barcodes go into the wells, however, so it is technically possible to restrict based on all possible known combinations of barcodes instead and be more positionally flexible. But deciding how many indel bases are allowable, and presumably doing multiple passes through the data to establish an include-list could be time-consuming. Further, the zUMIs developer rightly mentions in this thread that there are likely going to be many unused barcode combinations this way, so lots of time could be spent looking for "cells" that don't actually exist in the data.

The authors of the paper from which our test dataset was derived describe in their methods using a Drop-seq computational framework, so I'm not sure which approach theirs is more similar to.

The simplest approach here is certainly the automatic detection, but it will come at the cost of losing meaningful reads to frameshift errors.

My guess is this falls well out of the scope of alevin, but if you're interested in improving on that, there may be a middle ground between the two approaches above, one that I'm not sure if your group or others have attempted for other methods: we could essentially do a 2-pass barcode detection. The first pass would restrict based on positions like we're already doing, and establish an include-list of possible barcodes seen in the data. Then we could pass through the barcode sequences a second time, looking only for those sequence combinations, but allowing 1-2bp flexibility in the positions they occur, potentially rescuing some of the ones missed during the first pass. This would get around the issue of searching for thousands (or more) barcodes that never exist.

However, for your above sequences in red, we would still need to somehow collapse the barcodes GATAGACA, ATAGACAT, and ATAGACAG, but perhaps that can be achieved with some Levenshtein distance flexibility using the entire 24bp barcode sequence detected...?

Anyway sorry for the brainstorming dump, but the short answer is: we're probably stuck losing a bunch of reads due to positional errors like this

rob-p commented 2 years ago

Hi @jeremymsimon and @Gaura,

After redoing the analysis with the correct barcode geometry (still with --sketch-mode), using knee filtering with alevin-fry leads to the following:

robp@ocean1:/data007/users/rob/SRR10174292$ ~/alevin-fry/target/release/alevin-fry generate-permit-list -k -i SRR10174292_map -o SRR10174292_quant -d both
2021-12-03 09:49:37 INFO paired : false, ref_count : 228,754, num_chunks : 11,194
2021-12-03 09:49:37 INFO read 2 file-level tags
2021-12-03 09:49:37 INFO read 2 read-level tags
2021-12-03 09:49:37 INFO read 1 alignemnt-level tags
2021-12-03 09:49:37 INFO File-level tag values FileTags { bclen: 24, umilen: 10 }
2021-12-03 09:49:50 INFO observed 55,952,280 reads in 11,194 chunks --- max ambiguity read occurs in 2,330 refs
2021-12-03 09:49:50 INFO max_idx = 268803
2021-12-03 09:49:50 INFO max_idx = 86532
2021-12-03 09:49:50 INFO max_idx = 30016
2021-12-03 09:49:50 INFO max_idx = 12061
2021-12-03 09:49:50 INFO max_idx = 8292
2021-12-03 09:49:50 INFO max_idx = 7396
2021-12-03 09:49:50 INFO max_idx = 7112
2021-12-03 09:49:50 INFO max_idx = 7012
2021-12-03 09:49:50 INFO max_idx = 6969
2021-12-03 09:49:50 INFO max_idx = 6952
2021-12-03 09:49:50 INFO knee-finding iter = 10
2021-12-03 09:49:50 INFO max_idx = 6944
2021-12-03 09:49:50 INFO max_idx = 6938
2021-12-03 09:49:50 INFO max_idx = 6937
2021-12-03 09:49:50 INFO knee distance method resulted in the selection of 6938 permitted barcodes.
2021-12-03 09:49:52 INFO total number of distinct corrected barcodes : 391,939

so 6938 detected cells seems about right. Of course, since I'm not using the external unfiltered permit list here, they are likely not the same barcodes as in the deposited data due to the issue that @Gaura mentioned.

jeremymsimon commented 2 years ago

Hey @Gaura and @rob-p - just as a point of comparison, I ran zUMIs on the same exact files. With nominal filtering I get 12,942 cells recovered, of which 10,386 (80%) were also contained in the published matrix (exact sequence matches).

Happy to provide any of those files if that's useful. But it does seem to argue something is funky when it comes to alevin's detection of these barcodes, right?

Gaura commented 2 years ago

Hey @jeremymsimon! I figured out what the issue was. Alevin is fine, but my implementation of the split-seq protocol had an issue. I was able to get good correlation with the submitted counts and good match between barcodes. Here's what I did:

Used fastp to clean the fastq files.

Ran salmon alevin with custom geometry flags and --justAlign. For index I used human GENCODE v31 reference.

salmon alevin -i /biodb/human/gencode/v31/salmon_index_wo_decoy/ -l A -1 SRR10174292_trimmed_1.fastq.gz -2 SRR10174292_trimmed_2.fastq.gz -p 44 --bc-geometry 2[11-18,49-56,87-94] --umi-geo 2[1-10] --read-geo 1[1-end] -o runtestdata_custombc --justAlign

Used alevin-fry on the output

$ cd runtestdata_custombc
$ alevin-fry generate-permit-list -i ./ -d both --output-dir  run -k
2021-12-06 15:36:39 INFO paired : false, ref_count : 226,030, num_chunks : 7,9282021-12-06 15:36:39 INFO read 2 file-level tags2021-12-06 15:36:39 INFO read 2 read-level tags
2021-12-06 15:36:39 INFO read 1 alignemnt-level tags
2021-12-06 15:36:39 INFO File-level tag values FileTags { bclen: 24, umilen: 10 }2021-12-06 15:36:45 INFO observed 39,536,527 reads in 7,928 chunks --- max ambiguity read occurs in 196 refs2021-12-06 15:36:45 INFO max_idx = 170639
2021-12-06 15:36:45 INFO max_idx = 59128
2021-12-06 15:36:45 INFO max_idx = 21206
2021-12-06 15:36:45 INFO max_idx = 10151
2021-12-06 15:36:45 INFO max_idx = 7852
2021-12-06 15:36:45 INFO max_idx = 72462021-12-06 15:36:45 INFO max_idx = 70432021-12-06 15:36:45 INFO max_idx = 69602021-12-06 15:36:45 INFO max_idx = 6937
2021-12-06 15:36:45 INFO max_idx = 6925
2021-12-06 15:36:45 INFO knee-finding iter = 10
2021-12-06 15:36:45 INFO max_idx = 6922
2021-12-06 15:36:45 INFO knee distance method resulted in the selection of 6923 permitted barcodes.
2021-12-06 15:36:46 INFO total number of distinct corrected barcodes : 333,352
$ alevin-fry collate -i run/ -t 16 -r ./                     
2021-12-06 15:37:02 INFO filter_type = Filtered
2021-12-06 15:37:02 INFO collated rad file will not be compressed
2021-12-06 15:37:02 INFO paired : false, ref_count : 226,030, num_chunks : 7,928, expected_ori : Unknown
2021-12-06 15:37:02 INFO read 2 file-level tags
2021-12-06 15:37:02 INFO read 2 read-level tags
2021-12-06 15:37:02 INFO read 1 alignemnt-level tags
2021-12-06 15:37:02 INFO File-level tag values FileTags { bclen: 24, umilen: 10 }2021-12-06 15:37:06 INFO deserialized correction map of length : 1,317,581
2021-12-06 15:37:06 INFO Generated 14 temporary buckets.
⠁ [00:00:00] [╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠈ [00:00:00] [╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠲ [00:00:00] [╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠁ [00:00:00] [╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠒ [00:00:00] [╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠖ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠄ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠚ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠐ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟]    ⠄ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟]    ⠴ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟]    ⠄ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟]    ⠤ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟]    ⠄ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]      [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]      [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]    7928/7928    partitioned records into temporary files.
⠁ [00:00:00] [╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠙ [00:00:00] [╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠒ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠂ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠲ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟]    ⠄ [00:00:00] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟]    ⠠ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]      [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]      14/14      gathered all temp files.
2021-12-06 15:37:08 INFO writing num output chunks (6,923) to header
2021-12-06 15:37:08 INFO expected number of output chunks 6,923
2021-12-06 15:37:08 INFO finished collating input rad file "./map.rad".
$ alevin-fry quant -m /biodb/human/gencode/v31/txp2gene.tsv -i ./run/ -o ./res -t 16 -r full --use-mtx 
2021-12-06 15:37:20 INFO quantifying from uncompressed, collated RAD file File { fd: 4, path: "../runtestdata_custombc/run/map.collated.rad", read: true, write: false }
2021-12-06 15:37:20 INFO paired : false, ref_count : 226,030, num_chunks : 6,923
2021-12-06 15:37:21 INFO tg-map contained 60,603 genes mapping to 226,030 transcripts.
2021-12-06 15:37:21 INFO read 2 file-level tags
2021-12-06 15:37:21 INFO read 2 read-level tags
2021-12-06 15:37:21 INFO read 1 alignemnt-level tags
2021-12-06 15:37:21 INFO File-level tag values FileTags { bclen: 24, umilen: 10 }
⠓ [00:00:00] [╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠐ [00:00:00] [╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠁ [00:00:00] [╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠴ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠁ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠤ [00:00:01] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠠ [00:00:02] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠐ [00:00:03] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟╟╟╟]    ⠋ [00:00:03] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟╟╟╟]    ⠄ [00:00:04] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢░╟╟╟╟╟╟╟]    ⠈ [00:00:04] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟╟╟╟]    ⠙ [00:00:04] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢▌╟╟]    6551/6923    
2021-12-06 15:37:26 WARN 
found connected component with 30679 vertices, resolved into 18 UMIs over 10 genes with trivial resolution.
[00:00:07] [╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢╢]    6923/6923    finished quantifying 6,923 cells.2021-12-06 15:37:28 INFO processed 26,250,078 total read records

Found that 6913 out of 6923 (>99%) barcodes are present in the submitted data.
Finally ran a correlation b/w the alevin-fry output (located in res/alevin) and submitted data. Here are the results:

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03701 0.74469 0.88377 0.83131 0.94898 1.00000

This demonstrates that alevin performs well with split-seq protocol. Let me know what you think, @jeremymsimon.

jeremymsimon commented 2 years ago

@Gaura that definitely seems promising! A few questions:

-Is quantification via alevin-fry (rather than alevin itself) mandatory here? I ask since your run seems successful whereas my full alevin run had a very poor BC detection and mapping rate.

-I see you specified -l A - can you comment on what the detected/correct library type was here?

-I assume all of this will also work in conjunction with --expectCells or --keepCBFraction if those parameters were needed? Your ~7k cells detected is very close to the published number post-filtering, but no similar filtering has been done here yet. My guess is that the proportion of cells that pass these filters will be higher for alevin, but we may still be under-estimating the number of real cells by a little bit here.

-Is there any prospect of dealing with frameshift errors in the barcode detection step? Or is that out of scope?

Thanks!

rob-p commented 2 years ago

Hi @jeremymsimon,

A few quick thoughts on this.

-Is quantification via alevin-fry (rather than alevin itself) mandatory here? I ask since your run seems successful whereas my full alevin run had a very poor BC detection and mapping rate.

I don't think we've been able to successfully obtain the same concordance with alevin yet (as opposed to alevin-fry). There is more sophisticated internal barcode logic going on there, and we may need to pull @k3yavi in to see what is happening outside of the RAD -> fry pipeline.

-I see you specified -l A - can you comment on what the detected/correct library type was here?

Unlike alevin, when you run with in --rad or --sketch mode, the library type isn't really relevant. All mappings are passed through to the rad file. Subsequently, in alevin-fry there is a -d (direction) flag that is used to filter mappings that don't concord with the expected orientation. I'm not sure what @Gaura used in the run above — the default is to keep reads from either orientation.

-I assume all of this will also work in conjunction with --expectCells or --keepCBFraction if those parameters were needed? Your ~7k cells detected is very close to the published number post-filtering, but no similar filtering has been done here yet. My guess is that the proportion of cells that pass these filters will be higher for alevin, but we may still be under-estimating the number of real cells by a little bit here.

According to the commands listed, @Gaura used alevin-fry's built-in knee-like filtering. This tries to use a knee on the cumulative read count histogram to determine a good cutoff for "reliable" cells versus poor quality cells. Alternatively, one can provide an external permit list with -u (unfiltered permit list) to quantify all cells that match any known barcode. This will generally result in many more quantified cells, which you will then want to filter post-quantification.

jeremymsimon commented 2 years ago

Okay got it. And is there any prospect of dealing with frameshift errors in the barcode detection step? Or is that out of scope?

rob-p commented 2 years ago

Hi @jeremymsimon — @Gaura is going to take a look at unfiltered permit listing and will share those results here later.

Regarding frameshift errors, I think that's certainly out of scope for the alevin -> fry phase, but that type of thing could be in scope for splitp.

Basically, my logic / reasoning is this: I'd like to avoid further complicating the already immense salmon/alevin codebase with special implementations handling problems outside of their core function (e.g. mapping reads to the reference efficiently and quantifying UMIs/barcode). Since most protocols (and the most common) have quite simple barcode geometry, it makes sense for this code to live there. I'm fully supportive of enabling support for more complex barcode geometries and preprocessing requirements if there are folks whom it would help, but it feels like that essential complexity belongs upstream of alevin / fry, so that by the time the reads get to alevin, it can assume a straightforward geometry.

So TLDR : I think we'd be willing to investigate what is required to address potential frameshift errors, and how much of a difference that makes, but I think that analysis and eventual implementation (if we decide it's worth it), belongs in splitp.

jeremymsimon commented 2 years ago

I confirmed with the developer of zUMIs that no frameshift detection/correction is happening in their approach for SPLiT-seq libraries, so the barcode discovery should be fairly consistent with what alevin is already doing (ie with fixed geometry positions). So, likely no need to incorporate this into splitp at the moment but if we/others determine that frameshifts are frequent enough and the data can improve in some noticeable way with correcting them, we can revisit later as you suggested.

As for the barcode detection - my usual approach with alevin at least is to let it try to estimate a "real" cell number, but if it's way off from our experimental expectations, to inject --expectCells ncells and let that serve as a starting point (with subsequent filtering). That has worked reasonably well in the past for me , and seems to be an option for alevin-fry as well. I don't know whether that is poor practice in the long run...it came from a place of seeing far too many weak knee plots early in the droplet scRNA-seq days. Are you generally more trusting of these estimates these days?

Gaura commented 2 years ago

Hey @jeremymsimon! I generated the barcode list using all combinations of 8-bp barcode sequence in sci-seq-pipeline github repo for version v1. Using alevin-fry generate-permit-list -u barcode_list instead of -k resulted in 142,667 nuclei. After using Seurat filtering on the generated counts, min.cells = 3, min.features = 200 as done here, I got 7,442 nuclei. All of the barcodes were present in the submission. The correlation b/w alevin-fry and submission is also good:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2966  0.7127  0.8689  0.8163  0.9448  0.9963

Addressing your previous question, I would recommend using salmon alevin and alevin-fry for single cell quantification. It is much faster than using alevin. Let me know if you have any questions.

rob-p commented 2 years ago

Hi @jeremymsimon, to answer your last question:

As for the barcode detection - my usual approach with alevin at least is to let it try to estimate a "real" cell number, but if it's way off from our experimental expectations, to inject --expectCells ncells and let that serve as a starting point (with subsequent filtering). That has worked reasonably well in the past for me , and seems to be an option for alevin-fry as well. I don't know whether that is poor practice in the long run...it came from a place of seeing far too many weak knee plots early in the droplet scRNA-seq days. Are you generally more trusting of these estimates these days?

So one of the nice aspects of the alevin to alevin-fry pipeline is that it's relatively easy to try different filtering approaches since the initial mapping process only has to happen once. In general, the knee detection method is pretty good, and often gives a reasonable cell count. However, this isn't always the case. What we find in the alevin-fry pre-print is that it tends to be slightly more conservative than if you did e.g. unfiltered quantification followed by filtering with something like DropletUtils (but usually only slightly). The knee method is basically the iterative knee finding procedure from UMI-tools, with some slight tweaks to the parameters.

However, unlike alevin, alevin-fry also supports unfiltered quantification. In this case, you provide an unfiltered-permitlist, which is a set of acceptable barcodes (not necessarily all expected to be present), and alevin-fry will correct against this. This will tend to produce a lot of quantified cells, since we quantify any barcode matching 10 or more reads (by default, this value is modifiable on the command line). So, such unfiltered matrices definitely need to be filtered after quantification. However, for protocols with an external permit list, or those where you can reasonably derive a list of potential expected barcodes, it's less stringent and therefore potentially a bit more sensitive than knee-based filtering.

jeremymsimon commented 2 years ago

Got it, thanks @rob-p! I will pull the develop branch soon and give this all a try on my end. Thanks to you and @Gaura for your hard work on this!

rob-p commented 2 years ago

Sounds good @jeremymsimon! Thank you for all of your help on this. If you find anything unexpected (or expected :P) once you have a chance to try this out on your end, let us know. Since I think this issue is now resolved, I'm going to close it. But we can open it back up, or open another issue, if we need to continue the discussion.

COMBINE-lab / salmon

Identifying correct library type for SPLiT-seq/ParseBio reads #699