YosefLab / Cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction
https://cassiopeia-lineage.readthedocs.io/en/latest/
MIT License

Key error when calling cassiopeia.pp.call_lineage_groups #110

Closed oligomyeggo closed 3 years ago

oligomyeggo commented 3 years ago

Hello, I am following the preprocessing notebook in the refactor branch with my own GESTALT data. I believe it has been working well (or at least has not errored out; I haven't had a chance to dig into the various outputs to see if things make sense with my experiment), up until the last step in the notebook. When I ran allele_table = cassiopeia.pp.call_lineage_groups(umi_table, output_dir), I got the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Sample'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3575         try:
-> 3576             loc = self._info_axis.get_loc(key)
   3577         except KeyError:

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 

KeyError: 'Sample'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-10-2b5fa699a3d8> in <module>
----> 1 allele_table = cassiopeia.pp.call_lineage_groups(umi_table, output_dir)

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/pipeline.py in call_lineage_groups(input_df, output_directory, min_umi_per_cell, min_avg_reads_per_umi, min_cluster_prop, min_intbc_thresh, inter_doublet_threshold, kinship_thresh, verbose, plot)
    997     )
    998 
--> 999     allele_table = l_utils.filtered_lineage_group_to_allele_table(filtered_lgs)
   1000 
   1001     if verbose:

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/lineage_utils.py in filtered_lineage_group_to_allele_table(filtered_lgs)
    408 
    409     final_df["Sample"] = final_df.apply(
--> 410         lambda x: x.cellBC.split(".")[0], axis=1
    411     )
    412 

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3042         else:
   3043             # set column
-> 3044             self._set_item(key, value)
   3045 
   3046     def _setitem_slice(self, key: slice, value):

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3119         self._ensure_valid_index(value)
   3120         value = self._sanitize_column(key, value)
-> 3121         NDFrame._set_item(self, key, value)
   3122 
   3123         # check if we are modifying a copy

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3577         except KeyError:
   3578             # This item wasn't present, just insert at end
-> 3579             self._mgr.insert(len(self._info_axis), key, value)
   3580             return
   3581 

~/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1196             value = _safe_reshape(value, (1,) + value.shape)
   1197 
-> 1198         block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
   1199 
   1200         for blkno, count in _fast_count_smallints(self.blknos[loc:]):

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype)
   2742         values = DatetimeArray._simple_new(values, dtype=dtype)
   2743 
-> 2744     return klass(values, ndim=ndim, placement=placement)
   2745 
   2746 

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2398             values = np.array(values, dtype=object)
   2399 
-> 2400         super().__init__(values, ndim=ndim, placement=placement)
   2401 
   2402     @property

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    129         if self._validate_ndim and self.ndim and len(self.mgr_locs) != len(self.values):
    130             raise ValueError(
--> 131                 f"Wrong number of items passed {len(self.values)}, "
    132                 f"placement implies {len(self.mgr_locs)}"
    133             )

ValueError: Wrong number of items passed 9, placement implies 1

Any insights as to what might have happened? I haven't changed any of the options or arguments for any of the steps in the preprocessing notebook. The only difference is that I started with a different bam file. Thanks for any help!

mattjones315 commented 3 years ago

Hello,

In looking at the code, I realize that this step might not really be generalizable. Specifically, in our analyses we typically prepend the lane or some other sample identifier to each cellBC, and this step parses that identifier out and adds it as a variable to the data frame. It shouldn't affect any analysis, but it's a nice-to-have for our work. I'm going to remove this assumption from the codebase.

However, I don't think it should affect your analysis -- I'm not sure why this is throwing an error complaining that "Sample" doesn't exist as a column in your dataframe, since this is exactly the line where we add that column. Instead, I'm thinking your molecule table is malformed in some way.

Can you please attach a toy example of your molecule table (what you're using as input to this function) so I can evaluate if there's something else wrong with the molecule table? It would be interesting to see what features we need to make a bit more flexible in our pipeline to support your dataset. We'd really like to fully support GESTALT data in this pipeline so thank you for working with us!

oligomyeggo commented 3 years ago

Hi @mattjones315, thanks for your help! I am doing the first GESTALT experiment on our campus, so I appreciate the patience and support as I go through your pipeline with my data. I've attached a small toy_umi_table which should hopefully let you see whether my molecule table is malformed and whether there is a previous preprocessing step that I should revisit to fix any potential issue. Let me know if that file doesn't work for you for some reason. toy_umi_table.csv

mattjones315 commented 3 years ago

Hi @oligomyeggo - no problem at all! That's really exciting to hear that you're pioneering the way for GESTALT on your campus, I hope this will be a useful thread for future users too.

In looking at your example molecule table, I find it odd that you don't have any indel calls (these are stored in the columns r1, r2, and r3). I suspect this is because the cut-site anchors were not properly set when calling indels with cas.pp.call_alleles. If you take a look at the documentation hosted here, you can see that we support users specifying the exact cut-site locations (and anticipated cut-site window width) in cas.pp.call_alleles. (Also, I'm anticipating that you won't have an intBC on your GESTALT vector, so you can set that argument to (0,0), I think.) When you were pushing your data through, did you use the default settings here, or did you specify your cut-site locations? If it's the latter, can you attach the reference sequence and your intended cut-site locations so I can dig into the codebase and see why it's not working as expected?

Also, it looks like you have well-formed CIGAR strings, so I'm assuming you were able to specify your reference sequence appropriately in cas.pp.align_sequences, but can you confirm here that you were able to do so?

Thanks again for your patience and hope this helps.

oligomyeggo commented 3 years ago

Ah, that was definitely an oversight on my part! I ran through the notebook again using default parameters, but adjusted call_alleles per your recommendations. My code looked like this:

umi_table = cassiopeia.pp.call_alleles(umi_table, 
                                       ref_filepath = target_site_reference, 
                                       barcode_interval = (712, 977),
                                       cutsite_locations = [728, 755, 782, 809, 836, 863, 890, 917, 944, 971],
                                       cutsite_width = 54)

Based on my understanding, I do want to use the barcode_interval based on my experimental design; the GESTALT barcode array is located in the 3' UTR of the DsRed sequence, and my reference file includes the DsRed sequence, GESTALT array, and SV40 sequence. Is that correct, or did I misinterpret the documentation?

I also have a question about the cutsite_width argument, which I probably set incorrectly. I am not actually sure how to determine this. Is it just a matter of counting the nucleotides on each side of a cutsite until you reach another cutsite (that's what I did in the code above)? Or is it more restrictive than that? I know we have 10 CRISPR/Cas9 targets (23 nucleotides each) separated by 4-nucleotide linkers. The cutsite for each target is the 17th nucleotide in the sequence. So, would my cutsite_width actually be the number of nucleotides on both sides of a cutsite within a single target (and if that's the case, since the number of nucleotides left and right of the cutsite differ, would I just pick the greater of the two)? Sorry if this isn't clear and is a bit naive. This is my first time working with a CRISPR/Cas9 system and lineage barcoding. I've attached both my reference file and cutsite file if that helps clarify things.

At any rate, adjusting the arguments for the call_alleles function seemed to have fixed the molecule table, as I was able to run the call_lineage_groups function without issue this time, so that's something! It looks like I was able to recover alleles for 279 cells, which might be a little on the low end? (Our transcriptome dataset has ~8500 cells). I've attached my updated toy_umi_table if that is helpful to look at. Happy to also provide the allele_table if that is useful.

toy_umi_table.csv gestalt.fa.cutSites.txt gestalt.fa.txt

Do you think there are any other parameters in any of the other preprocessing steps that I should consider with my GESTALT data, or are the default arguments most likely sufficient?

I know that was a lot of questions. Thanks again for your help! I appreciate it.

mattjones315 commented 3 years ago

Hi @oligomyeggo -- great, thanks for providing more information here. With these updated parameters, the resulting molecule table looks a lot better! But there are a few more things you should tweak, I think --

I think there are a couple of parameters that are not clearly documented if you are not used to working with our Weissman lab recorder. Specifically, the barcode_interval parameter refers specifically to the integration barcode (a ~10bp randomer at the beginning of every TargetSite) in our technology, not the entire TargetSite cassette. I don't think the GESTALT technology has an integration barcode (or intBC), so you can set this value to (0,0) -- please correct me if I'm wrong. I'm thinking the desired behavior would be to have every intBC set to "NC", because in the GESTALT system there are no intBCs.

Additionally, your reference sequence appears to be far longer than any of your observed sequences. We often begin our references with the primer site -- in this case, I believe your reference sequence would be the same as what you thought your barcode should have been: specifically, the interval (712, 977). You can probably set a longer interval if you'd like, but it helps to begin the reference sequence with the primer sequence.

Finally, we use cutsite_width to determine how far away from the intended cut-site to look for indels (there can be some stochasticity in how we align, how we sequence, or even how Cas9 cuts). Our default is 12, but you can modify this if you'd like. Specifically, if targets are a distance of 4 nucleotides from one another, then you probably want to set cutsite_width to around 3 or 4.
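As a rough illustration of the geometry (this is just a sketch, not Cassiopeia's implementation), using the cut-site coordinates from your call above: each cut site gets a symmetric search window of cutsite_width bases on either side, and adjacent windows shouldn't bleed into neighboring targets.

```python
# Sketch only: the search window implied by cutsite_width around each cut site.
cutsite_locations = [728, 755, 782, 809, 836, 863, 890, 917, 944, 971]
cutsite_width = 4

# Each window is (site - width, site + width).
windows = [(site - cutsite_width, site + cutsite_width) for site in cutsite_locations]

# With targets ~27bp apart and width 4, consecutive windows never overlap.
no_overlap = all(
    a_end < b_start
    for (_, a_end), (b_start, _) in zip(windows, windows[1:])
)
```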

I think making these changes will allow Cassiopeia to detect more cells as well, since cells and molecules might be filtered out if their alleles are being called incorrectly. To check though, how many cells do you have after each step? There should be a log file that you can check to see the number of cells / UMIs that are maintained and filtered after each step. It should be saved in the output directory location. If you're losing too many cells during filtering, you can modify the UMI / cellBC and average reads / UMI thresholds and this should result in keeping around more cells.

I should also ask if you expect there to be many clonal populations in this dataset, and if so how you intend to detect them. In our system, we use sets of intBCs to detect clonal populations but if you don't have this information in your GESTALT system then I'm afraid the call_lineage_groups functionality will not work as intended. This might also be a reason why you're observing a lot of cells being filtered out. Instead, you may be better off collapsing the molecule table to an allele table directly, using this code:

from cassiopeia.preprocess import lineage_utils
allele_table = lineage_utils.filtered_lineage_group_to_allele_table([umi_table])

You can then do one more round of quality-control filtering if you'd like:

from cassiopeia.preprocess import utilities
allele_table = utilities.filter_cells(
        allele_table,
        min_umi_per_cell=int(min_umi_per_cell),
        min_avg_reads_per_umi=min_avg_reads_per_umi,
        verbose=verbose,
    )

With this allele table, you might then need to come up with a way to identify clonal populations after the fact, if necessary. Hope this helps, and please let me know if you have any more questions!

Best, Matt

oligomyeggo commented 3 years ago

Hi @mattjones315, thanks so much for the detailed response! It was helpful and did clarify some things. I believe you are correct that the GESTALT system does not have intBCs. The GESTALT barcode array is simply 10 TargetSites (23bp in length), each separated by a 4bp linker sequence (the same 4bp linker sequence between each TargetSite).

I reran the preprocessing pipeline after adjusting my reference to only include our primer sequence and barcode array (and I also adjusted my cutsites to reflect the abbreviated reference file). My new command is as follows:

umi_table = cassiopeia.pp.call_alleles(umi_table, 
                                       ref_filepath = target_site_reference, 
                                       barcode_interval = (0, 0),
                                       cutsite_locations = [42, 69, 96, 123, 150, 177, 204, 231, 258, 285],
                                       cutsite_width = 4)

I've attached my updated reference and cutsite files, just in case.

Concerning the number of cells, please find attached the preprocess.log file from my previous run ('_prev') and most recent run ('_current'). It looks like the most recent parameters actually filtered out more cells than the previous run, so I must not be quite getting the arguments right still.

To be honest, I am not sure how many clonal populations we expect. We introduced mutations to the barcode array at the one-cell stage in zebrafish, so presumably we might have many distinct clonal populations. I am also not sure how I intend to detect them, as I was hoping this was something built into the pipeline somewhere. It sounds like I might have to dig into some other options in order to address this? I'm sure I can figure something out!

When I tried to run the following code:

from cassiopeia.preprocess import lineage_utils
allele_table = lineage_utils.filtered_lineage_group_to_allele_table([umi_table])

I received this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-88ce7d8b6352> in <module>
      1 from cassiopeia.preprocess import lineage_utils
----> 2 allele_table = lineage_utils.filtered_lineage_group_to_allele_table([umi_table])

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/lineage_utils.py in filtered_lineage_group_to_allele_table(filtered_lgs)
    403     grouping = ["cellBC", "intBC", "allele"] + grouping + ["lineageGrp"]
    404 
--> 405     final_df = final_df.groupby(grouping, as_index=False).agg(
    406         {"UMI": "count", "readCount": "sum"}
    407     )

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   6523             squeeze=squeeze,
   6524             observed=observed,
-> 6525             dropna=dropna,
   6526         )
   6527 

~/.local/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    531                 observed=observed,
    532                 mutated=self.mutated,
--> 533                 dropna=self.dropna,
    534             )
    535 

~/.local/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    784                 in_axis, name, level, gpr = False, None, gpr, None
    785             else:
--> 786                 raise KeyError(gpr)
    787         elif isinstance(gpr, Grouper) and gpr.key is not None:
    788             # Add key to exclusions

KeyError: 'lineageGrp'

I've attached another toy_umi_table if that's helpful for this new error. Sorry that all of this is a bit messy. It's definitely a "learn as I go" kind of project, and again I really appreciate your time and patience! If at any point you feel it is better to open up a new issue and parse this out error-by-error, or move to email for more theoretical/conceptual discussion instead of clogging up your Github issues just let me know!

gestalt.abbrv.fa.cutSites.txt gestalt.abbrv.fa.txt preprocess_current.log preprocess_prev.log toy_umi_table.csv

mattjones315 commented 3 years ago

Hi @oligomyeggo - thanks for this information, this is very helpful!

The good news is that if you are starting tracing at the one-cell embryo stage in zebrafish, you probably don't have to do any lineage group calling. This is because the entire zebrafish, as I understand it, can be regarded as a single clone -- in other words, every cell is related to one another by the lineage tracer. You can identify interesting subclones from the trees built from the GESTALT arrays. I will fix the issue in filtered_lineage_group_to_allele_table. For the time being, you can circumvent this by adding a dummy column lineageGrp to the molecule table (just set this to 1, or something arbitrary).
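For example, something like this (a minimal sketch on a toy frame -- any constant value works, since the whole fish is one clone):

```python
import pandas as pd

# Toy stand-in for the molecule table; the real umi_table has many more columns.
umi_table = pd.DataFrame({
    "cellBC": ["AAAC.1", "AAAC.1", "AAAG.1"],
    "UMI": ["ACGT", "ACGA", "TTGC"],
})

# Dummy lineage group so the groupby inside
# filtered_lineage_group_to_allele_table finds the key it expects.
umi_table["lineageGrp"] = 1
```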

I did notice that your alignments still look funky, and I think this is because our alignment strategy is not optimized for GESTALT arrays. After re-reading the original study, I found that they used the Needleman-Wunsch algorithm with a gap open penalty of 10.0 and a gap extend penalty of 0.05. Unfortunately, there is no optimized Needleman-Wunsch implementation in scikit-bio (which we use for alignment -- we use a "local" alignment strategy instead), but I did find that changing Cassiopeia's parameters to a gap open penalty of 10.0 and a gap extend penalty of 0.05 resulted in more reasonable alignments. I suggest you try changing these parameters in cas.pp.align_sequences.

As for filtering out cells, I noticed that a good chunk of cells are filtered out in the very first round of filtering. This is not necessarily surprising, but I'm curious whether the settings need to be changed a bit. We've noticed that if we sequence our data on a MiSeq, for example, sometimes we need to reduce our QC thresholds because of the reduced read depth. I actually can't deduce from the preprocessing log the number of cells in the molecule table at the beginning of each step -- can you provide that information as well? (This can be computed with umi_table["cellBC"].nunique().)
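For instance (a minimal sketch on a toy frame; in practice, run this on the real umi_table after each preprocessing step and note the count):

```python
import pandas as pd

# Toy molecule table with three distinct cell barcodes across four UMIs.
umi_table = pd.DataFrame({
    "cellBC": ["AAAC.1", "AAAC.1", "AAAG.1", "TTTG.1"],
    "UMI": ["ACGT", "ACGA", "TTGC", "GGCA"],
})

# Number of unique cell barcodes currently in the table.
n_cells = umi_table["cellBC"].nunique()
print(n_cells)  # 3
```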

Also, I'm actually quite glad that we're having this conversation on Github because it'll be nice for future users to see. So thank you for your patience and help trouble-shooting this pipeline for GESTALT data! Hope this helps.

oligomyeggo commented 3 years ago

Hi @mattjones315, thanks for all of this! Looks like I was thinking about lineage groups wrong. I was thinking lineage groups corresponded to the different branches/subclones that I might see (like a brain lineage group vs a heart lineage group; something like that). But yes, I think your understanding is correct and that a single zebrafish can be regarded as a single clone. Not to throw more wrenches into this, but for this pilot experiment we ended up pooling several barcoded zebrafish together for sequencing without any way to separate them. This is obviously problematic because we can't discern which barcodes came from which fish, but we were hoping (way back when we planned this, before I had dug into bioinformatic options) that barcode diversity would be such that we were statistically very unlikely to have any barcode collisions. We had assumed that if this trial run was successful we would follow it up with a more in-depth and careful study where we kept track of the zebrafish. Can we treat this data as if it all came from one embryo, at least to see if we can get Cassiopeia working with GESTALT (even if the biological interpretations are incorrect)? At any rate, manually adding a lineageGrp column did fix the issue with filtered_lineage_group_to_allele_table.

Thanks for taking the time to go back and check out that alignment strategy! The algorithmic side of bioinformatics is still a little over my head, so I would not have quickly picked up on that. I reran the preprocessing pipeline using the gap open and extend penalties that you recommended above for the align_sequences function. Doing that, my cell numbers in the umi_table were as follows at each step:

collapse_umis = 29565
resolve_umi_sequence = 1996
align_sequences = 1996
call_alleles = 1996
error_correct_umis = 1996
filter_molecule_table = 172

Looks like the biggest (and only) drops occurred after running resolve_umi_sequence and filter_molecule_table. I am not sure if I have mentioned this previously but our data is MiSeq data, so perhaps I need to reduce the QC thresholds like you mentioned?

mattjones315 commented 3 years ago

Hi @oligomyeggo, thanks for these numbers! And apologies for the delayed response.

To begin, yes - I believe you can treat each embryo as a clone. Unfortunately, I don't think you'll be able to disentangle each embryo from one another if they were pooled together using the call_lineages functionality, but you might be able to identify embryos by the indel information. However, this assumes that a unique indel was introduced in the one-cell stage for each embryo (this might not be reasonable, depending on how effective Cas9 was before the first cell-division). If an indel was not introduced at the one-cell stage, you might risk splitting up embryos into many clones. I definitely think this dataset will be useful for you still and great for debugging any experimental issues, but in the future I highly recommend either running each embryo separately or multiplexing using a tool like MULTI-seq or the Cell Multiplexing kit from 10X.

I'm assuming that these numbers were computed after each step, in which case you're absolutely right that the biggest drops are after resolve_umi_sequence and filter_molecule_table. Because you have pooled together several embryos, I would expect that our intra-doublet detection (which filters out cells that have conflicting allele information for an intBC, perhaps indicating that the cell is a doublet) would cause lots of cells to be filtered out, and this could explain why filter_molecule_table throws out so many cells. I would advise removing this step by setting doublet_threshold = None in cas.pp.filter_molecule_table.

Also, because you're working with MiSeq data you might want to set the min_umi_threshold to be between 2 and 5 instead of 10.

From re-reading the statistics in the scGESTALT paper (Raj et al., Nature Biotech 2018), it appears that you might expect between 6% and 28% of profiled cells to have lineage barcodes. Assuming you have ~8500 cells, I'd say you're shooting for between ~500 and 2,380 cells. If dropping the UMI threshold to 5 and removing the doublet filtering in cas.pp.filter_molecule_table gets you into the upper quantile of this range, I'd say you've successfully processed your data!
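A quick sanity check of that arithmetic (integer math to avoid floating-point noise):

```python
# Expected recovery range: 6-28% of ~8500 profiled cells.
n_cells = 8500
low = 6 * n_cells // 100    # lower bound of the expected range
high = 28 * n_cells // 100  # upper bound of the expected range
print(low, high)  # 510 2380
```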

oligomyeggo commented 3 years ago

Hi @mattjones315, no worries about a delayed response! It was the weekend, and I am not in a super pressing rush.

Yes, I agree 100% that we should somehow separate the embryos before/during sequencing in future experiments (and should have done so in this experiment as well). But that's encouraging to hear that we can still work with this data for now. Thanks!

Yes, those were the cell numbers computed after each step. Following your suggestions I added doublet_threshold = None and min_umi_per_cell = 5 to the filter_molecule_table function. Doing this increased the number of cells in my molecule table to 767, which seems much better and is in that 6-28% window from the Raj et al. paper. Do you think I should try setting min_umi_per_cell to something lower, like 2, and see if that further increases cell numbers?

And just to clarify: once I am happy with the output of filter_molecule_table, I next want to run the code you outlined above in order to get an allele_table. Then I am guessing I would run convert_alleletable_to_character_matrix, and feed the resulting character_matrix into one of the CassiopeiaSolvers (any advice on how to pick the most appropriate algorithm? Maybe the HybridSolver?). And then, after a few post-processing steps, I will have a lineage tree? Sorry if that's not quite right; I am looking through notebooks in both the master and refactor branches to get an idea of the general workflow.

mattjones315 commented 3 years ago

Hi @oligomyeggo -- thanks for your understanding!

That's great to hear that you were able to salvage a few more cells by amending the settings a bit more! If you have the time, I would also try setting min_umi_per_cell in the resolve_umi_sequence step, and trying out a lower UMI threshold would be great too! In future datasets where you have separated embryos from one another, it would be nice to see how consistent your recovery is across embryos, but getting ~10% of cells overall seems to be within the window presented in the Raj et al. paper. Since I'm not an expert on the GESTALT technology, you might want to reach out to the authors to make sure that your recovery is consistent with their expectations.

As for next steps -- yes, that's exactly right! You'll create character matrices (in future experiments, you'll want to create a character matrix for each embryo) and then reconstruct trees using one of our several CassiopeiaSolvers. I would suggest trying out a few solvers -- they'll all run pretty quickly on your dataset (I expect all will finish in less than a day -- some in a few minutes!). I would try the VanillaGreedySolver, HybridSolver, and NeighborJoiningSolver modules first.

We're currently putting together a nice tutorial on reconstructing trees which we'll share as soon as possible. In the meantime, you can refer to our benchmarking notebook in cassiopeia/notebooks/benchmark.ipynb for the general solver interface. I will ping you on this thread when we have completed the reconstruction tutorial notebook.

As you are finishing up your analysis here, I was hoping you might be able to do us a favor. Since you are the first GESTALT user that we've closely worked with, and you have helped us re-parameterize our pipeline for this technology, I was hoping you might help us create a GESTALT-processing config file (like the one found here). If this is of interest, could you fill in the settings you used in your preprocessing and attach the file to this issue so we can distribute it with the codebase?

Thank you so much again in advance, and hope all of this is helpful for you!

oligomyeggo commented 3 years ago

Hi @mattjones315, I took your suggestion and ran through everything one more time, adjusting the min_umi_per_cell option in the resolve_umi_sequence step to also be 5 (the same as the min_umi_per_cell option in filter_molecule_table). This increased my number of cells from 767 to 768, so that didn't do too much. However, if I change both min_umi_per_cell parameters to 3, I get 1084 cells, which is more comfortably in that 6-28% range. I had comparable numbers if I set the resolve_umi_sequence min_umi_per_cell option to 5 and the filter_molecule_table min_umi_per_cell option to 3. If this lineage tree analysis looks promising then I think we'll do a more in-depth analysis and separate the embryos (that's at least what I would recommend!), and I agree that it would be interesting to see how recovery looks across embryos. I will update you if (hopefully when) we do that experiment, in case it's helpful.

Thank you for suggesting a few solvers to try out! If I run into issues running any of those with my GESTALT data, should I continue posting them in this issue to keep all the GESTALT stuff together, or would be better to open a new issue? Happy to do whatever makes the most sense to you!

Great, thanks for the update re: the cassiopeia/notebook/benchmark.ipynb tutorial. I will play around with things and see how far I get for now, but am looking forward to checking the tutorial out.

I've attached a GESTALT-processing config file. Let me know if it looks OK, or if I should tweak anything. All of the parameters match what I used for the most recent preprocessing run (setting min_umi_per_cell to 3 and recovering 1084 cells; this is what I'll work with moving forward, and if I run into any weird issues constructing lineage trees I might revisit the preprocessing and tweak something). I took out the convert, filter_bam, error_correct_cellbcs_to_whitelist, and error_correct_intbcs_to_whitelist steps since I didn't do those. I also took out the call_lineages step since it's not applicable to the GESTALT system (no intBCs). I am using lineage_utils.filtered_lineage_group_to_allele_table and utilities.filter_cells instead, but am not sure if they should be included in the config file (so I didn't include them, but can send you an updated version that does include them if you prefer).

If there is anything else that I can do on my end to help Cassiopeia process GESTALT data better, please let me know! I am happy to help in any way I can.

preprocess_gestalt.cfg.txt

mattjones315 commented 3 years ago

Hi @oligomyeggo, that's great that you were able to recover another 300 cells! Please do keep me updated on how these parameters work on individual embryos. I imagine other users will also find this helpful -- thanks again for your willingness to help adapt Cassiopeia here!

Regarding any reconstruction problems you encounter: I would suggest opening up an entirely new issue. This will help us, and future users, sift through problems others have encountered.

Thanks so much for putting together this config file for us! This is super helpful and we'll add it to the codebase for others. I'm realizing that there's one more parameter that could be causing you to lose lots of cells -- in cas.pp.filter_molecule_table, you're setting intbc_umi_thresh=10. This will throw out intBCs that have fewer than 10 UMIs associated with them. Since you have only one "intBC" per cell (or TargetSite), this means you're essentially filtering out cells with fewer than 10 UMIs (which is not intentional). I think for the GESTALT system, we should set this parameter to whatever the min_umi_per_cell parameter is -- this might get you even more cells! Other than that, this config looks great.
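To make the point concrete: since a GESTALT cell carries a single "intBC" (its TargetSite), an intBC-level UMI threshold collapses into a cell-level UMI threshold. Here's a minimal pure-Python sketch of that equivalence -- this is a toy illustration of the filtering logic, not Cassiopeia's actual implementation, and the cell/site names are made up:

```python
from collections import Counter

# Toy molecule table: one record per UMI, as (cellBC, intBC) pairs.
# In GESTALT every cell has a single intBC, so thresholding on UMIs
# per intBC is the same as thresholding on UMIs per cell.
molecules = [
    ("cellA", "site1"), ("cellA", "site1"), ("cellA", "site1"),  # 3 UMIs
    ("cellB", "site1"),                                          # 1 UMI
]

def filter_by_intbc_umis(mols, intbc_umi_thresh):
    """Drop (cellBC, intBC) pairs supported by fewer than intbc_umi_thresh UMIs."""
    counts = Counter(mols)
    return [m for m in mols if counts[m] >= intbc_umi_thresh]

# With a threshold of 3, cellB loses its only intBC -- i.e., the whole cell.
kept_cells = sorted({cell for cell, _ in filter_by_intbc_umis(molecules, 3)})
print(kept_cells)  # ['cellA']
```

This is why setting intbc_umi_thresh above min_umi_per_cell silently tightens the per-cell filter for GESTALT data, and why keeping the two equal is the safer choice.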

One more word of caution -- keep an eye on the allele identities for each cut-site (r1-r10). If you're seeing lots of missing data, this probably means that the alignment strategy is not working well. We can continue to modify the parameters for the alignment, but I will also work on implementing a global alignment strategy that might be useful for GESTALT data.

If all of this seems to address your questions about the pre-processing pipeline, let me know and I'll close this issue. Thanks again!

oligomyeggo commented 3 years ago

Hi @mattjones315, I set the intbc_umi_thresh to be the same value (3) that I used for the min_umi_per_cell parameters. In this particular instance it did not change my cell numbers, but I agree that it makes sense to set the intbc_umi_thresh parameter to the same value as the min_umi_per_cell parameter for GESTALT data.

To clarify on the allele identities issue, how many missing allele identities for each cut site (r1-r10) would you consider to be problematic/raise a red flag? I'm guessing a missing identity here or there for scattered cells is fine/even expected. Is there a ballpark percentage that would make you question whether or not the alignment worked?

I have just one more question on the pre-processing pipeline before closing the issue. After I convert my umi_table to an allele_table, when I use utilities.filter_cells I drop down from 1084 cells to 628 cells, even though I am using the same parameters that I used in upstream functions:

from cassiopeia.preprocess import utilities
allele_table = utilities.filter_cells(allele_table,
                                      min_umi_per_cell = 3,
                                      min_avg_reads_per_umi = 2,
                                      verbose = True)

Any idea why this additional filtering step is removing more cells even though the data should already have been filtered (such as in the resolve step) using min_umi_per_cell = 3 and min_avg_reads_per_umi = 2? Should I perhaps not include this additional and final filtering step?

mattjones315 commented 3 years ago

Hi @oligomyeggo -- great, good to know intbc_umi_thresh doesn't change anything for you, though it's probably good to keep it consistent with the config. I'll make that change before pushing it to our GitHub.

Regarding missingness -- I would imagine you wouldn't want any entries in r1-r10 to be missing data. This is because GESTALT puts all the cut-sites onto the same transcript, so if you capture one cut-site, you should capture them all. Our missing data is a bit different because we can identify which transcript we did not observe in a cell -- in the GESTALT technology, if you don't capture the transcript, you just won't have any barcode information for that cell. That's at least my understanding. Thus, if you're seeing a lot of missing data, it could be because the alignment strategy is still not tuned for GESTALT data types. I would expect you to be able to get away with it for now (perhaps as long as the amount of missing data is <~30%), but we'll have to think about implementing the global alignment strategy described in the original GESTALT paper if we can't find parameters in Cassiopeia that eliminate missing data.
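A quick way to keep an eye on this is to compute the fraction of empty entries per cut-site column. Here's a minimal pure-Python sketch -- the column names follow the r1-r10 convention from this thread, but the toy rows, the empty-string-means-missing convention, and the "None"-means-uncut convention are assumptions for illustration, not Cassiopeia's guaranteed output format:

```python
# Toy allele table as a list of dicts: one row per cell, a few cut-site
# columns (a real GESTALT table would have r1..r10). Assume "" marks a
# cut-site with no allele call (missing data), while "None" marks an
# observed-but-unedited site.
allele_rows = [
    {"cellBC": "cellA", "r1": "5D+10", "r2": "",     "r3": "3I+20"},
    {"cellBC": "cellB", "r1": "5D+10", "r2": "None", "r3": ""},
]

def missing_fraction(rows, sites):
    """Fraction of rows with an empty entry, per cut-site column."""
    n = len(rows)
    return {s: sum(1 for r in rows if r[s] == "") / n for s in sites}

fractions = missing_fraction(allele_rows, ["r1", "r2", "r3"])
print(fractions)  # {'r1': 0.0, 'r2': 0.5, 'r3': 0.5}
```

If any of these per-site fractions creeps toward the ~30% ballpark mentioned above, that would be the signal to revisit the alignment parameters.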

Finally, regarding the last filtering step -- I am not sure why this is happening. Because there might be a bug here (the way we report UMI numbers differs between umi_tables and allele_tables), can you plot the distribution of UMI counts per cell of the allele_table before you pass it into that last filtering step?

P.S. Since we were talking about this yesterday, I wanted to follow up and let you know that we've created a reconstruction tutorial notebook. This notebook, along with the other notebooks, can be found on our documentation website here. Please let us know if there's anything unclear!

oligomyeggo commented 3 years ago

Hi @mattjones315, re: the missingness. When you say I wouldn't want any entries in r1-r10 to be missing data, does this mean an entire empty row for a cell in the allele_table or does this mean blank entries for any cut-site? For instance, if I have a cell that has sequence info for 8 cut-sites but has nothing for 2 cut-sites (as in a completely blank cell in the allele_table), am I interpreting those latter 2 cut-sites as missing data or something that was completely cut out by CRISPR as part of the GESTALT system? I am guessing they would correspond to missing data, but I just want to make sure I am understanding you correctly and interpreting my data in the right way.

With regards to that final filtering step, I've attached a scatter plot (actually two, one showing the whole distribution and one zoomed into the lowest UMI counts) showing the distribution of UMI counts/cell before passing it into the last filtering step. It looks like there are a ton of cells below my min_umi_per_cell threshold of 3, so I believe that that last filtering step is working correctly and some filtering step upstream is not, unless I am misinterpreting something?

Thanks again for all your help going through this! And wow, that was fast getting that reconstruction tutorial notebook together. I'm looking forward to working through that next.

umi_counts_per_cell_dist_zoom umi_counts_per_cell_dist

mattjones315 commented 3 years ago

Hey @oligomyeggo,

To clarify what I mean by missing data: I think we're on the same page, and this is not necessarily the same as a barcode not appearing for a cellBC you observed in your scRNA-seq data. The example you gave about the cell with 8 cut-sites observed and 2 not observed is exactly what I mean. These sites will show up as empty entries in your allele table. My concern is that these two empty sites might be an artifact of the local alignment strategy we have implemented and shown to work for our data, but not GESTALT data. Keeping an eye on this in your data will be helpful -- you can probably get away with it for a little while, but you might consider checking back later once we've offered a global alignment strategy in Cassiopeia.

I'm having trouble reading your plots -- could you plot the histogram of UMIs/cell with

import matplotlib.pyplot as plt

# Sum UMI counts per cell barcode and plot a log-scaled histogram
allele_table.groupby('cellBC').agg({"UMI": "sum"}).plot(kind='hist')
plt.yscale('log')
plt.xlim(0, 50)

This will help us see if we have any cells at this point that are below your 3 UMI threshold.

oligomyeggo commented 3 years ago

Hi @mattjones315, perfect, thanks so much for clarifying the missing data. We are on the same page. I will dig into my allele_table in more detail sometime soon and let you know if I have a large percentage of missing data or not to help with the development of a global alignment strategy.

And sorry about the plots; please find attached the plot using your code: umi_counts_per_cell_hist

mattjones315 commented 3 years ago

Hi @oligomyeggo -- yes, it definitely looks like you have some cellBCs with fewer than 3 UMIs in your allele_table, so the last filter_cells call is working correctly.

I think the reason why you are observing this is because the last step of filter_molecule_table is to uniquely map TargetSites to an allele. If there is any disagreement (i.e., a given TargetSite appeared on several UMIs but with a different allele each time), then Cassiopeia will take the allele best represented and filter out all the TargetSite UMIs corresponding to disagreeing alleles. This is probably why some cells coming out of the filter_molecule_table step have fewer than 3 UMIs. If this is the case, then I think everything is behaving as expected.
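The disagreement-resolution step described above can be sketched in a few lines of pure Python -- this is a toy illustration of the majority-vote logic, not Cassiopeia's actual filter_molecule_table code, and the site/allele names are made up:

```python
from collections import Counter, defaultdict

# Toy UMI table for one cell: each UMI reports an allele for its TargetSite.
# siteA has conflicting calls (2 UMIs say allele1, 1 UMI says allele2).
umis = [
    ("siteA", "allele1"), ("siteA", "allele1"), ("siteA", "allele2"),
    ("siteB", "allele3"),
]

def resolve_alleles(umi_records):
    """Keep, per TargetSite, only the UMIs supporting the best-represented allele."""
    counts = defaultdict(Counter)
    for site, allele in umi_records:
        counts[site][allele] += 1
    best = {site: c.most_common(1)[0][0] for site, c in counts.items()}
    return [(s, a) for s, a in umi_records if a == best[s]]

resolved = resolve_alleles(umis)
print(len(resolved))  # 3 -- the disagreeing siteA UMI was dropped
```

Because disagreeing UMIs are discarded after the per-cell UMI threshold has already been applied, a cell that entered this step with exactly min_umi_per_cell UMIs can leave it with fewer -- which would explain the cells the final filter_cells call removes.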

oligomyeggo commented 3 years ago

Hi @mattjones315, sorry for the delayed response. Your explanation makes sense to me. Would you still recommend leaving in the additional and final utilities.filter_cells call and moving forward with 628 cells (which, while on the lower end, is still in that 6-28% recovery window)? Or is it sufficient/acceptable to leave off the final filtering step and move forward with the 1084 cells after running filter_molecule_table?

mattjones315 commented 3 years ago

Hi @oligomyeggo,

Actually, I would expect that by this stage most of the artifacts have been taken care of, and you might be okay proceeding with the 1084 cells before running that last filter_cells step. However, since you are testing a lot of things here, I would suggest also proceeding independently with the filtered dataset of 628 cells. I would keep an eye out to see if those extra cells in the larger dataset are placed oddly on the tree or appear in the RNA-seq dataset at all.

Let us know how it goes, and hope this helps!

Also -- just to let you know -- we have officially merged in all our code updates into the master branch on Cassiopeia. The code you're using is definitely stable, but keep an eye out as we add new functionalities for downstream analysis after tree reconstruction. As always please don't hesitate to create a new issue if you run into any trouble, have a suggestion, or have a question!

oligomyeggo commented 3 years ago

Awesome, thanks so much for all of your help @mattjones315! I appreciate it. I'll go ahead and close this issue, and open a new one if I run into issues at the tree reconstruction stage.