YosefLab / Cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction
https://cassiopeia-lineage.readthedocs.io/en/latest/
MIT License

Missing data #136

Closed · oligomyeggo closed this issue 3 years ago

oligomyeggo commented 3 years ago

Hi again @mattjones315, I wanted to follow up on the missing data issue that we discussed in issues #110 and #126. I went through the allele_table generated from my GESTALT data and counted up the number of instances I had a "Missing" entry/missing allele identity for each of my cut sites (r1, r2, r3, ... r10 for GESTALT). For cut sites r1-r8 there were no "Missing" entries. For cut site r9, ~24% of the entries were "Missing", and for cut site r10, ~36% of the entries were "Missing". I know you mentioned here that I can probably get away with missing data for now so long as it is less than ~30%, which is the case for r9 but not for r10. It seems interesting to me that the first 8 cut sites have no issues, and that missing data only increases at the end of my barcode array. Does this mean that in general the alignment is probably working, or do you think that we'll need to implement the global-alignment strategy described in the original GESTALT paper? Thanks!

mattjones315 commented 3 years ago

Hi @oligomyeggo ,

Apologies for not being prompt in my response here! We noticed some other issues with missing data, especially when sequencing quality was low, and have been trying to come up with a best practice for handling it (see #139).

The short answer is that we think that these are technical artifacts that should not be in the dataset. I think you might be losing meaningful information in r9 and r10 because of the local alignment strategy. We've made it a priority to try and fix this with a global alignment option but this is a bit tricky for some technical reasons (e.g., having to trim off the poly-A tail before global alignment). But in the meantime, we'd rather throw out cells with the missing information -- or, in your case, maybe you can just drop sites r9 and r10.

I would be curious to know if this is due to a sequencing quality issue. One way to assess this is to plot the indel heatmaps on trees (using, for example, cas.pl.upload_and_export_itol) -- does the data look more "noisy" as you go from r1 to r10?

Thanks and hope this helps. Sorry we don't have a solid fix for this yet!

oligomyeggo commented 3 years ago

Hi @mattjones315, no worries at all! I never expect you (or any dev) to be able to respond to all my questions right away. I really just appreciate that you do respond, and that you've been so willing to help me process my GESTALT data, so thank you!

I also think it's awesome that you are working on a global alignment option! That will be so helpful; I love your pipeline so far, so if it can be optimized to work with my kind of data that would be amazing. In the meantime, to drop sites r9 and r10, I'm guessing that would entail re-preprocessing the data and excluding those cutsites from the cassiopeia.pp.call_alleles function? Would I want to cut them out of my reference as well?

I looked at the indel heatmaps using the following code:

cassiopeia.pl.upload_and_export_itol(cas_tree,
                                     tree_name = "vanilla_greedy_tree_1084_missing_filt", 
                                     export_filepath="vanilla_greedy_tree_with_indel_heatmaps.png",
                                     allele_table = allele_table)

I don't necessarily think the data looks more noisy as it goes from r1-r10 (it kind of looks noisy overall), though I think maybe something funny happened with my plot? It goes from allele0 to allele19 (I think) for some reason, even though I only have cutsites r1-r10, so I am not sure how to interpret that. I've attached the png for you to see. Most cells stop at allele9 (I am guessing that what I call r1-r10 should correspond to iTOL's allele0-allele9?), and there are 7 cells that for some reason include allele10-allele19 information.

I wonder if some of the noise could potentially be due to the fact that I have combined multiple embryos but am treating them as one?

vanilla_greedy_tree_with_indel_heatmaps

mattjones315 commented 3 years ago

Hi @oligomyeggo ,

Thanks so much for your understanding! You'll be happy to know that we've begun work on the global alignment option -- @Lioscro, another dev on the Cassiopeia team, took the initiative to start today! The PR is looking good and should be merged next week.

As for reconstructing trees with the reduced set (without sites 9 and 10), you actually don't need to do any pre-processing -- just drop these two columns from the character matrix before reconstruction.

As for the extra sites in your tree plot, my best guess is that you actually have two different "intBCs" called by Cassiopeia -- what does allele_table['intBC'].unique() yield? If you have two unique intBCs, this is probably because of variation in the first base pair of your sequencing reads (I imagine this is rare). Since you expect only one target site cassette per cell, you can fix this issue by setting allele_table['intBC'] = 'GESTALT' or some other arbitrary name. Then, reprocess this allele table into a character matrix and proceed as you did before.
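For concreteness, a minimal pandas sketch of that fix (the toy table below stands in for a real allele_table; only the 'intBC' column matters here):

```python
import pandas as pd

# toy allele table standing in for the real one (column names assumed)
allele_table = pd.DataFrame({
    "cellBC": ["cell1", "cell2", "cell3"],
    "intBC": ["A", "A", "Missing"],
})

# check how many distinct intBCs were called
print(allele_table["intBC"].unique())  # ['A' 'Missing']

# collapse them to a single arbitrary name, since we expect
# only one target-site cassette per cell
allele_table["intBC"] = "GESTALT"
```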

Hope this helps!

oligomyeggo commented 3 years ago

Hi @mattjones315,

Oh, wow! The PR for the global alignment option is being implemented way faster than I thought, which is awesome! Thank you and @Lioscro! I look forward to checking that out, and hopefully it helps with the missing data I'm seeing in my r9 and r10 cutsites.

You are correct that I do have two different 'intBCs' in my allele_table. I am not sure how that happened, but most of my intBCs are 'A', while a few are assigned the value 'Missing'. I'll change all of these to something like 'GESTALT' per your recommendation, as well as drop cutsites r9 and r10, and see how things look after that. Thank you!

In your opinion, do you think that my allele heatmaps looked particularly noisy and might indicate a sequencing quality issue, or do they look alright generally (it might be hard to interpret since it's a combination of 15 embryos masquerading as one clonal population)? We didn't have the most amazing sequencing quality for this particular run (58% >= Q30), but figured that might have something to do with the polyT/A stretch, or might just be the last few cycles, which tend to be lower quality, and decided it probably wasn't a big issue.

mattjones315 commented 3 years ago

Hi @oligomyeggo ,

No problem! Yes, I should really be thanking @Lioscro who took the initiative. We are finishing this up today I believe. One caveat, which we'd like to see how important it is on real data, is that we have not implemented a poly-A trimming step before global alignment. This can be a problem if your read contains a long poly-A stretch as this algorithm will try to align that nucleotide stretch to the reference. We used to handle this by first passing our sequencing data through a tool like cutadapt to remove these parts of the read, but we haven't implemented this in the codebase yet.

I'm glad we found the issue causing those "ghost" cassettes! That should clean things up.

Regarding how noisy the data looks - in my opinion, there are a few subclades (~5) that are clearly distinguished by unique indels, which is nice to see. These likely correspond to individual embryos. However, the fact that you had 15 embryos pooled together means that the remaining ~10 are pretty hard to discern. The good news is that it doesn't seem to be the case that the data gets noisier towards the end of the transcript - at least from what I can see here.

I think that between the global alignment strategy, and making sure that each intBC is referenced appropriately (i.e., named something like "GESTALT" like we said) will clean this up a bit. It will also be easier to discuss how noisy the data is when you can demultiplex the data into individual embryos. But like I said previously, I think this is a great dataset to get comfortable with the pipeline and you can likely still see a lot of structure by overlaying this lineage data onto the integrated RNA-seq data of these embryos!

oligomyeggo commented 3 years ago

Hi @mattjones315,

Great! This is all awesome to hear. Yeah, once I clean up this dataset a little more I was hoping I might be able to tease out at least a few individual embryos. I 100% agree (for any other GESTALT users that might eventually read through these issues) that having some way to demultiplex the data into individual embryos is absolutely key when it comes to extracting meaningful biological data. That said, I am definitely happy that the Cassiopeia pipeline has been able to work with my GESTALT data (and should get even better here soon!). And yes! I am hoping to somehow overlay the lineage data with our integrated scRNA-seq data and gain further insights. It's already been very cool taking the cluster IDs from our scRNA-seq data and seeing which cells those correspond to on our tree!

I look forward to trying out the new global alignment strategy, and am happy to share how it's working with my real GESTALT data if that helps further refine things. I noticed that some of the provided workflows start with fastq files, though the one I followed started with the .bam file generated by running Cellranger (I used Cellranger to process our MiSeq data). I am not sure if in the future it would make more sense for me to also start the Cassiopeia pipeline with my fastqs instead of the Cellranger-generated .bam file, but I am also happy to try that out!

mattjones315 commented 3 years ago

Hi @oligomyeggo,

Fantastic! Thanks for leaving these notes for future users.

Regarding where to start the pipeline - starting from the possorted_genome_bam.bam from cellranger is just fine. We implemented the extra functionalities to remove the dependency on cellranger (since some people might be coming from different technologies, etc) but the entire pipeline should play nicely with the .bam files you're used to working with.

Let's leave this issue open till you have a chance to try out the fixes we proposed above, as long as you don't mind reporting back how they worked out. Thanks in advance and best of luck!

oligomyeggo commented 3 years ago

Hi @mattjones315, I am trying out renaming my intBCs to 'GESTALT', dropping r9 and r10 from my allele_table, and regenerating a character matrix, and am running into a few issues. Namely, when I run the following code:

cell_meta = allele_table.groupby('cellBC').agg({"intBC": 'nunique', 'UMI': 'sum', 'Cluster': 'unique'})
cell_meta['Cluster'] = [x[0] for x in cell_meta['Cluster']]

missing_proportion = (character_matrix == -1).sum(axis=0) / character_matrix.shape[0]
uncut_proportion = (character_matrix == 0).sum(axis=0) / character_matrix.shape[0]
n_unique_states = character_matrix.apply(lambda x: len(np.unique(x[(x != 0) & (x != -1)])), axis=0)

character_meta = pd.DataFrame([missing_proportion, uncut_proportion, n_unique_states], index = ['missing_prop', 'uncut_prop', 'n_unique_states']).T

cas_tree.cell_meta = cell_meta
cas_tree.character_meta = character_meta

I receive the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-c16637991b20> in <module>
      5 missing_proportion = (character_matrix == -1).sum(axis=0) / character_matrix.shape[0]
      6 uncut_proportion = (character_matrix == 0).sum(axis=0) / character_matrix.shape[0]
----> 7 n_unique_states = character_matrix.apply(lambda x: len(np.unique(x[(x != 0) & (x != -1)])), axis=0)
      8 
      9 character_meta = pd.DataFrame([missing_proportion, uncut_proportion, n_unique_states], index = ['missing_prop', 'uncut_prop', 'n_unique_states']).T

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-35-c16637991b20> in <lambda>(x)
      5 missing_proportion = (character_matrix == -1).sum(axis=0) / character_matrix.shape[0]
      6 uncut_proportion = (character_matrix == 0).sum(axis=0) / character_matrix.shape[0]
----> 7 n_unique_states = character_matrix.apply(lambda x: len(np.unique(x[(x != 0) & (x != -1)])), axis=0)
      8 
      9 character_meta = pd.DataFrame([missing_proportion, uncut_proportion, n_unique_states], index = ['missing_prop', 'uncut_prop', 'n_unique_states']).T

<__array_function__ internals> in unique(*args, **kwargs)

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/.pyenv/versions/3.7.4/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'int' and 'tuple'

Not totally sure why. Again, all I did was change intBC to something consistent ('GESTALT') and drop the r9 and r10 cutsite columns.
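(For anyone hitting the same traceback: the error itself comes from np.unique, which sorts its input and fails if a column mixes plain integers with tuples. A minimal repro, unrelated to Cassiopeia:)

```python
import numpy as np

# build an object array holding an int and a tuple, the kind of
# mixed column that makes np.unique's internal sort blow up
arr = np.empty(2, dtype=object)
arr[0] = 1
arr[1] = (2, 3)

try:
    np.unique(arr)
except TypeError as e:
    # comparing an int to a tuple is not defined in Python,
    # which is exactly the TypeError in the traceback above
    print(e)
```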

mattjones315 commented 3 years ago

Hi @oligomyeggo ,

Thanks for the comment here - but this seems to be an issue with the manipulation of the dataframe, not Cassiopeia. Can you confirm that the character matrix you get out of cas.pp.convert_alleletable_to_character_matrix works with the allele_table you updated to have the same set of intBCs? If so, then I am guessing that the issue is coming from the way you removed the last two columns of the character matrix, r9 and r10.

oligomyeggo commented 3 years ago

Hi @mattjones315, I misread your suggestion at first and dropped the r9 and r10 columns from my allele_table. I just went back and restored my original allele_table, set intBCs to GESTALT, and tried creating a new character matrix from that. Running cas.pp.convert_alleletable_to_character_matrix with my modified allele_table had no issue. I then dropped the r9 and r10 columns from the resulting character_matrix, but then I got the same error as above when trying to add cell and character meta data.

I did notice that the character_matrix definitely looks different now as compared to when I didn't set the intBCs to all be the same. It currently just has columns corresponding to my cutsites. When I had the two sets of intBCs, I had 10 additional columns of cutsites (r11-r20) which were populated with -1 (indicative of the missing data, I believe). I imagine this is expected, but am wondering if it might somehow be part of the issue?

I've attached a toy example of my modified (i.e., updating to the same set of intBCs) allele_table and the resulting character_matrix after running:

character_matrix, priors, state2indel = cs.pp.convert_alleletable_to_character_matrix(allele_table,
                                                                                      allele_rep_thresh = 0.9,
                                                                                      mutation_priors = None)

I tried adding the cell/character meta data on this character_matrix without removing r9 and r10 and still had the same error, so to me it seems like something changed when I renamed the intBCs but I'm not sure what that could be. Hopefully it's something simple that I just completely overlooked.

toy_allele_table.csv toy_character_matrix.csv

EDIT: Following up on this, when I don't update my allele_table to have the same set of intBCs, I am able to delete cutsites r9-r20 from my character_matrix and all the downstream code runs fine. I am wondering, since I believe I only had only ~7 or so rows in my allele_table that had an intBC of "Missing" as opposed to "A", is there any reason why I should or should not just delete the handful of "Missing" rows and see if that fixes things?
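(In case it's useful, dropping those rows is a one-liner boolean filter -- a sketch with a toy allele_table, assuming 'intBC' is the column name as above:)

```python
import pandas as pd

# toy allele table; in the real data ~7 of 820 rows had intBC == "Missing"
allele_table = pd.DataFrame({
    "cellBC": ["c1", "c2", "c2", "c3"],
    "intBC": ["A", "A", "Missing", "A"],
})

# keep only rows with a real intBC call
allele_table = allele_table[allele_table["intBC"] != "Missing"].copy()
print(len(allele_table))  # 3
```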

mattjones315 commented 3 years ago

Hi @oligomyeggo,

I'm not sure what the problem is, as I can run the code without an error using the toy_allele_table.csv you provided.

Specifically, this is what I did after reading in the allele table (which I can see has all intBCs set to GESTALT as we discussed):

allele_table = pd.read_csv("toy_allele_table.csv", sep=',')
character_matrix, _, _ = cas.pp.convert_alleletable_to_character_matrix(allele_table,
                                                                      allele_rep_thresh = 0.9,
                                                                      mutation_priors = None)
character_matrix.drop(columns=['r9', 'r10'], inplace=True)

The code cell that you tried to run that was giving you problems originally works fine if I create and filter the character matrix as above. As you said, using this logic we only have columns corresponding to sites r1-r8 (after filtering out r9 and r10) and the character matrix is well-formed by my eye and it can be used as expected.

I don't see any reason why changing all the intBCs to "GESTALT" would hurt anything, so I'm wondering if this has to do with how you were filtering the character matrix?

Thanks in advance for checking this out.

oligomyeggo commented 3 years ago

Hi @mattjones315 , I just tried deleting the rows from my allele_table that had an intBC of "Missing" (7 out of 820 rows, leaving me with 813). The remaining rows all have the same intBC. This seemed to have "fixed" things in that I am now able to run all the downstream code with the resulting character_matrix based on this allele_table.

I think my toy data sets probably worked for you because I wasn't grabbing any of the offending rows where the intBC was originally labeled as "Missing" (my bad!). I was filtering my character_matrix differently (I just said del character_matrix["r9"] and del character_matrix["r10"]), but even when I copied the code you provided above on my full data set I still got the same error. So I think there is something funky going on with those intBCs originally labeled as "Missing". It looks like they are duplicates of cells that have another entry with an intBC of A (though I don't know if all the indels are duplicates; on first glance it looks like the cellBCs are duplicates of other rows in the dataframe).

mattjones315 commented 3 years ago

Hi @oligomyeggo - I see, so it seems something more insidious was going on with those Missing intBCs. Judging from your observation that some cells report both 'Missing' and 'A', the problem was likely duplicate cellBCs in your allele_table that were never collapsed once the intBCs were made identical. This should theoretically be taken care of in the call_lineages step, but since we were modifying a processed allele_table after the fact, we may have created some inconsistencies. Good to know! I wonder if passing your modified allele_table through cas.pp.lineage_utils.filtered_lineage_group_to_allele_table one more time would fix all of this.

In any event, I'm glad this fixed your problems. I'm hoping that using the global alignment strategy will preemptively fix these issues down the road. For now, glad things are working well for you!

oligomyeggo commented 3 years ago

Yes, I think that's what is going on! I have 7 cellBCs that are duplicated and report both 'Missing' and 'A'. Only one of the duplication pairs has the same allele/indel info (but different UMIs/readCounts) - all the other duplication pairs have different everything. I tried passing my modified allele_table back through cas.pp.lineage_utils.filtered_lineage_group_to_allele_table, but it didn't do anything. Would you recommend just dropping these cells from the allele_table since they have conflicting information? I've attached a csv of the duplicated cellBCs - just in case it's helpful for you to look at. allele_table_dupes.csv

mattjones315 commented 3 years ago

Hi @oligomyeggo,

Yeah, it appears that the actual alleles are different between the two types of intBCs, so I think it would not be ideal to pass the table back through cas.pp.lineage_utils.filtered_lineage_group_to_allele_table after all. Perhaps the best solution for right now is to filter those cells out, and hopefully the global alignment strategy fixes it.
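A sketch of that filter -- dropping every cell that reports more than one intBC (toy data; column names as in the allele tables above):

```python
import pandas as pd

# toy allele table: c2 reports two conflicting intBCs
allele_table = pd.DataFrame({
    "cellBC": ["c1", "c2", "c2", "c3"],
    "intBC": ["A", "A", "Missing", "A"],
})

# cells whose rows disagree on the intBC call
n_intbcs = allele_table.groupby("cellBC")["intBC"].nunique()
conflicted = n_intbcs[n_intbcs > 1].index

# drop those cells entirely, since their alleles conflict
allele_table = allele_table[~allele_table["cellBC"].isin(conflicted)].copy()
print(sorted(allele_table["cellBC"].unique()))  # ['c1', 'c3']
```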

oligomyeggo commented 3 years ago

Great, thanks @mattjones315. That sounds good to me. Can I ask one more question before we close this issue, as it relates to the indel heatmap? I just want to make sure that I am interpreting it correctly.

Based on your paper, gray means that a cut site was unedited, and a color means that there was some sort of edit (the colors are random, and don't correspond to any particular type of edit like an insertion or deletion), correct? And I am guessing that for each cut site, different colors mean different edits, and the same color means that cells share the same edit at that particular cut site.

So in this image, the cells in group A have the same indels/edits at every cut site except the first one (the innermost ring). Likewise, the cells in group B largely share the same indels/edits at every cut site. However, cells in group C seem to have all kinds of different indels/edits going on. So while I could say that cells in groups A and B are probably related and derived from the same embryo (i.e., group A comes from some lineage in embryo A and group B comes from some lineage in embryo B), I can't really make any conclusions about cells in group C. The "confetti" appearance in the heatmap for group C is most likely due to the fact that there are 15 embryos combined in this data set; since I only recovered barcodes for about ~10% of transcriptionally profiled cells, it makes sense that a good chunk of cells would belong to "lost" lineages that aren't represented/captured in the data/heatmap. Does this sound right, or am I a little off-base?

vanilla_greedy_tree

mattjones315 commented 3 years ago

Hi @oligomyeggo ,

Happy to help here!

Yes, your understanding is generally correct. Each unique color is a unique indel, grey represents an uncut site, and white typically represents missing data (which I believe you don't have, given the way you analyzed your data).

Both groups A and B indeed have a large number of sites that share the same indel pattern. It's nice to see that much information at the indel level supporting the tree structure!

Group C is a heterogeneous set of states. It appears that there are no indels shared by the cells in this group. Indeed, this could be because you've pooled together several embryos that are unrelated to one another, each of which is sampled at a different rate. And you're additionally correct, I think, that you might see this even within a single embryo because you're sampling only a fraction of cells in each lineage -- this can create the same type of noise (we've seen it in our cancer datasets too). These two reasons are expected, but a third could be noise in the alignment (a recurring theme). I would check out the indels in these cells and see how far off they are -- you might be able to recover relationships if this noise is coming from a technical problem like the alignment strategy!

Hope this helps, and thanks for keeping us in the loop!

oligomyeggo commented 3 years ago

Excellent, thanks so much for clarifying all of that @mattjones315!! I'll go ahead and close this issue, and will check back in with a new issue once I have had a chance to use the new global alignment strategy (I see it was merged the other day!).

mattjones315 commented 3 years ago

Thanks @oligomyeggo!

Good news -- we actually merged in the global alignment strategy earlier this week with #142! If you pull & reinstall, it should be ready to use.

oligomyeggo commented 3 years ago

Yes, I just saw that and edited my comment, but was a little too slow haha. Thanks so much! Looking forward to seeing if it helps clean up my data any!