Pull in more info per tcr and bcr clonotype

dtm2451 commented 5 months ago

Addresses #68

Currently this only updates the load_clonotypes util function. So it adds a grab of additional columns but does not change what is done with the data afterwards (i.e., it all just gets shoved into the "processed" Seurat object metadata).

Exact desired behavior is still TBD, but there are currently a few notable differences compared to previous/current behavior:

the previous "cdr3s_aa" column, from clonotypes.csv, is named "cdr3" in all_contig_annotations.csv and is not currently renamed to match the previous name
the data elements of the previous "cdr3s_aa" column were also prefixed with "TRA:" / "TRB:", but I don't recapitulate that here currently, as the info can be inferred from any of the newly grabbed "v/d/j/c_gene" columns.
(also, the "clonotype_id" column is no longer output, as it does not seem useful!)
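
A rough sketch of the column grab described above, in Python purely for illustration (the actual load_clonotypes util is not shown in this thread). The `KEEP_COLUMNS` list and the `load_contigs` helper are hypothetical, the toy rows are made up, and the optional rename of "cdr3" back to "cdr3s_aa" is just one possible way to resolve the first difference listed above.

```python
import csv
import io

# Hypothetical subset of all_contig_annotations.csv columns to keep;
# this is NOT the PR's actual column list.
KEEP_COLUMNS = [
    "barcode", "chain", "v_gene", "d_gene", "j_gene", "c_gene",
    "cdr3", "reads", "umis",
]


def load_contigs(handle, rename_cdr3=False):
    """Read contig rows from a CSV handle, keeping only KEEP_COLUMNS.

    If rename_cdr3 is True, the "cdr3" column is renamed to the older
    "cdr3s_aa" name used by clonotypes.csv.
    """
    rows = []
    for record in csv.DictReader(handle):
        row = {col: record.get(col, "") for col in KEEP_COLUMNS}
        if rename_cdr3:
            row["cdr3s_aa"] = row.pop("cdr3")
        rows.append(row)
    return rows


# Toy input with made-up values, standing in for all_contig_annotations.csv.
TOY_CSV = """barcode,is_cell,chain,v_gene,d_gene,j_gene,c_gene,cdr3,reads,umis
AAAC-1,true,TRA,TRAV1-2,,TRAJ33,TRAC,CAVMDSNYQLIW,120,5
AAAC-1,true,TRB,TRBV6-1,TRBD1,TRBJ2-1,TRBC2,CASSEGGNEQFF,300,9
"""

contigs = load_contigs(io.StringIO(TOY_CSV), rename_cdr3=True)
print(contigs[0])
```

Keeping the column list in one place would also make it easy to later split it into a small set for the Seurat metadata and a fuller set for a side file, if that is the behavior we settle on.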

ToDo:

[ ] Finalize desired behaviors
[ ] Update around those
[ ] Remove 'clonotype_path' input and determination / channeling of that file from pipeline code

erflynn commented 5 months ago

this is great! one thought -- this adds a lot of columns to the metadata? I don't know how we feel abt this, I wonder if we want to put some of this in misc() or something?

Do we want to include the d.gene? It's empty for most but not all of the data (for the library I was looking at -- 75% BCR had no d gene, 82% of TCR had no d gene), and where it is present it's frequently partial. Also @AlaaALatif mentioned he did not use it. For the FWR -- I think this could be a flag? e.g. could include or could skip... Not sure, or we could just lean toward including everything and folks could remove later.
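
For anyone wanting to repeat the missing-d-gene check on their own library, a minimal sketch in Python follows. The `fraction_missing` helper is hypothetical, the toy rows are made up (they are not the library mentioned above), and treating both an empty string and a literal "None" as missing is an assumption about how the file encodes absent genes.

```python
def fraction_missing(rows, column, missing=("", "None")):
    """Fraction of rows whose `column` is empty or a missing-value marker."""
    if not rows:
        return 0.0
    n_missing = sum(1 for row in rows if (row.get(column) or "") in missing)
    return n_missing / len(rows)


# Made-up contig rows; only the columns needed for the check are included.
toy_rows = [
    {"chain": "TRA", "d_gene": ""},
    {"chain": "TRB", "d_gene": "TRBD1"},
    {"chain": "IGH", "d_gene": "IGHD3-10"},
    {"chain": "IGK", "d_gene": "None"},
]

# Report the missing fraction separately for TCR ("TR*") and BCR ("IG*") chains.
for receptor in ("TR", "IG"):
    subset = [row for row in toy_rows if row["chain"].startswith(receptor)]
    print(receptor, fraction_missing(subset, "d_gene"))
```

One thing to keep in mind when reading such numbers: only TRB, TRD, and IGH loci have D segments, so alpha, gamma, kappa, and lambda contigs will always have an empty d gene regardless of data quality.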

Now the CDR3aa no longer contains the prefix for the IG or TR -- I think other folks probably have more insight, but I had used that for looking for complete sequences. I know we can grab from V, J, or C -- so maybe not needed, but worth checking with @ravipatel4 or others if we should still keep that in.

dtm2451 commented 5 months ago

Thanks @erflynn!

I definitely agree regarding this creating a lot of meta.data bloat, I was thinking about whether we might want to load just some columns into the Seurat object, and have the full set (or fewer) grabbed here go into a metadata file.

For d_gene, yea, I just figured it might be useful sometimes and was erring towards collecting all the things we might try to use. Can def leave out?

FWR flag - could do. Just again thought it doesn't hurt to collect, aside from the meta.data bloat that we may deal with another way. I don't think it hurts to collect all the time, but could see wanting to keep the output slimmer for immediate user understand-ability purposes?

And the loss of "TRA:"/"TRB:" bits - I left this out because I thought I'd want to add it to most of the columns, which felt silly, & because of not feeling ready to commit code on it without testing for BCR data. And TCR gamma delta? But it's probably not too hard and maybe just a substr(<c_gene>, 1,3) could suffice!
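
To make the substr idea concrete, here is a small Python sketch of rebuilding the "TRA:"-style prefix from the first three characters of a gene name. The `chain_prefix` and `prefixed_cdr3` helpers are hypothetical and untested on real BCR or gamma-delta data; falling back from c_gene to v_gene and then j_gene when the C gene is missing is my own assumption, not something decided in this PR.

```python
def chain_prefix(row, gene_columns=("c_gene", "v_gene", "j_gene")):
    """Return the first three characters of the first non-missing gene name.

    For example "TRAC" gives "TRA" and "IGHM" gives "IGH". Returns "" when
    every candidate column is empty or the literal string "None".
    """
    for col in gene_columns:
        gene = (row.get(col) or "").strip()
        if gene and gene != "None":
            return gene[:3]
    return ""


def prefixed_cdr3(row):
    """Prefix the cdr3 sequence with its chain, e.g. "TRA:CAVMDSNYQLIW"."""
    prefix = chain_prefix(row)
    cdr3 = row.get("cdr3", "")
    return f"{prefix}:{cdr3}" if prefix else cdr3


# Made-up rows covering TCR alpha, BCR heavy, and a gamma chain with no C gene.
print(prefixed_cdr3({"c_gene": "TRAC", "cdr3": "CAVMDSNYQLIW"}))
print(prefixed_cdr3({"c_gene": "IGHM", "cdr3": "CARDYW"}))
print(prefixed_cdr3({"c_gene": "", "v_gene": "TRGV9", "cdr3": "CALWEVQELGKKIKVF"}))
```

If the contig file's own "chain" column turns out to be reliable, using it directly would be simpler than inferring the chain from gene names.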

Definitely interested in notes from @ravipatel4 and @AlaaALatif! Happy to alter however we decide!

erflynn commented 5 months ago

another couple thoughts:

I've seen partially missing C genes in a not insignificant portion of cells (~10% of BCRs). I assume we want to load in but maybe other folks can sanity check inclusion of this?
do we want to do grouping/filtering to select one set of sequences per chain/cell at this step? i.e., if there are two heavy chains for a particular cell, select the one with the max UMIs (break ties based on number of reads)? Suggesting this because it would be easier to do at this step than later, once we've collapsed the data.
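
The per-cell, per-chain selection suggested above could look roughly like this in Python. This is only a sketch under assumptions: the `top_contig_per_chain` helper is hypothetical, the rows are made up, and the "barcode", "chain", "umis", and "reads" column names are taken to match all_contig_annotations.csv.

```python
def top_contig_per_chain(rows):
    """Keep one contig per (barcode, chain): max UMIs, ties broken by reads.

    If two contigs tie on both UMIs and reads, the first one seen is kept.
    """
    best = {}
    for row in rows:
        key = (row["barcode"], row["chain"])
        score = (int(row["umis"]), int(row["reads"]))
        if key not in best or score > best[key][0]:
            best[key] = (score, row)
    return [row for _score, row in best.values()]


# Made-up example: one cell with two heavy chains tied on UMIs, plus a light chain.
toy_rows = [
    {"barcode": "AAAC-1", "chain": "IGH", "umis": "7", "reads": "150", "cdr3": "CARDYW"},
    {"barcode": "AAAC-1", "chain": "IGH", "umis": "7", "reads": "210", "cdr3": "CARGGYW"},
    {"barcode": "AAAC-1", "chain": "IGK", "umis": "12", "reads": "400", "cdr3": "CQQYNSYPLTF"},
]

selected = top_contig_per_chain(toy_rows)
for row in selected:
    print(row["barcode"], row["chain"], row["cdr3"])
```

Comparing `(umis, reads)` tuples gives the tie-break for free, since tuples compare element by element. Whether cells with two genuinely expressed chains of the same type (e.g. dual TRA) should be collapsed this way is a separate question for folks who know the biology better.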

erflynn commented 5 months ago

FYI have confirmed nextflow logic works with removal of clonotypes.csv using toy data. I think once we confirm what behavior is desired in terms of columns pulled in / whether to filter, we should be good to merge.

UCSF-DSCOLAB / data_processing_pipelines

Pull in more info per tcr and bcr clonotype #69