UCSF-DSCOLAB / data_processing_pipelines

A repository to store the existing pipelines to process the various CoLabs datasets

Pull in more info per tcr and bcr clonotype #69

Open dtm2451 opened 5 months ago

dtm2451 commented 5 months ago

Addresses #68

Currently this only updates the load_clonotypes util function. So it adds the grab of additional columns but does not change what is done with the data afterwards (a.k.a. it all just gets shoved into the "processed" Seurat object metadata).

Exact desired behavior is still TBD, but there are currently a few notable differences compared to previous/current behavior:

ToDo:

erflynn commented 5 months ago

this is great! one thought -- this adds a lot of columns to the metadata? I don't know how we feel about this, I wonder if we want to put some of this in misc() or something?

Do we want to include the d.gene? It's empty for most but not all of the data (for the library I was looking at -- 75% of BCR had no d gene, 82% of TCR had no d gene), and where it is present it's frequently partial. Also @AlaaALatif mentioned he did not use it. For the FWR -- I think this could be a flag? e.g. could include or could skip... Not sure, or we could just lean toward including everything and folks could remove later.
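To make the flag idea concrete, here is a rough sketch of optional column selection. It is written in Python purely for illustration (the pipeline code itself is R), and everything in it is an assumption: the function name `select_columns`, the flag names `include_fwr` and `include_d_gene`, and the column names, which are only loosely modeled on typical contig annotation headers and are not taken from this PR.

```python
# Hypothetical column groups; the real column names in load_clonotypes may differ.
FWR_COLUMNS = ("fwr1", "fwr2", "fwr3", "fwr4")
D_GENE_COLUMNS = ("d_gene",)


def select_columns(available, include_fwr=True, include_d_gene=True):
    """Return the columns to keep from `available`, preserving their order.

    Both flags default to True, matching the "collect everything and let
    folks remove later" option discussed above.
    """
    skipped = set()
    if not include_fwr:
        skipped.update(FWR_COLUMNS)
    if not include_d_gene:
        skipped.update(D_GENE_COLUMNS)
    return [col for col in available if col not in skipped]


# Example: a slimmer output that drops both optional groups.
columns = ["v_gene", "d_gene", "j_gene", "c_gene", "fwr1", "cdr3"]
slim = select_columns(columns, include_fwr=False, include_d_gene=False)
```

Defaulting both flags to "include" keeps the current PR behavior, so anyone who wants the slimmer output has to opt in explicitly.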

Now the CDR3aa no longer contains the prefix for the IG or TR -- I think other folks probably have more insight, but I had used that for looking for complete sequences. I know we can grab from V, J, or C -- so maybe not needed, but worth checking with @ravipatel4 or others if we should still keep that in.

dtm2451 commented 5 months ago

Thanks @erflynn!

I definitely agree regarding this creating a lot of meta.data bloat, I was thinking about whether we might want to load just some columns into the Seurat object, and have the full set (or fewer) grabbed here go into a metadata file.

For d_gene, yea, I just figured it might be useful sometimes and was erring towards collecting all the things we might try to use. Can def leave out?

FWR flag - could do. Again, I just thought it doesn't hurt to collect, aside from the meta.data bloat that we may deal with another way. I don't think it hurts to collect all the time, but could see wanting to keep the output slimmer for immediate user understandability purposes?

And the loss of "TRA:"/"TRB:" bits - I left this out because I thought I'd want to add it to most of the columns, which felt silly, & because of not feeling ready to commit code on it without testing for BCR data. And TCR gamma delta? But it's probably not too hard and maybe just a substr(<c_gene>, 1,3) could suffice!
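A minimal sketch of the prefix idea, untested against real BCR or gamma/delta data. It is written in Python for illustration only; `gene[:3]` plays the role of R's `substr(<c_gene>, 1, 3)`. The helper names `chain_prefix` and `prefixed_cdr3` are hypothetical, and the fallback from C to J to V gene when the C gene is empty is an assumption based on the earlier note that the chain can be read from V, J, or C.

```python
def chain_prefix(c_gene, v_gene="", j_gene=""):
    """Return the first three characters of the first non-empty gene name.

    Assumed fallback order: C gene, then J gene, then V gene.
    Returns an empty string when no gene name is available.
    """
    for gene in (c_gene, j_gene, v_gene):
        if gene:
            return gene[:3]
    return ""


def prefixed_cdr3(cdr3, c_gene, v_gene="", j_gene=""):
    """Prepend a 'TRA:'-style prefix to a CDR3 string when a prefix can be derived."""
    prefix = chain_prefix(c_gene, v_gene, j_gene)
    return f"{prefix}:{cdr3}" if prefix else cdr3


# Illustrative gene names and CDR3 string, not taken from any dataset in this repo.
example = prefixed_cdr3("CASSLG", "TRBC1")
```

Under this sketch a C gene such as "TRDC" or "IGKC" yields "TRD" or "IGK", so the same three-character rule would extend to gamma/delta and BCR chains, provided the gene names follow that naming pattern in the real data.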

Definitely interested in notes from @ravipatel4 and @AlaaALatif! Happy to alter however we decide!

erflynn commented 5 months ago

another couple thoughts:

erflynn commented 5 months ago

FYI have confirmed nextflow logic works with removal of clonotypes.csv using toy data. I think once we confirm what behavior is desired in terms of columns pulled in / whether to filter, we should be good to merge.