SchubertLab / mvTCR

MIT License

Question about `v7_avidity.h5ad` and `*binarized_matrix.csv` #20

Closed yls2g13 closed 2 months ago

yls2g13 commented 2 months ago

Thanks again for a great package.

I'm new to scTCRseq analyses. I can see and download all the files needed to run the mvTCR tutorials, which have run smoothly. How can I apply the same methods to my own 10x scTCRseq dataset? I'm having trouble understanding how to obtain my own avidity data or binarized matrix as demonstrated in the tutorials. Can a user find or obtain these from their cellranger result folder?

Hope to hear from you.

WhatMelonGua commented 2 months ago

I'm new to TCR analysis and have the same question. Should I run the training script to get a new model from my own dataset, or just download the trained models from https://zenodo.org/records/8112246 ?

I'm not sure how many epochs I should choose, or how to interpret the UMAP. Could you please give a rough summary? I'm also still not sure when to use each of the two tutorial applications: running params_optimization with name `knn_prediction` or `pseudo_metric` gives two different UMAPs. I can't understand why, because I think the model's loss is always 'TCR + RNA reconstruction loss' + 'KLD' + 'supervised loss', so why does `knn_prediction` produce more clusters?

Thank you very much!

irene-bonapa commented 2 months ago

Thank you for your interest. In our experiments, we train several models to select the hyperparameters that work best for each dataset (e.g. the weights of each modality in the loss function, the size of the encoder and decoder networks, the learning rate...). Since this is an unsupervised task, it is not clear how to select the best model. For this reason, we use a surrogate task that gives us an idea of which latent space better represents our data.

When antigen specificity is measured in a dataset (besides RNA+TCR sequencing), we recommend using it for hyperparameter tuning (tutorial 03), since it is a good reflection of T cell function. In this case, following 10x tutorials you would obtain the binarised matrix, that you can use to generate the v7_avidity.h5ad file (see here).

However, in many scRNAseq+TCR experiments, antigen specificity is not measured. In this case, we use a pseudometric, such as the segregation of cell types and clonotypes in the latent space, for hyperparameter tuning (tutorial 2). @yls2g13 If this is your case, please train your model following tutorial 2.
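To make the idea of a surrogate task concrete, here is a minimal sketch of a leave-one-out kNN score computed on a latent space. It is not mvTCR's implementation: the function name `knn_label_agreement`, the choice of plain accuracy, the Euclidean distance, and the toy data are all assumptions made for illustration. The labels could be antigen specificity when it is measured, or cell type / clonotype otherwise.

```python
import numpy as np


def knn_label_agreement(latent, labels, k=5):
    """Leave-one-out kNN accuracy (hypothetical helper, not part of mvTCR).

    Returns the fraction of cells whose label equals the majority label
    among their k nearest neighbours in the latent space.
    """
    latent = np.asarray(latent, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all cells.
    diff = latent[:, None, :] - latent[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell must not vote for itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    correct = 0
    for i, idx in enumerate(neighbours):
        values, counts = np.unique(labels[idx], return_counts=True)
        predicted = values[np.argmax(counts)]  # majority vote
        correct += int(predicted == labels[i])
    return correct / len(labels)


# Toy latent spaces (made-up data): one separates the labels, one does not.
rng = np.random.default_rng(0)
labels = np.array(["A"] * 20 + ["B"] * 20)
latent_separated = np.vstack(
    [rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))]
)
latent_mixed = rng.normal(0, 1, (40, 2))

score_separated = knn_label_agreement(latent_separated, labels)
score_mixed = knn_label_agreement(latent_mixed, labels)
```

A model whose latent space gives the higher score would be preferred during hyperparameter tuning. The score only ranks candidate models; it does not change what each model was trained to reconstruct.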

In general you should train your own model for a new dataset. Regarding the number of epochs, usually we train our models with a high maximum number of epochs (e.g. 200), but stop the models via early stopping when the loss stops decreasing.
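As an illustration of the early-stopping idea, here is a generic, framework-independent sketch. The `EarlyStopper` class, the `patience` value, and the simulated loss curve are assumptions for the example; mvTCR's own stopping criterion and parameter names may differ.

```python
class EarlyStopper:
    """Stop training once the monitored loss has not improved for
    `patience` consecutive epochs (hypothetical helper, not mvTCR's API)."""

    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def simulated_loss(epoch):
    # Made-up loss curve: decreases steadily, then plateaus at 10.
    return max(100 - 3 * epoch, 10)


max_epochs = 200  # high upper bound, as suggested above
stopper = EarlyStopper(patience=5)
stopped_at = None
for epoch in range(max_epochs):
    loss = simulated_loss(epoch)  # in practice: train one epoch, then evaluate
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With this pattern the maximum number of epochs acts only as a safety limit; the effective training length is decided by when the loss plateaus.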

I hope this helps and let me know if you have further questions!

WhatMelonGua commented 2 months ago

Thank you for your reply! It came at just the right time. Much respect!

yls2g13 commented 2 months ago

> In this case, following 10x tutorials you would obtain the binarised matrix, that you can use to generate the v7_avidity.h5ad file (see here).

Regarding the above: in the tutorial mentioned, the code reads the binarised matrix file but doesn't show how to generate it:

    # Binding data
    path_binding = path_base + f'vdj_v1_hs_aggregated_donor{i}_binarized_matrix.csv'
    binarized_matrix = pd.read_csv(path_binding, sep=',', header=0)

Apologies if I've misunderstood here!

irene-bonapa commented 2 months ago

Hi Nicole, the binarised matrices were downloaded directly from 10x, which is why the tutorials do not describe how to generate them. In the 10x application note (under "Preliminary analysis of T cells that bind to specific pMHC multimers", and also here), they describe how they generate them from the antigen specificity scores. You will find these scores under outs/per_sample_outs/<sample_name>/antigen_analysis/antigen_specificity_scores.csv after running the Cell Ranger multi pipeline on a dataset with GEX + VDJ + Antigen (BEAM) libraries.

There are different ways to define antigen specificity from binding scores, and this is outside the scope of mvTCR. You could either follow the 10x approach, use others such as those described here (Fig 2) or here (third paragraph in the results section), or use specialised methods such as ICON.
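As a rough illustration of one such approach, the sketch below turns a long table of per-cell, per-antigen scores into a binarised matrix by applying a fixed cutoff. Everything specific here is an assumption: the column names (`barcode`, `antigen`, `antigen_specificity_score`), the toy values, and the cutoff `SCORE_CUTOFF` are hypothetical, so check the header of your own antigen_specificity_scores.csv and choose a threshold that suits your data. This is not the exact 10x procedure.

```python
import io

import pandas as pd

# Toy stand-in for antigen_specificity_scores.csv (assumed column names).
toy_csv = io.StringIO(
    "barcode,antigen,antigen_specificity_score\n"
    "AAAC-1,pMHC_1,95.0\n"
    "AAAC-1,pMHC_2,2.0\n"
    "AAAG-1,pMHC_1,10.0\n"
    "AAAG-1,pMHC_2,80.0\n"
    "AAAT-1,pMHC_1,5.0\n"
    "AAAT-1,pMHC_2,1.0\n"
)
scores = pd.read_csv(toy_csv)

SCORE_CUTOFF = 50.0  # hypothetical threshold, not a recommended value

# Long -> wide: one row per cell barcode, one column per antigen.
score_matrix = scores.pivot(
    index="barcode", columns="antigen", values="antigen_specificity_score"
)

# 1 where the score passes the cutoff, 0 otherwise.
binarized_matrix = (score_matrix >= SCORE_CUTOFF).astype(int)

# Optional single label per cell: the top-scoring antigen if any passes.
has_binder = binarized_matrix.sum(axis=1) > 0
binding_label = score_matrix.idxmax(axis=1).where(has_binder, "no_binding")
```

The resulting table has the same general shape as a binarised matrix (cells by antigens), and can be joined to the cell metadata by barcode. Cells that pass the cutoff for several antigens need an explicit rule; this sketch simply keeps the top-scoring one.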