Documentation for citeseq workflow missing

bedapub / besca

BESCA (Beyond Single Cell Analysis) offers python functions for single-cell analysis

https://bedapub.github.io/besca/

GNU General Public License v3.0


Documentation for citeseq workflow missing #161

Open llumdi opened 3 years ago

llumdi commented 3 years ago

Important notes and cautions to add to the documentation:

Use an antibody name that differs from the gene name. Neither merge_mtx nor besca raises an error or warning when the names match, but when the matrix is read with citeseq=true, the index (e.g. SYMBOL) is made unique by appending a number string to each duplicate index element: '1', '2', etc. For example, if genes.tsv contains:

1   CD4 Antibody Capture
ENSG00000010610 CD4 Gene Expression

Then CD4 (for gene expression) will be renamed to CD4.1. Keep this name change in mind when plotting CD4 gene expression; otherwise the gene can appear not to be expressed.

Suggestion: raise a warning in besca if there are duplicated names, and report the resulting name changes. (A warning rather than an error, because the duplicated names come from the input 'feature_ref.csv' file supplied before running cellranger.)
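
A minimal sketch of what such a duplicate check could look like, assuming a tab-separated feature file with the columns ID, symbol and feature type as in the example above. The helper names `find_duplicate_symbols` and `warn_on_duplicates` are hypothetical and are not part of besca or merge_mtx:

```python
import csv
import io
import warnings
from collections import defaultdict


def find_duplicate_symbols(feature_lines, delimiter="\t"):
    """Return {symbol: [(feature_id, feature_type), ...]} for repeated symbols.

    Hypothetical helper; assumes the columns are ID, symbol, feature type.
    """
    seen = defaultdict(list)
    for row in csv.reader(feature_lines, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        feature_id, symbol, feature_type = row[0], row[1], row[2]
        seen[symbol].append((feature_id, feature_type))
    return {sym: rows for sym, rows in seen.items() if len(rows) > 1}


def warn_on_duplicates(feature_lines):
    """Emit one warning per duplicated symbol instead of raising an error."""
    duplicates = find_duplicate_symbols(feature_lines)
    for symbol, rows in duplicates.items():
        details = ", ".join(f"{fid} ({ftype})" for fid, ftype in rows)
        warnings.warn(
            f"Symbol '{symbol}' appears {len(rows)} times: {details}. "
            "It will be renamed when the index is made unique."
        )
    return duplicates


# The two rows from the example above, written with explicit tabs.
example = io.StringIO(
    "1\tCD4\tAntibody Capture\n"
    "ENSG00000010610\tCD4\tGene Expression\n"
)
duplicates = find_duplicate_symbols(example)
print(duplicates)
```

Reporting the feature type next to each duplicate makes it clear which entry is the antibody and which is the gene, so the user can tell which one will be renamed.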

hatjek commented 2 years ago

@schmucr1 Shall we also change merge_mtx accordingly, so that in such cases the gene symbol for the protein abundance becomes "CD4_antibody" instead of "CD4.1"? What do you think, @swalpe? Or shall we make better use of AnnData.layers?

schmucr1 commented 2 years ago

Please provide an example and let me know how to create "new" (unique) names; the decision is yours. I have never used this type of data, so I do not understand how best to "rename" the protein names.

swalpe commented 2 years ago

The suggested preferred way would be to change the symbol to the consensus gene_protein symbol, and to use the uniprotID (or uniprotID_ensemblID) as the ID, or a meaningful replacement when needed.

schmucr1 commented 2 years ago

I have included in merge_mtx a test that gives a warning if there are duplicated gene symbols (2nd column) in the input feature files. A second test checks whether, for "Antibody Capture" features, the first column contains UniProt IDs (using the regex provided by UniProt) and the 2nd column contains two symbols concatenated by an underscore (_), each at least 3 characters long (also checked by regex).

col 1 pattern (uniprot): "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
col 2 pattern (protein gene symbols): "[[:alnum:]]{3,}_[[:alnum:]]{3,}"
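
For illustration, the two patterns can be tried out in Python roughly as follows. This is a sketch, not the merge_mtx implementation: Python's `re` module does not support the POSIX class `[[:alnum:]]`, so it is approximated here with `[A-Za-z0-9]`, and matching against the whole field with `fullmatch` is an assumption. The function name `check_antibody_feature` and the symbol "CD4_CD4" are hypothetical; P01730 is the UniProt accession for human CD4.

```python
import re

# Column 1: UniProt accession pattern as quoted above.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Column 2: two symbols of at least 3 characters joined by an underscore.
# [[:alnum:]] is replaced by an ASCII approximation (assumption).
SYMBOL_RE = re.compile(r"[A-Za-z0-9]{3,}_[A-Za-z0-9]{3,}")


def check_antibody_feature(feature_id, symbol):
    """Return a list of problems found for one "Antibody Capture" row."""
    problems = []
    if UNIPROT_RE.fullmatch(feature_id) is None:
        problems.append(f"ID '{feature_id}' does not look like a UniProt ID")
    if SYMBOL_RE.fullmatch(symbol) is None:
        problems.append(f"symbol '{symbol}' is not of the form gene_protein")
    return problems


ok = check_antibody_feature("P01730", "CD4_CD4")
bad = check_antibody_feature("1", "CD4")
print(ok)
print(bad)
```

With these patterns, the antibody row from the original example (ID "1", symbol "CD4") fails both checks, while a row with a UniProt accession and an underscore-joined symbol passes.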