Open llumdi opened 3 years ago
@schmucr1 Shall we also change merge_mtx accordingly, so that in such cases the gene symbol for the protein abundance becomes "CD4_antibody" instead of "CD4.1"? What do you think, @swalpe? Or shall we make better use of AnnData.layers?
Please provide an example and let me know how to create "new" (unique) names; the decision is yours. I have never used this type of data, so I do not understand how best to "rename" the protein names.
The suggested preferred way is to change the symbol to the consensus gene_protein symbol and to use the UniProt ID (or uniprotID_ensemblID) as the ID, or a meaningful replacement when needed.
I have included in merge_mtx a test that gives a warning if there are duplicated gene symbols (2nd column) in the input feature files. A second test checks, for "Antibody Capture" features, whether the first column contains UniProt IDs (using the regex provided by UniProt) and whether the second column contains two symbols joined by an underscore (_), each at least 3 alphanumeric characters long (also checked by regex).
col 1 pattern (uniprot): "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
col 2 pattern (protein gene symbols): "[[:alnum:]]{3,}_[[:alnum:]]{3,}"
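To make the two checks concrete, here is a minimal Python sketch of how the patterns above could be applied to a single "Antibody Capture" row. It is not the actual merge_mtx code: the function name `check_antibody_feature` is hypothetical, anchoring the patterns to the whole field with `fullmatch` is an assumption, and the POSIX class `[[:alnum:]]` is rewritten as `[A-Za-z0-9]` because Python's `re` module does not support it.

```python
import re

# Patterns quoted above. [[:alnum:]] is replaced by [A-Za-z0-9] for Python's re.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
SYMBOL_RE = re.compile(r"[A-Za-z0-9]{3,}_[A-Za-z0-9]{3,}")


def check_antibody_feature(feature_id, symbol):
    """Return a list of problems for one "Antibody Capture" row.

    Hypothetical helper: it matches each field in full (an assumption),
    and returns an empty list when both columns look as expected.
    """
    problems = []
    if UNIPROT_RE.fullmatch(feature_id) is None:
        problems.append(f"column 1 is not a UniProt ID: {feature_id!r}")
    if SYMBOL_RE.fullmatch(symbol) is None:
        problems.append(f"column 2 is not of the form gene_protein: {symbol!r}")
    return problems


# Illustrative rows (made up for this example, not taken from a real feature file).
print(check_antibody_feature("P01730", "CD4_CD4"))  # no problems reported
print(check_antibody_feature("CD4", "CD4"))         # both columns flagged
```

Returning a list of messages rather than raising lets the caller decide whether to warn or fail, which matches the warning-only behaviour described above.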
Important notes and cautions:
merge_mtx and besca will not raise an error or warning, but when the matrix is read with citeseq=true, the index (e.g. SYMBOL) is made unique by appending a number string to each duplicate index element: '1', '2', etc. For example, if genes.tsv lists the symbol CD4 more than once, then CD4 (for gene expression) will be converted to CD4.1. It is important to keep this name change in mind when plotting CD4 gene expression (otherwise the gene can appear not to be expressed).
Suggestion: raise a warning in besca if there are duplicated names, and report the name changes. (This should not be an error, because the duplicated names come from the input 'feature_ref.csv' file, before cellranger is run.)
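A minimal sketch of what such a warning could look like, in plain Python. This is not besca code: the function name `make_unique_with_warning`, the "." separator, and the choice to keep the first occurrence unchanged are all assumptions for illustration. Which row actually gets renamed in practice depends on the row order in the input and on how the reader makes the index unique. The sketch also does not guard against a new name colliding with a symbol that already exists.

```python
import warnings
from collections import Counter


def make_unique_with_warning(symbols, sep="."):
    """Make symbols unique and warn about every rename.

    Hypothetical helper: the first occurrence of a symbol is kept as is,
    later occurrences get sep + running number appended (assumption).
    Returns the unique names and a dict {position: (old, new)}.
    """
    seen = Counter()   # how often each symbol has been seen so far
    unique = []
    renamed = {}
    for position, name in enumerate(symbols):
        if seen[name] > 0:
            new_name = f"{name}{sep}{seen[name]}"
            renamed[position] = (name, new_name)
        else:
            new_name = name
        seen[name] += 1
        unique.append(new_name)

    if renamed:
        changes = ", ".join(
            f"row {pos}: {old} -> {new}" for pos, (old, new) in renamed.items()
        )
        # A warning, not an error: the duplicates originate in the input files.
        warnings.warn(f"Duplicated names were made unique ({changes})")
    return unique, renamed


# Illustrative input (made up): CD4 appears twice.
names, changes = make_unique_with_warning(["CD4", "CD8A", "CD4"])
print(names)
print(changes)
```

Reporting the old and new name together with the row position would let users see directly which feature to request when plotting.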