LungCellAtlas / HLCA

MIT License

Ensembl IDs #10

Closed (hansen7 closed 11 months ago)

hansen7 commented 12 months ago

Hi, thanks for the contribution!

Do you know what the best way would be to convert the gene names into Ensembl IDs?

hansen7 commented 12 months ago

I am using the following function to convert gene names into Ensembl IDs, but of the 28024 genes in total, 1659 gene names map to multiple Ensembl IDs and 6731 have no matching Ensembl ID.

from biothings_client import get_client

def convert_to_ensembl(gene_names):
    mg = get_client('gene')
    response = mg.querymany(gene_names, scopes='symbol', fields='ensembl.gene', species='human', returnall=True)

    missing = response.get('missing', [])
    duplicates = response.get('dup', [])

    # Map each queried symbol to every Ensembl gene ID returned for it.
    success = {}
    for item in response['out']:
        query = item.get('query')
        ensembl_data = item.get('ensembl')
        if not query or not ensembl_data:
            continue
        # 'ensembl' is a dict for a single match and a list of dicts for several.
        if isinstance(ensembl_data, dict):
            ensembl_data = [ensembl_data]
        ensembl_genes = [d['gene'] for d in ensembl_data if 'gene' in d]
        # A symbol can appear in several hits; accumulate instead of overwriting.
        success.setdefault(query, []).extend(ensembl_genes)

    if missing:
        print(f"Missing: {missing}")
    if duplicates:
        print(f"Duplicates: {duplicates}")

    return success

gene_names = ['TP53', 'BRCA1', 'C1orf112', 'FAM214B', 'RTEL1-TNFRSF6B']  # Add your gene names
result = convert_to_ensembl(gene_names)
print(f"Successful conversions: {result}")
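To put numbers on the three outcomes described above (one match, several matches, no match), the result can be tallied as in the sketch below. `summarize_mapping` is a hypothetical helper, not part of biothings_client. It assumes a dict shaped like the return value of `convert_to_ensembl` plus the original list of queried names, and the IDs in the toy input are placeholders rather than real Ensembl identifiers.

```python
def summarize_mapping(gene_names, mapping):
    """Split queried gene names into unique, ambiguous, and unmapped groups.

    Hypothetical helper: `mapping` is assumed to map a gene name to a list of
    Ensembl gene IDs, as returned by convert_to_ensembl.
    """
    unique, ambiguous, unmapped = {}, {}, []
    for name in gene_names:
        # Drop None entries and repeated IDs while keeping first-seen order.
        ids = list(dict.fromkeys(i for i in mapping.get(name, []) if i))
        if not ids:
            unmapped.append(name)
        elif len(ids) == 1:
            unique[name] = ids[0]
        else:
            ambiguous[name] = ids
    return unique, ambiguous, unmapped


# Toy input with placeholder IDs (not real Ensembl identifiers).
example = {"GENE_A": ["ID_1"], "GENE_B": ["ID_2", "ID_3"]}
unique, ambiguous, unmapped = summarize_mapping(
    ["GENE_A", "GENE_B", "GENE_C"], example
)
print(len(unique), len(ambiguous), len(unmapped))
```

Keeping the ambiguous names separate makes it possible to resolve them by hand, or against another source, without silently picking one ID.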

LisaSikkema commented 11 months ago

Hi @hansen7, both gene names and Ensembl IDs should be in the atlas object. Did you check adata.var?

hansen7 commented 11 months ago

Hi @LisaSikkema, thanks, you are right! The ensembl_id is in the index column of adata.var.
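
For anyone landing here with the same question, a minimal sketch of reading the IDs straight from the atlas object follows. It uses a small pandas DataFrame as a stand-in for `adata.var`, with the Ensembl IDs in the index as noted above. The column name `gene_symbol` and the IDs are placeholders chosen for illustration, so check `adata.var.columns` for the actual column name.

```python
import pandas as pd


def symbols_to_ensembl(var, symbol_col):
    """Map each gene symbol in var[symbol_col] to the index values (Ensembl IDs)."""
    mapping = {}
    for ensembl_id, symbol in var[symbol_col].items():
        mapping.setdefault(symbol, []).append(ensembl_id)
    return mapping


# Stand-in for adata.var: the index holds the IDs, one column holds gene names.
# The column name and the IDs are placeholders, not taken from the atlas.
var = pd.DataFrame(
    {"gene_symbol": ["GENE_A", "GENE_B", "GENE_B"]},
    index=["ID_1", "ID_2", "ID_3"],
)
mapping = symbols_to_ensembl(var, "gene_symbol")
print(mapping)
```

With the real object, the same function would be called on `adata.var` with whichever column holds the gene names. Returning a list per symbol keeps any symbol that appears on more than one row visible instead of dropping all but one ID.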

LisaSikkema commented 11 months ago

Great, glad you found it!