provided dataset is not found

gouinK commented 4 years ago

Hi all, does anyone have an idea as to why the following simple query results in a "provided dataset is not found" error? I have used biomart in R for a long time, but am now moving all of my scripts to python and would really like this aspect to work. Thanks!

ds = 'hsapiens_gene_ensembl'
bm = Biomart(host="www.ensembl.org", verbose=True)
bm.new_query()
bm.add_dataset_to_xml(ds)
bm.add_attribute_to_xml('hgnc_symbol', dataset=ds)
queries = ['PDCD1']
bm.add_filter_to_xml('hgnc_symbol', queries, dataset=ds)
xml = bm.get_xml()
res = bm.query(xml)

version: bioservices bioconda/noarch::bioservices-1.7.8-pyh864c0ab_0

cokelaer commented 4 years ago

@gouinK your code as provided is incorrect: the variable ds is not defined. To find out which datasets are valid, use

bm.get_datasets(MART)

where MART is a valid mart, e.g. ENSEMBL_MART_ENSEMBL. I hope this makes sense.

I've been using this service today with bioservices 1.7.8 and it worked perfectly well. The logic behind it is not always straightforward, but you will know it from your previous experience. In case it helps, this page of the documentation may also be useful: https://bioservices.readthedocs.io/en/master/biomart.html

gouinK commented 4 years ago

Hi there, the first line of my code snippet above is this: ds = 'hsapiens_gene_ensembl'

Is this incorrect?

cokelaer commented 4 years ago

@gouinK

Your code is correct, in particular the dataset. However, there is an issue with the filter itself.

Unfortunately, as far as I know, there is no way for bioservices to check the expected type of the filter.

Here, you set

queries = ['PDCD1']

Instead of using add_filter_to_xml, which is a feature of biomart, I would recommend doing the filtering a posteriori, using e.g. Pandas:

from bioservices import BioMart
b = BioMart()
ds = 'hsapiens_gene_ensembl'
b.new_query()
b.add_dataset_to_xml("hsapiens_gene_ensembl")
b.add_attribute_to_xml("hgnc_symbol", dataset=ds)
b.add_attribute_to_xml("ensembl_gene_id", dataset=ds)
b.add_filter_to_xml('hgnc_symbol', 'PDCD1', dataset=ds)
xml = b.get_xml()
res = b.query(xml)

Then, to check the content in a nice Pandas DataFrame:

import pandas as pd
import io
df = pd.read_csv(io.StringIO(res), sep='\t', header=None)
df.columns = ['hgnc', 'ensembl_id']

and you get:

PDCD1  ENSG00000188389
PDCD1  ENSG00000276977
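
For reference, a minimal sketch of what filtering a posteriori could look like when the query is run without any filter and the rows are selected afterwards. It uses only the standard library rather than Pandas, and it works on a hard-coded stand-in for the query result so that it runs offline. The helper name filter_rows and the OTHER_GENE / PLACEHOLDER_ID row are made up for illustration; only the two PDCD1 rows come from the output above.

```python
import csv
import io

# Stand-in for the tab-separated string returned by b.query(xml).
# The two PDCD1 rows are the ones shown above; the OTHER_GENE row is a
# made-up placeholder so that the filter has something to drop.
res = (
    "PDCD1\tENSG00000188389\n"
    "PDCD1\tENSG00000276977\n"
    "OTHER_GENE\tPLACEHOLDER_ID\n"
)


def filter_rows(tsv_text, wanted_symbols):
    """Keep only the rows whose first column is one of wanted_symbols."""
    wanted = set(wanted_symbols)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip empty lines, then compare the first column (the gene symbol).
    return [row for row in reader if row and row[0] in wanted]


rows = filter_rows(res, ["PDCD1"])
for symbol, ensembl_id in rows:
    print(symbol, ensembl_id)
```

With a DataFrame built as above, the equivalent selection would presumably be a boolean mask on the hgnc column.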

If you want to add another gene, use commas:

    b.add_filter_to_xml('hgnc_symbol', 'PDCD1,MT-TF', dataset=ds)
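
Since the filter value is passed as a single comma-separated string rather than a Python list, a small conversion step can help when the gene symbols start out as a list. The sketch below is only an illustration: to_filter_value is a hypothetical helper and not part of bioservices.

```python
def to_filter_value(values):
    """Return a comma-separated string from a string or an iterable of strings.

    Hypothetical helper (not part of bioservices): a plain string is
    returned unchanged, anything else is joined with commas.
    """
    if isinstance(values, str):
        return values
    return ",".join(str(v).strip() for v in values)


value = to_filter_value(["PDCD1", "MT-TF"])
print(value)

# The result would then be passed to the filter call, e.g.:
# b.add_filter_to_xml('hgnc_symbol', value, dataset=ds)
```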

cokelaer / bioservices

provided dataset is not found #173