SuLab / biogps_dataset

BioGPS.org dataset & dataset loading code repository
http://biogps.org/

Detailed steps for setting up a local development version of BioGPS for dataset loading

Setting up your local development environment:

Use git for version control (biogps_dataset was migrated to GitHub in May 2016).

Clone the repository and create a virtual environment (use your own GitHub username, of course!)

You must have three main components running in order to see datasets:

1) SSH into the remote BioGPS database

Because the dataset database is much too large to install on your computer for local development, you need to request a connection to our dev database server.

2) Run the localhost server

The settings_dev file is a "secret file." Please ask Chunlei or the BioGPS project manager for it.

3) Run Elasticsearch

Install Elasticsearch using these directions:

Elasticsearch is a search server based on Lucene.

It is a full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

From within the elasticsearch folder that you set up, run:

Next, get data from a BioGPS user/researcher

You will need to get an info sheet, a factors sheet, and an RNAseq data/matrix file from a scientist.

Is the local dataset you are loading an RNAseq dataset that uses gene symbols?

If yes, then you must run reporter_to_entrezgene.py, which uses mygene.info to replace the gene symbols with Entrezgene IDs.

Entrezgene IDs are absolutely necessary for displaying RNAseq data on BioGPS.org. For microarray datasets, keep the probe set reporters instead.
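The symbol replacement described above can be sketched roughly as follows. This is not the code in reporter_to_entrezgene.py; it is a minimal illustration that assumes the public mygene.info v3 `query` endpoint and its usual `hits` response shape. The function names, the sample rows, and the choice to drop unmapped symbols are all hypothetical.

```python
from urllib.parse import urlencode

MYGENE_QUERY_URL = "https://mygene.info/v3/query"  # public mygene.info v3 endpoint


def build_symbol_query(symbol, species="human"):
    """Return a mygene.info URL that looks up one gene symbol."""
    params = {"q": "symbol:" + symbol, "species": species, "fields": "entrezgene"}
    return MYGENE_QUERY_URL + "?" + urlencode(params)


def entrezgene_from_response(response):
    """Return the Entrezgene ID of the first hit that has one, else None."""
    for hit in response.get("hits", []):
        if "entrezgene" in hit:
            return str(hit["entrezgene"])
    return None


def replace_symbols(rows, symbol_to_id):
    """Swap the reporter in the first column for its Entrezgene ID.

    Assumption: rows whose symbol has no mapping are dropped.
    """
    return [[symbol_to_id[row[0]]] + row[1:] for row in rows if row[0] in symbol_to_id]
```

In practice the response dictionary would come from fetching the built URL and decoding the JSON body; that network step is left out here so the sketch stays self-contained.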

Dataset Parsers:

Run the command like this using Django's manage.py, where "load_ds_local" can be replaced by other commands:

Next, run the es_index command to index the data. The newly loaded dataset should then appear in the chart file:

Output looks something like this:

Open this URL and you should see bar charts!

http://localhost:8000/static/data_chart.html

You may sometimes need to restart the localhost server and the server hosting the database, as well as Elasticsearch.

For help:

Instances (models) to create during dataset loading:

If you don't know what a model is, then read about Django!

BioGPS averages the samples for you, so you don't need to supply a user-computed average.
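To make the averaging concrete, here is a small sketch of how replicate samples might be collapsed into one value per group. How BioGPS actually groups and averages samples is not shown in this README, so the function name, the group labels, and the use of a plain arithmetic mean are assumptions.

```python
from collections import defaultdict
from statistics import mean


def average_by_group(values, groups):
    """Average expression values whose samples share a group label.

    values: one expression value per sample, for a single reporter.
    groups: the group label of each sample, in the same order.
    """
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    grouped = defaultdict(list)
    for value, group in zip(values, groups):
        grouped[group].append(value)
    return {group: mean(members) for group, members in grouped.items()}
```

For example, `average_by_group([1.0, 3.0, 10.0], ["liver", "liver", "heart"])` collapses the two liver replicates into a single value and leaves the lone heart sample unchanged.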

Misc. information for testing/developing BioGPS:

URLs from mygene.info used to get the Entrezgene IDs from gene symbols (from reporter_to_entrezgene):

To access the dataset via the shell:

Run these commands from the shell:

This returns the dataset object, which is the foreign key for the dataset data and dataset matrix:

This returns all the metadata (from the info sheet and factors sheet):

Viewing datasets on your BioGPS localhost

The entries in the "probeset" dropdown menu are what BioGPS also calls reporters.

Go to the URL for the specific gene and dataset (identified by the dataset's primary key or its geo_gse_id). The geo_gse_id is also important: for a new dataset it will be BDS_XXXXX, using the next number in the sequence.
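As an illustration of "next number in sequence", the helper below derives the next BDS identifier from a list of existing ones. It is a sketch only: the function name is hypothetical, and the five-digit zero padding is an assumption read off the BDS_XXXXX pattern.

```python
import re

BDS_PATTERN = re.compile(r"BDS_(\d+)")


def next_bds_id(existing_ids, width=5):
    """Return the BDS_ identifier that follows the highest existing one.

    IDs that do not match BDS_<digits> (for example GEO series IDs) are ignored.
    """
    numbers = []
    for dataset_id in existing_ids:
        match = BDS_PATTERN.fullmatch(dataset_id)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return "BDS_" + str(next_number).zfill(width)
```

The list of existing IDs would come from the datasets already loaded in the database.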

Example dataset viewing URLs:

Example admin:

The standard test gene is 1017, which is a human gene, so it will understandably be missing from a mouse dataset:

CDK2 cyclin-dependent kinase 2, Homo sapiens (human) Gene ID: 1017, updated on 6-Mar-2016

Cdk2 cyclin-dependent kinase 2, Mus musculus (house mouse) Gene ID: 12566, updated on 6-Mar-2016

You can also check the "fixed reporters" data file to see which Entrezgene IDs are actually in your dataset for viewing.
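The species check can be sketched as a small lookup. The two gene IDs come from the entries above; the function name and the idea of passing in the set of Entrezgene IDs found in your dataset (for example, read from the "fixed reporters" file) are assumptions for illustration.

```python
# Test genes listed above: CDK2 (human) and Cdk2 (mouse).
TEST_GENES = {"human": "1017", "mouse": "12566"}


def pick_test_gene(species, dataset_gene_ids):
    """Return the test gene ID for a species if the dataset contains it, else None."""
    gene_id = TEST_GENES[species]
    return gene_id if gene_id in dataset_gene_ids else None
```

A `None` result means the usual test gene is absent, so pick another Entrezgene ID that is actually present in the dataset.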

To view the full dataset (API) for a dataset and gene:

Misc. Information

Does your dataset have interesting tissue groups or organ systems?

If so, change the color_idx in the JSON metadata (e.g. admin/dataset/biogpsdataset/2509/) to group samples into meaningful groups. This is done manually because there are so many possible ways to group samples.

Make sure to run Flake8 (to check for PEP 8 compliance) before pushing code to the biogps_dataset repository.
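Although the grouping is done by hand, the idea behind color_idx can be sketched as giving every sample in the same tissue group the same integer. The layout of the real JSON metadata is not shown here, so the input format, the function name, and numbering groups from 0 are all assumptions.

```python
def assign_color_idx(sample_groups):
    """Map each sample name to an integer color_idx shared by its group.

    sample_groups: iterable of (sample_name, group_name) pairs.
    Groups are numbered in order of first appearance, starting at 0.
    """
    group_to_idx = {}
    sample_to_idx = {}
    for sample, group in sample_groups:
        if group not in group_to_idx:
            group_to_idx[group] = len(group_to_idx)
        sample_to_idx[sample] = group_to_idx[group]
    return sample_to_idx
```

The resulting mapping could then be copied by hand into the color_idx fields of the dataset's metadata in the admin.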