add more species to icebox server

nathandunn commented 9 years ago

This is in support of the Monarch project, but will also test the system and provide some good homologous test data.

nathandunn commented 9 years ago

Basically, for each organism, I want a track of genes, and the conservation of the genome, if available. Easy peasy. For human and mouse, there’s some extra stuff: regulatory tracks (that’s for another project we’re getting started). For testing purposes, you could just start with one regulatory/enhancer track (rather than the 10-20), and we’ll see how it looks.

Then, we should investigate whether it is better for us to host features to show on one or more additional track(s), or to dump them en masse for a JBrowse server. I am not sure what the right workflow should be for the best performance.

[ ] Generate names for all once set (and should probably include the ENSDARG if possible)
[x] default BigWig should be blue / gray and height 30, no variance and include both XY and Density
[ ] a "collapsed" transcript view for evidence (similar to what we do for annotation) would be great (displayMode = 'collapsed')
[ ] add "taxon" identification to organism?!? (may require another bug)
[x] format chromosomes to chrN (don't necessarily need the others, but can't hurt)

These species are core, and in order of importance:

Human (we have this data)

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/dna/
[ ] Conservation (99 vertebrates): http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons100way/
[ ] Genes: ftp://ftp.ensembl.org/pub/release-81/gff3/homo_sapiens:
- [ ] Human: type none (wire)
- [ ] Human: type transcript
- [ ] Human: type transcript,mRNA
- [ ] Human: type transcript:ensembl_havana
[ ] Regulatory features: ftp://ftp.ensembl.org/pub/release-81/regulation/homo_sapiens/
[ ] Enhancers: http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz

Mouse

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/mus_musculus/dna/
[ ] Conservation (59 vertebrates): http://hgdownload.cse.ucsc.edu/goldenPath/mm10/phyloP60way/
[ ] Genes: ftp://ftp.ensembl.org/pub/release-81/gff3/mus_musculus
[ ] Regulatory features: ftp://ftp.ensembl.org/pub/release-81/regulation/mus_musculus/
[ ] Enhancers: http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/mouse_permissive_enhancers_phase_1_and_2.bed.gz

Zebrafish

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/danio_rerio/dna/
[x] Conservation (7 genomes): http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/phastCons8way/
[x] Genes: ftp://ftp.ensembl.org/pub/release-81/gff3/danio_rerio

Drosophila

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/drosophila_melanogaster/dna/
[ ] Conservation (26 insects): http://hgdownload.cse.ucsc.edu/goldenPath/dm6/phyloP27way/
[ ] Genes: ftp://ftp.ensembl.org/pub/release-81/gff3/drosophila_melanogaster

C. elegans

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/caenorhabditis_elegans/dna/
[ ] conservation (6 worms): http://hgdownload.cse.ucsc.edu/goldenPath/ce10/phastCons7way/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/caenorhabditis_elegans

Others that would be nice to have, loosely in order of preference:

Dog (3 vertebrates):

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/canis_familiaris/dna/
[ ] conservation: http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/multiz4way/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/canis_familiaris

Pig

[x] dna genome: ftp://ftp.ensembl.org/pub/release-81/fasta/sus_scrofa/dna/
[ ] gene annotations: ftp://ftp.ensembl.org/pub/release-81/gff3/sus_scrofa

Cow (compare against other cow):

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/bos_taurus

Chicken

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/gallus_gallus/dna/
[ ] conservation (6 vertebrates): http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/phastCons7way/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/gallus_gallus

Sheep:

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/ovis_aries/dna/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/ovis_aries

Horse

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/equus_caballus/dna/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/equus_caballus

Cat (3 vertebrates):

[x] sequence: ftp://ftp.ensembl.org/pub/release-81/fasta/felis_catus/dna/
[ ] genes: ftp://ftp.ensembl.org/pub/release-81/gff3/felis_catus

nathandunn commented 9 years ago

all data downloaded

cmdcolin commented 9 years ago

Lemme know if you have any questions about loading the datasets from ensembl/refseq/ucsc... there are some tricks sometimes

nathandunn commented 9 years ago

Most of it I have.... I have to rewrite some of the FASTA files to create better chromosome names.
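One way to do the FASTA renaming is a small awk filter. The following is a minimal sketch, not the script actually used here: `rename_fasta_headers` is a hypothetical helper name, and it assumes Ensembl-style headers where the sequence name is the first token after `>`. It prefixes every name with `chr` and drops the rest of the header line. The seqids in column 1 of the GFF3 files may need the same prefix so that the features line up.

```shell
# Hypothetical helper (not part of JBrowse or Apollo): reads FASTA on stdin,
# prefixes each sequence name with "chr", and drops the header description.
rename_fasta_headers() {
  awk '
    /^>/ { name = substr($1, 2); print ">chr" name; next }  # header line
         { print }                                           # sequence line, unchanged
  '
}

# Example usage (file names are placeholders):
# rename_fasta_headers < input.fa > output.chr.fa
```

Note that this applies the prefix to every sequence, scaffolds included.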

My bigger question is... what is the preferred method for importing the GFF3 gene tracks? I always forget the type or if it's necessary.

For example:

bin/flatfile-to-json.pl --gff /data/jbrowse/monarch/cow2/raw/Bos_taurus.UMD3.1.81.gff3 --out /data/jbrowse/monarch/cow2 --trackLabel Cow2

You can see these look pretty horrible. I can retry with --type gene, mRNA, transcript, etc., and I also need to exclude the chromosome:

http://icebox.lbl.gov/Apollo2/jbrowse/index.html?loc=1:151964931..158337067&organism=158569&tracks=Cow2

Anyway, any pointers would be great,

Nathan

On Sep 15, 2015, at 11:58 AM, Colin Diesh notifications@github.com wrote:

Lemme know if you have any questions about loading the datasets from ensembl/refseq/ucsc... there are some tricks sometimes


cmdcolin commented 9 years ago

The Ensembl GFF3 files normally use "transcript" instead of "mRNA" for their transcript types, so just pass that to the --type argument.

bin/flatfile-to-json.pl --type transcript --gff "sorted gff file" --out /opt/apollo/organism --trackLabel Ensembl_transcripts

There are also some cool extra filters that you can add to the --type argument.

For example, you can load multiple types into one track:

 bin/flatfile-to-json.pl --type transcript,mRNA --gff "sorted gff file" --out /opt/apollo/organism --trackLabel Ensembl_transcripts_and_mRNA

That would load both "transcript" and "mRNA" from column 3

You can also filter on column 2 (source) and column 3 (type) simultaneously:

 bin/flatfile-to-json.pl --type transcript:ensembl_havana --gff "sorted gff file" --out /opt/apollo/organism --trackLabel Ensembl_havana_transcripts

That would only load the havana sourced transcripts into the track
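To check which values are worth passing to --type before loading, it can help to tally what is actually in the file. This is a minimal sketch, assuming a standard nine-column, tab-separated GFF3; `gff3_type_counts` is a hypothetical helper and not part of JBrowse. It prints each combination as type:source, in the same order as the filter above.

```shell
# Hypothetical helper (not part of JBrowse): count each type:source pair
# found in columns 3 and 2 of a GFF3 file, most frequent first.
gff3_type_counts() {
  # Skip comment/directive lines and anything without the nine GFF3 columns.
  awk -F'\t' '!/^#/ && NF >= 9 { print $3 ":" $2 }' "$1" \
    | sort | uniq -c | sort -rn
}

# Example usage (the path is a placeholder):
# gff3_type_counts /path/to/annotations.gff3
```

Any pair that shows up in the output, such as transcript:ensembl_havana, can then be tried as a --type value.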

nathandunn commented 9 years ago

Ha, so they were all right! I think I also inherited some of the Apollo trackList decoration, as well.

Nathan


nlwashington commented 9 years ago

tagging myself @nlwashington

nathandunn commented 9 years ago

Discussion with @nlwashington, looked at human and zebrafish. Changes added to todo-list.

having both gene and transcript is good, but we want to be able to collapse the transcript in evidence
bigwig tracks should be about 30 pixels high, no variance, XY & Density plots
don't need fragmented scaffolds, but okay to have
need to be able to look up by identifier, not just name (if possible), though can grab the coordinates
would be good to add the taxon ID

nathandunn commented 9 years ago

@nlwashington . .

[ ] Waiting for Monarch team to get their own server.

I think I want to create a mildly separate build off of 2.0 / master branch that consists of:

[ ] collapsible transcript view on evidence tracks (already implemented in 2.1) . . a very simple back-port
[ ] inclusion of Neato plugin to properly draw genes

But without being dependent on 2.1. Should be pretty quick to get up and running . . can copy the existing data over via ssh.

GMOD / Apollo

add more species to icebox server #568