galaxyproject / usegalaxy-playbook

Ansible Playbook for usegalaxy.org
Academic Free License v3.0

Genome Additions Master Ticket #242

Open jennaj opened 8 years ago

jennaj commented 8 years ago

Genome and indexes for CVMFS and http://usegalaxy.org

CONVERT THIS ISSUE TO A PROJECT @jennaj

This list changes over time as new data sources are targeted for indexing and user requests are considered. See posts below for genome batches completed and in progress.

Current plans are to bring http://usegalaxy.org up to date with UCSC's released genomes, indexed for all tools, so those do not need to be requested by users at this time.

Main:

Other reference data:

Admin/Local Data and DM usage enhancements are included here.

Resolved data issues:

New genomes and indexes will be installed at https://test.galaxyproject.org/ first for testing. If your genome is listed and checked as complete, community testing and feedback can be posted to https://biostar.usegalaxy.org or through a bug report from the error dataset (from a mapping tool, etc).

All data will later be promoted to http://usegalaxy.org. Timeline is not firm.

Making a Reference genome request

jennaj commented 8 years ago

For reference,

Master spreadsheet of dbkeys and indexes done and to-do. Older genomes removed. https://docs.google.com/spreadsheets/d/1jtDC-2STroUINP6KVrfhZwGQgpP5y-HhkRMONZtD1W4/edit?usp=sharing

dbkeys with fasta loaded. https://gist.github.com/jennaj/aeb8d6af4e4722a89f62d15af8ce3452

jennaj commented 8 years ago

Issues detected

Meets goal of consistency in nomenclature, permitting DMs to function

change genome label in all_fasta (kill "full" in all descriptions/dbkeys). Other locs may need mods.

jennaj commented 8 years ago

Genomes that need followup:

Not at http://genome.ucsc.edu (browser or download). Are at http://genome-test.cse.ucsc.edu/. Pending release?

jennaj commented 8 years ago

Completed and indexes promoted to http://usegalaxy.org (Galaxy Main) April 2016

Fasta

New genomes (confirmed, to be indexed for all)

2bit

Note: Lastz indexes created by same DM

New genomes

Existing

Sam

New genomes

Existing

Picard

New genomes

Existing

Bowtie2/Tophat2

Issue about Bowtie2 DM creating duplicate indexes: https://github.com/galaxyproject/tools-devteam/issues/319

New genomes

Existing

BWA/BWA-MEM

New genomes

Existing

HISAT2

New genomes

Existing

Liftover

See distinct tracking checklist, below

jennaj commented 8 years ago

2018

Fasta

New genomes (confirmed, to be indexed for all)

2bit

Note: Lastz indexes created by same DM

New genomes

Existing

Sam

New genomes

Existing

Picard

New genomes

Existing

Bowtie2/Tophat2

New genomes

Existing

BWA/BWA-MEM

New genomes

Existing

HISAT2

New genomes

Existing

Liftover

See distinct tracking checklist, below

RNA STAR

Fast tracked genomes https://github.com/galaxyproject/galaxy/issues/1470#issuecomment-307517254

New genomes

Existing


jennaj commented 8 years ago

New genomes under review (source/licence)

jennaj commented 8 years ago

Liftover

Needs DM: https://github.com/galaxyproject/galaxy/issues/1904

Workaround: Use the LiftOver tool at UCSC (the source for the wrapped version in Galaxy) and upload the results to Galaxy to use with other analysis. http://genome.ucsc.edu/cgi-bin/hgLiftOver

New genomes

Existing (update)

natefoo commented 8 years ago

@jennaj I updated the April 2016 comment to include the missing BWA indexes that I was able to build with the BWT-SW algorithm.

Some builds with full/canonical variants (like galGal3 and panTro3) I rebuilt. The only difference from the original DM run is that after selecting the correct build from "Source FASTA Sequence", I put the build variant name (e.g. galGal3canon) in the "ID for sequence" field. Otherwise the builds clobber each other in the index dir on disk: the "ID for sequence" field is used for naming the index subdirectory and defaults to the dbkey, which for both full and canonical builds is still just e.g. galGal3. This could be a bit more intuitive in the DM. I had no idea what "ID for sequence" was for until I noticed that two loc file entries pointed to the same directory/indexes on disk and then dug into the DM code to understand it. I rebuilt these for every index that originally had the variants built, and cleaned up the old directories and their entries in the location files.
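To illustrate the collision described in this comment, here is a minimal sketch. It is not the data manager's actual code: `index_subdir` and the base path are hypothetical, and the only behavior taken from the comment is that the subdirectory name comes from "ID for sequence" and falls back to the dbkey when that field is left empty.

```python
from pathlib import PurePosixPath
from typing import Optional


def index_subdir(base: str, dbkey: str, sequence_id: Optional[str] = None) -> PurePosixPath:
    """Hypothetical helper: name the index directory after the
    "ID for sequence" value, falling back to the dbkey when it is empty."""
    return PurePosixPath(base) / (sequence_id or dbkey)


base = "/data/bwa_index"  # assumed location, for illustration only

# Both variants share the dbkey galGal3, so leaving the field empty
# sends the full and canonical builds to the same directory.
full_default = index_subdir(base, "galGal3")
canon_default = index_subdir(base, "galGal3")
print(full_default == canon_default)  # True: the second build would overwrite the first

# Giving the variant its own ID keeps the two builds apart.
canon_named = index_subdir(base, "galGal3", "galGal3canon")
print(canon_named)  # /data/bwa_index/galGal3canon
```

Under these assumptions, any two builds that share a dbkey need distinct "ID for sequence" values to get distinct directories and loc file entries.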

These BWA indexes and the rest of the indexes in that comment are now in the process of being published to CVMFS and once done (this may take a long time) will be available on usegalaxy.org (after a restart, I'll comment again when it's all ready).

natefoo commented 8 years ago

@jennaj The publishing is finished and Main has been restarted.

jennaj commented 8 years ago

Add hg38 MAF alignments. Request: https://biostar.usegalaxy.org/p/17690

massaali commented 7 years ago

Hello,

I saw that 7 weeks ago another user made this same request for a newer version of the sheep reference genome. You currently have OviAri1, which is 6 years old, and there are two newer versions (about to be three). Could we get a newer version? Sheep are an amazing agricultural species, important for meat, milk, and wool production, and more researchers should study them! I request the current version on NCBI/ENSEMBL, Ovis aries Oar_v4.0 from late 2015, for all tools: Bowtie and other mapping tools, plus ChIP-seq and RNA-seq tools.

Thank you for considering!

sayalih commented 7 years ago

Hi

I think the best way is to use your genome of interest: take the fasta file and upload it to Galaxy using FileZilla. There is then an option to align against your uploaded sequence instead of a built-in reference genome. Instructions: https://wiki.galaxyproject.org/Support#Custom_reference_genome

I don't think they are adding any more reference genomes to their default list.

Sayali.


Update by @jennaj: Yes, use a custom reference genome for now. I will add in sheep and other requests to the next list of updates https://github.com/galaxyproject/galaxy/issues/1470#issuecomment-208444904

vebaev commented 7 years ago

It would be great if you could include the tomato 2.40 genome from ftp://ftp.solgenomics.net/tomato_genome/ and pepper C.annuum_cvCM334 from ftp://ftp.solgenomics.net/genomes/Capsicum_annuum/C.annuum_cvCM334/

@jennaj Yes, they are in NCBI (tomato and pepper): https://www.ncbi.nlm.nih.gov/genome/7 https://www.ncbi.nlm.nih.gov/genome/10896

jennaj commented 7 years ago

Priority indexes

RNA STAR



Indexes


Future requests (may be moved to a new post in this same issue)

iraplee commented 7 years ago

We're looking for an X. tropicalis index to be added for HISAT2

xenTro1 xenTro1 Frog (Xenopus tropicalis): xenTro1 /galaxy/data/xenTro1/seq/xenTro1.fa
xenTro2 xenTro2 Frog (Xenopus tropicalis): xenTro2 /galaxy/data/xenTro2/seq/xenTro2.fa
xenTro3 xenTro3 Frog (Xenopus tropicalis): xenTro3 /galaxy/data/xenTro3/seq/xenTro3.fa

bimbam23 commented 7 years ago

New Pig genome: Sus Scrofa 11.1, susscr4 NCBI GCF_000003025.6 all: (chr and chrUn plus chrMT)

genome: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.fna.gz

gff3: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.gff.gz

lookup table nice names: https://test.galaxyproject.org/u/bickj/h/pig-genome-lookup-table

PseudomonasP commented 6 years ago

Dear Galaxy Team, I hope this is still the right place to request genome additions.

If we could get Brassica napus (Bna) as a built-in genome, that would be amazing: http://www.genoscope.cns.fr/brassicanapus/data/

Please note that although the annotation is titled v5 while the genome itself is v4.1, it should work just fine, as we have had no problems with it.

jennaj commented 6 years ago

Add NCBI's Xenopus laevis and Xenopus tropicalis genomes (indexed for all tools).

The genome is at https://usegalaxy.eu -- so when we get the data synced between all mirrors that might be the best solution.

Request: https://biostar.usegalaxy.org/p/27778

jennaj commented 5 years ago

Request: add Medicago truncatula https://biostar.usegalaxy.org/p/5916/#30132

To-do: Check if present in the ELIXIR plant genomes already indexed (to be added in CVMFS): https://www.elixir-europe.org/about/groups/galaxy-wg

jennaj commented 5 years ago

Request: add https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Canis_lupus_dingo/100/ re: https://help.galaxyproject.org/t/dingo-reference-genome-upload-request/529

To-do: Check if UCSC has processed the genome

JulienLeclercq commented 5 years ago

Dear Galaxy Team,

Thanks for your amazing work. Please kindly consider adding the following genome to Galaxy Main: Mexican tetra (Astyanax mexicanus). The genome is available at NCBI: https://www.ncbi.nlm.nih.gov/genome/?term=astyanax+mexicanus and so is the annotation: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Astyanax_mexicanus/102/

Please note that the genome is version 2.0 and made from the surface eco-morphotype (unlike the previous version 1.02 from cave eco-morphotype).

In the meantime, I am working with a custom genome.

Best, Julien

hexylena commented 4 years ago

Migrate to usegalaxy-playbook?

jennaj commented 4 years ago

Request:

Genome: Citrus sinensis v1.1

Source: https://www.citrusgenomedb.org/bio_data/79

jennaj commented 4 years ago

Request:

Human herpesvirus 1 with ref accession number NC_001806

jennaj commented 4 years ago

Request:

Genome: Tribolium castaneum genome assembly (Tcas5.2)

Source: https://www.ncbi.nlm.nih.gov/genome?term=tribolium%20castaneum

jennaj commented 4 years ago

Request:

Dada tools: https://github.com/galaxyproject/usegalaxy-playbook/issues/273

psyi commented 4 years ago

Dear Galaxy Team,

It would be great if the genome and annotation release of Physcomitrella patens can be added to Galaxy Main.

They are available at NCBI: https://www.ncbi.nlm.nih.gov/genome/383 https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Physcomitrella_patens/100

Best, Peishan

jennaj commented 4 years ago

@echoyps & everyone else with genome requests:

We will be adding new genomes over the next few months and throughout the upcoming year. Please continue to post requests here. UCSC and NCBI are the preferred data sources, but others are possible. Whatever the source, be specific in your request.

Reminders:

Anyone can use genome (or transcriptome/exome) fasta data as a custom genome "from the history" now -- you do not need to wait for us to index server-side.

Annotation is supplied by the end-user from the history by default (even for built-in indexed genomes) -- with just a few tool exceptions, but those also accept annotation data from the history.

Genomes (fasta) are the data that is currently indexed server-side. Annotation may be indexed in the future.

Custom genomes (fasta) can be promoted to a custom build (User > Custom Builds) in order to create a custom "database" metadata key that can be assigned to datasets (some tools wrapped for Galaxy require that the "database" is assigned to inputs).

Be sure to format the genome fasta correctly (remove description content on the ">" title line) and make sure the genome build/version and chromosome identifiers are an exact match between the custom reference genome (fasta) and any reference annotation (gtf or gff3) you plan to use. Do this before starting any analysis that uses the fasta or promoting it to a custom build. This will avoid problems later on. If there is a formatting problem (example: headers on a gtf dataset) or a chromosome mismatch between inputs, you will usually need to fix the format and start the analysis over from the very start, which can be frustrating. If you have a choice about annotation formats, choose the gtf version instead of the gff3 version -- a gtf formatted annotation dataset is accepted by more tools, and using the same exact annotation data throughout an analysis workflow is very important.
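As a rough sketch of the title-line cleanup mentioned above, the following trims each ">" line down to its identifier. The function name and the sample records are made up for illustration, and "identifier" is assumed here to mean everything before the first whitespace; this can also be done with tools inside Galaxy.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each ">" title line cut down to the identifier
    (everything before the first whitespace). Sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "") + "\n"
        else:
            yield line


# Made-up records, for illustration only.
example = [
    ">chr1 example description text\n",
    "ACGTACGT\n",
    ">chrM another description, with punctuation\n",
    "GATTACA\n",
]
cleaned = list(strip_fasta_descriptions(example))
print("".join(cleaned), end="")
```

For a real file, the same generator can be fed an open file handle and its output written to a new file, so the whole genome never has to sit in memory.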

Mapping jobs will usually not "fail" due to chromosome identifier mismatch issues. Instead, if the annotation is input during the mapping step, the annotation will not really be used, creating problematic scientific results that may not be obvious to detect. Tools used downstream with a mismatched genome+annotation can also produce problematic scientific results that are not obvious, or may fail outright with errors that are difficult to interpret. Problematic annotation formatting itself will also lead to problems. Try to avoid issues by preparing your inputs correctly at the start :)
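Because a mismatch like this tends to pass silently, it can help to compare identifiers before running anything. Below is a minimal sketch of such a check, assuming a tab-separated gtf/gff3 whose first column is the sequence name and whose header lines start with "#". The function names and sample data are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence identifiers (first word) from FASTA ">" title lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def annotation_seqnames(lines):
    """Collect the first column of each gtf/gff3 record,
    skipping blank lines and "#" comment/header lines."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


# Made-up inputs: the annotation uses "1" where the genome uses "chr1".
fasta = [">chr1\n", "ACGT\n", ">chr2\n", "TTGA\n"]
gtf = [
    "#!genome-build example\n",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chr2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
]

missing = sorted(annotation_seqnames(gtf) - fasta_ids(fasta))
print(missing)  # ['1']
```

An empty result only shows that every annotation sequence name exists in the genome; it does not confirm that the two come from the same build.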

Finally, when loading these data with the Upload tool, allow the datatype to be detected instead of assigning it. This triggers basic format checks and a Galaxy-assigned datatype. If you do not get the expected datatype assigned, this almost always means that there is a formatting issue that needs to be addressed. Most format issues can be resolved within Galaxy. Once fixed, the correct datatype can be assigned: click on the dataset's pencil icon > Edit Attributes forms > Datatypes tab > "detect datatype" (best choice), or assign it directly (be careful if choosing this option). If Galaxy cannot "detect" the format correctly, there is likely still a data content or format problem.

If you ever have a problem that you cannot figure out how to resolve, know that the vast majority of tool errors or unexpected results are due to input issues that can be fixed to achieve a successful and correct scientific/technical result. First, review the tool form help -- most have examples of the expected input's content and format. Next, review our Troubleshooting and other FAQs. If those do not resolve the issue, the Galaxy Help forum is a great place to review prior Q&A or to ask a novel question. The Galaxy Training Network (GTN) tutorials are also a very useful resource -- compare your methods to the examples.

I'm only posting this advice here now since it hasn't been covered for a while on GitHub, and there are newer related FAQs plus prior Q&A available. Any followup/clarification should be asked about at Galaxy Help (not here).

The FAQs/links below will help with all of the above.

All FAQs: https://galaxyproject.org/support/

Start with these to learn how to use a custom genome and the associated annotation:

Error or unexpected result FAQ:

Galaxy Help forum:

GTN Tutorials:

Thanks! Jen

dram26 commented 3 years ago

Hi!

Could you kindly add the Macaca fascicularis genome for BWA? It's now at NCBI: https://www.ncbi.nlm.nih.gov/genome/776

[There is a 2015 request for the same here: https://trello.com/c/mJWnAuuQ/1511-reference-genome-requests-for-http-usegalaxyorg (by AmyK, Feb 4, 2015 at 6:47 PM), and another in 2016, but I guess the source wasn't validated then?]

Best! David