galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Get data from Biomart (ensemble) does not work. #5302

Closed rhpvorderman closed 6 years ago

rhpvorderman commented 6 years ago

When downloading from the Human genes (GRCh38.p10) dataset, the following error occurs:

The uploaded file contains inappropriate HTML content.

This happens on the Dutch public Galaxy server (http://galaxy.nbic.nl) and also on usegalaxy.org.

Galaxy Tool ID: biomart
Galaxy Tool Version: 1.0.1
Tool Version:  
Tool Standard Output: stdout
Tool Standard Error: stderr
Tool Exit Code: 0
History Content API ID: bbd44e69cb8906b560c59f07b23d11dd
Job API ID: bbd44e69cb8906b5fbc9933451e77120
History API ID: 11f5813bcca40d2d
UUID: 2c1edee2-cd81-42ea-b987-2916b611695f

bernt-matthias commented 6 years ago

Same here on our private instance; I also checked on the Freiburg Galaxy server. Ping @bgruening

I remember that I used the tool successfully last October for the GTN reference-based transcriptomics tutorial, to get a mapping from transcript IDs to GO. So either Ensembl changed something or some new feature in Galaxy broke the functionality.

I also often get "The page isn’t redirecting properly" from my browser when I start the tool again after the failure. I guess this is a problem with session management on the Ensembl website.

Any idea who could be asked?

jennaj commented 6 years ago

ping @davebx @natefoo

I can reproduce this at Galaxy Main https://usegalaxy.org. The issue might be at least a year old, perhaps introduced when the URL was switched (see https://github.com/galaxyproject/galaxy/issues/2148).

The "Send to Galaxy" option seems to be one part of the problem: it serves the content wrapped inside an HTML page. The site also seems to have trouble serving large datasets in general (file download). I am testing the option for a compressed URL archive (via email) to see what content the result has and whether it can be used in the Upload tool directly, as a workaround for now.

Test history, each query described in the comments: https://usegalaxy.org/u/jen/h/test-history-biomart-send-to-galaxy

$ vi "mart_export (1).txt"
<html lang="en-gb">

<head>
  <title></title>

<!--[if lte IE 7]><link rel="stylesheet" type="text/css" media="all" href="/minified/16a1998e9bf5965483fbb910105fe878.ie7.css"/><![endif]--><!--[if gt IE 7]><link rel="stylesheet" type="text/css" media="all" href="/minified/16a1998e9bf5965483fbb910105fe878.css"/><![endif]--><!--[if !IE]><!--><link rel="stylesheet" type="text/css" media="all" href="/minified/16a1998e9bf5965483fbb910105fe878.css"/><!--<![endif]--><!--[if lte IE 7]><link rel="stylesheet" type="text/css" media="all" href="/minified/4e4f8f272a6f10933467c6dc9964d6a3.image.ie7.css"/><![endif]--><!--[if gt IE 7]><link rel="stylesheet" type="text/css" media="all" href="/minified/4e4f8f272a6f10933467c6dc9964d6a3.image.css"/><![endif]--><!--[if !IE]><!--><link rel="stylesheet" type="text/css" media="all" href="/minified/4e4f8f272a6f10933467c6dc9964d6a3.image.css"/><!--<![endif]--><!--[if lte IE 7]><link rel="stylesheet" type="text/css" media="all" href="/biomart/mview/martview.ie7.css"/><![endif]--><!--[if gt IE 7]><link rel="stylesheet" type="text/css" media="all" href="/biomart/mview/martview.css"/><![endif]--><!--[if !IE]><!--><link rel="stylesheet" type="text/css" media="all" href="/biomart/mview/martview.css"/><!--<![endif]--><!--[if lte IE 7]><link rel="stylesheet" type="text/css" media="all" href="/martview-hacks.ie7.css"/><![endif]--><!--[if gt IE 7]><link rel="stylesheet" type="text/css" media="all" href="/martview-hacks.css"/><![endif]--><!--[if !IE]><!--><link rel="stylesheet" type="text/css" media="all" href="/martview-hacks.css"/><!--<![endif]-->
 <script type="text/javascript" src="/biomart/mview/js/martview.js"></script>

  <link type="image/png" href="/i/ensembl-favicon.png" rel="icon" />
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" type="image/png" />
  <link type="application/opensearchdescription+xml" title="Ensembl (All)" href="//uswest.ensembl.org/opensearch/all.xml" rel="search" />
  <link rel="alternate" href="/common/rss.xml" title="Ensembl website news feed" type="application/rss+xml" />

<meta name="viewport" content="target-densitydpi=device-dpi, width=device-width, initial-scale=1.0, maximum-scale=2.0, user-scalable=yes" />
<script>var prefetch = ['/minified/7564e6faf5b7e6fb272f11f7b10e5b1f.jpg','/minified/d0273afd2dc00ba835564c91094f59d0.png','/minified/0bcac77a199e8890163ba2146e4e6694.png','/minified/b93fc099787382c264b2ac1433751bc3.png','/minified/44c677237386fc83d8299fa660a0a7f3.png','/minified/e1193cce40877555f2a84453b7c52cbe.png','/minified/c1eccc028785d6ef8b6afabb1c1676de.png'];</script>
</head>
<body class="mac no_tabs static" id="ensembl-webpage">
  <div id="min_width_container">
    <div id="min_width_holder">
      <div id="masthead" class="js_panel">
        <input type="hidden" class="panel_type" value="Masthead" />
        <div class="logo_holder"><a href="/"><div class="logo-header print_hide" title="Ensembl mirror">&nbsp;</div></a><span class="mobile-only species-header">Ensembl</span><img src="/i/e-ensembl_print.gif" alt="Ensembl mirror" title="Ensembl mirror" class="screen_hide_inline" style="width:170px;height:45px" /></div>
        <div class="mh print_hide">
          <div class="account_holder"><div class="_account_holder"><div class="account-loading">Loading&hellip;</div><form action="/Ajax/accounts_dropdown"></form></div></div>
          <div class="tools_holder"><ul class="tools"><li><a class="constant" href="/Multi/Tools/Blast?db=core">BLAST/BLAT</a></li><li><a class="constant" href="/biomart/martview">BioMart</a></li><li><a class="constant" href="/info/docs/tools/index.html">Tools</a></li><li><a class="constant" href="/downloads.html">Downloads</a></li><li><a class="constant" href="/info/">Help &amp; Documentation</a></li><li class="last"><a class="constant" href="http://www.ensembl.info">Blog</a></li></ul><div class="more"><a href="#">More <span class="arrow">&#9660;</span></a></div></div>
          <div class="search_holder print_hide">
    <div id="searchPanel" class="js_panel">
      <input type="hidden" class="panel_type" value="SearchBox" />
      <form action="/Multi/Psychic">
        <div class="search print_hide">
          <div class="sites button">
            <img class="search_image no-sprite" src="/i/search/ensembl.gif" alt="" />
            <img src="/i/search/down.gif" style="width:7px" alt="" />
            <input type="hidden" name="site" value="ensembl_all" />
          </div>
          <div>
            <label class="hidden" for="se_q">Search terms</label>
            <input class="query inactive" id="se_q" type="text" name="q" value="Search all species&hellip;" data-role="none" onkeydown="if (event.keyCode == 13) { $(this).closest('form').submit(); return false; }" />
          </div>
          <div class="button"><img src="/i/16/search.png" alt="Search&nbsp;&raquo;" onClick="$(this).closest('form').submit()" /></div>
        </div>
        <div class="site_menu hidden">
          <div class="ensembl_all"><img class="no-sprite" src="/i/search/ensembl.gif" alt="Search all species"/>Search all species<input type="hidden" value="Search all species&hellip;" /></div>
<div class="ensembl_genomes"><img class="no-sprite" src="/i/search/ensembl_genomes.gif" alt="Search Ensembl genomes"/>Search Ensembl genomes<input type="hidden" value="Search Ensembl genomes&hellip;" /></div>
<div class="vega"><img class="no-sprite" src="/i/search/vega.gif" alt="Search Vega"/>Search Vega<input type="hidden" value="Search Vega&hellip;" /></div>
<div class="ebi"><img class="no-sprite" src="/i/search/ebi.gif" alt="Search EMBL-EBI"/>Search EMBL-EBI<input type="hidden" value="Search EMBL-EBI&hellip;" /></div>
<div class="sanger"><img class="no-sprite" src="/i/search/sanger.gif" alt="Search Sanger"/>Search Sanger<input type="hidden" value="Search Sanger&hellip;" /></div>

        </div>
      </form>
    </div>
  </div>
        </div>
      </div>
      <div id="main_holder">
        <div id="main">
  fasta____fasta;FASTA____fasta____fasta;FASTA____<pre class="mart_results">>ENSMUSG00000020333|ENSMUST00000000145
AGAACGTTGCGGGGCGGGCGGCCCAGCCCCTCCCCCAGTCGGGCTCGGCAGTTCGGATGC
CGCTGTCTCTTTGCCCAGGAGTCCCGGCGCGCTGCGGGGCTGGGAGTCGGGTTCCGTGAG
GAGCGCGCGCTGCGCCCTCCCCCTCCCGCCGGGTCTCCGCAGCGGCGCGGGGAGGCGGGG
GCTAAAAATACCCGGCGGCGGCGGCAGCGGCGGTGGCTCTGGGGCTGCGGGGCTGCGGGC

and

$ tail "mart_export (1).txt"
    <div id="modal_default" class="modal_content js_panel fixed_width" style="display:none"></div>
    <div class="modal_overlay"><img class="overlay_close" title="Cancel" alt="close" src="/i/close.png" /><div class="overlay_content"></div></div>
    <div class="modal_overlay_bg"></div>
  </div>

  <script type="text/javascript" src="/minified/73d095f890280977fe9f1483652610a2.js"></script>
<script type="text/javascript" src="/minified/b8ceeecb7ba4c63160d3cd0b0fe965e4.js"></script>
<script type="text/javascript">addLoadEvent(setVisibleStatus)</script>
</body>
</html>

Also reported in a few bug reports (from Galaxy Main) and at Galaxy Biostars: https://biostar.usegalaxy.org/p/26792/#26800
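
For anyone who wants to check a saved `mart_export` file for this failure mode before uploading it, here is a minimal sketch. It assumes the data sits inside a `<pre class="mart_results">` element, as in the dump above. The names `looks_like_html` and `extract_mart_results` are hypothetical helpers, not part of Galaxy or BioMart. The excerpt does not show whether the page closes the element with `</pre>`, so the pattern also accepts end of input.

```python
import html
import re
from typing import Optional

# Assumption: results are wrapped in <pre class="mart_results">, as seen in
# the dump. A closing </pre> is not visible there, so end of input also ends
# the match.
MART_RESULTS_RE = re.compile(
    r'<pre class="mart_results">(.*?)(?:</pre>|\Z)', re.DOTALL
)


def looks_like_html(text: str) -> bool:
    """Heuristic: does the payload start like an HTML page rather than data?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<html") or head.startswith("<!doctype html")


def extract_mart_results(text: str) -> Optional[str]:
    """Return the text inside the mart_results element, or None if absent."""
    match = MART_RESULTS_RE.search(text)
    if match is None:
        return None
    # Decode entities such as &gt; in case the page escaped the FASTA headers.
    return html.unescape(match.group(1)).strip()
```

If `looks_like_html` is true and `extract_mart_results` returns something, the sequence data is probably recoverable. A truncated download would still be truncated, so this does not work around the size limits.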

drhoads commented 6 years ago

I was the one who started this case file. I was able to download the data I needed directly from the European Biomart but never from the USEast mirror. I opened a case with Ensembl, and their response was that Biomart was never supposed to be able to deliver a complete download of all the transcripts for a genome (like chicken or human). They want you to download from their FTP site if you want all the cDNA sequences. IMHO Galaxy just needs to add text to the help for the Get Data call saying that you can download smaller datasets (e.g. a table of transcript IDs) but not the full set of FASTA sequences. Here is the ENSEMBL response on my case:

From: Benjamin Moore via RT [mailto:helpdesk@ensembl.org]
Sent: Tuesday, February 13, 2018 3:53 AM
To: Douglas Duane Rhoads drhoads@uark.edu
Subject: [Ensembl #244017] Network error - useast.ensembl.org

Hi Douglas,

Thank you for contacting the Ensembl Helpdesk.

You are encountering this error because the dataset you are trying to retrieve is too large for BioMart. BioMart is designed to retrieve data for small/medium sized datasets. For example, for sequence data, we typically advise around 500 genes for query.

To retrieve the cDNA sequence of all chicken genes, you can download the file directly from our FTP site. There is information about all of the available files in the following documentation page: http://www.ensembl.org/info/data/ftp/index.html

You can use the table on this page to search for chicken and click 'cDNA FASTA' to jump to the FTP directory containing the FASTA file with the cDNA sequences of all chicken genes. This is the file you will need to download: ftp://ftp.ensembl.org/pub/release-91/fasta/gallus_gallus/cdna//Gallus_gallus.Gallus_gallus-5.0.cdna.all.fa.gz

You can read more about the available files in the README: ftp://ftp.ensembl.org/pub/release-91/fasta/gallus_gallus/cdna//README

I hope this helps you retrieve the data you need but please do get back in touch if you have any further questions.

Best wishes

Ben
Ensembl Helpdesk
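
One way to stay within the roughly 500 genes per query that the helpdesk advises for sequence data is to split a long ID list into batches and query each batch separately. This is only a sketch of the batching step. The helper name `batched_ids` is hypothetical, the default of 500 is taken from the helpdesk's wording rather than from any documented hard limit, and how each batch is actually submitted to BioMart is left out.

```python
from typing import Iterator, List, Sequence


def batched_ids(ids: Sequence[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        # The last slice may be shorter than batch_size.
        yield list(ids[start:start + batch_size])
```

For example, a list of 1,201 gene IDs would be split into batches of 500, 500 and 201, with the original order preserved.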

jennaj commented 6 years ago

@drhoads thanks for posting this back here!

Galaxy devs - the tool goes directly to the Biomart website where the help is not very clear about large file downloads/limits. The query is set and "Send to Galaxy" used.

Perhaps the tool should be redesigned. Adding help directly in the Galaxy UI would be tricky since the query is made directly at the Biomart web portal. We could add a section to our FAQ (https://galaxyproject.org/support/loading-data/), but that would not help users at tool execution, only after they hit a problem and look or ask for help.

Other Get Data tools' behavior for comparison:

Thoughts?

drhoads commented 6 years ago

I suggest that when users click on the Get Data > Biomart tool, a splash screen warns them that big file downloads can terminate and that for big data they should FTP from the Biomart site and then upload. There could still be an "execute" button that they click to proceed to Biomart and get data.

natefoo commented 6 years ago

We've decided to remove the tool from Galaxy Main. As @drhoads suggests, you can fetch data from the FTP site. You can also do this without the intermediate step by pasting the ftp:// URL to the data in Galaxy's URL paste upload option.
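
To make the URL paste route concrete, here is a rough sketch that builds an `ftp://` URL following the same pattern as the chicken cDNA link in the Ensembl helpdesk reply above, with slashes normalised to single ones. The function name `ensembl_cdna_url` is hypothetical. The assumption that other species and releases use the same directory and file naming is inferred from that one example, so check the FTP listing before relying on it.

```python
def ensembl_cdna_url(release: int, species: str, assembly: str) -> str:
    """Build a cDNA FASTA URL modelled on the single example in this thread.

    Assumption: the directory is the lowercase species name and the file
    name starts with the same name with its first letter capitalised.
    """
    directory = species.lower()
    file_prefix = species.capitalize()
    return (
        f"ftp://ftp.ensembl.org/pub/release-{release}/fasta/"
        f"{directory}/cdna/{file_prefix}.{assembly}.cdna.all.fa.gz"
    )


# Values taken from the helpdesk reply: release 91, chicken, Gallus_gallus-5.0.
print(ensembl_cdna_url(91, "gallus_gallus", "Gallus_gallus-5.0"))
```

The printed URL can then be pasted into Galaxy's URL paste upload option.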

bernt-matthias commented 6 years ago

Hrm. I really liked the (functional) tool. Some of the data (everything except sequences and annotations) seems to be difficult to get from the FTP site.

drhoads commented 6 years ago

I think the problem comes in when Biomart can’t deliver the bigger datasets. In the absence of warnings about dataset size, the tool chokes. We can still get data out of Biomart, but you have to download it locally and then upload it. For the bigger datasets you have to go to the ENSEMBL FTP site.

natefoo commented 6 years ago

I imagine that if anyone had time to update the tool so that it produces warnings or can fetch from FTP directly, and submitted it for inclusion in the IUC tools, we'd probably put it back on Main. Unfortunately, we don't have the resources to do this internally at the moment.

Alternatively, one could request that Ensembl provide a Galaxy-compatible endpoint.

drhoads commented 6 years ago

When I contacted ENSEMBL, they said that Biomart was never intended to deliver anything as big as all the cDNA FASTA data for an organism and directed me to the FTP site. Therefore, I don’t think ENSEMBL is amenable to that latter option.
