genomehubs / goat-cli

Command Line Interface for Genomes on a Tree.

error for unfound taxa when supplying a file to goat taxon search #23

Open charlottewright opened 1 year ago

charlottewright commented 1 year ago

Hello!

When running goat-cli taxon search with a file as input, I get an error for each query species that does not exist in GoaT. For species that do return a hit, the output is as expected. The error is:

[truncated HTML error page returned by the server, headed "The Wellcome Sanger Institute"]

The Sanger Institute Web service you requested is temporarily unavailable. We are working to diagnose the problem and restore the service as soon as possible. We apologise for any inconvenience caused.

Further information about this outage may be found on the Sanger status blog (http://wtsi-status.blogspot.co.uk/).

When running such species individually with the -t flag, I get just the header output with no hits, as expected, so it seems the error is related to using a file with many queries.

The exact command I run is: goat-cli taxon search -f $i --assembly --country-list --karyotype

Thanks! :)

Charlotte

rjchallis commented 1 year ago

This is related to the connection timing out after several minutes. Breaking the list up into smaller files should be an effective workaround for now, saving you from having to run them one at a time.

The current implementation stops searching if the connection breaks, to save the API from processing abandoned queries. To make it possible to fetch large results from the CLI, API and UI, I'll need to implement some way in genomehubs/genomehubs to save the results so they can be collected once the query has finished.
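
For anyone hitting this in the meantime, a minimal sketch of that workaround with standard coreutils might look like the following (the chunk size of 50 and the file names are arbitrary choices here; the goat-cli flags are taken from the command above):

```sh
# Split the taxon list (one name per line) into chunks of 50
split -l 50 taxon_list.txt chunk_

# Run the same search on each chunk; note the header line is repeated
# once per chunk, so filter duplicate headers afterwards if needed
for f in chunk_*; do
    goat-cli taxon search -f "$f" --assembly --country-list --karyotype >> results.tsv
done
```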

Euphrasiologist commented 1 year ago

@rjchallis, do you think this would be solved, at least in part, if the requests were made serially instead of concurrently?

That would also make the progress bar issue (#22) easier to solve.
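
At the user level, the serial version of this would look roughly like issuing one request per taxon, e.g. (a sketch only, using the -t flag mentioned above; taxon_list.txt is a hypothetical one-name-per-line input file):

```sh
# Query each taxon with its own request, one at a time
while read -r taxon; do
    goat-cli taxon search -t "$taxon" --assembly --country-list --karyotype
done < taxon_list.txt
```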

Euphrasiologist commented 4 months ago

I've been thinking about this one. We can make this much more efficient by combining multiple taxa into the same search. @rjchallis, is there any constraint on the number of taxa in URL requests apart from URL length (< 2048 chars)? Not sure how to do this yet, but we could check the number of taxa on the CLI/in a file and then make fewer requests with more taxa per request. See the sketch below.

I'll see how to implement this.
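
As a rough command-line illustration of the batching idea (this assumes -t would accept a comma-separated list of taxon names, which is an assumption here, and a batch size of 20 chosen only to stay well under the ~2048-character URL limit):

```sh
# Pack taxa into comma-separated batches of 20 names each
split -l 20 taxon_list.txt batch_

for f in batch_*; do
    # ASSUMPTION: -t accepts a comma-separated list of taxon names
    taxa=$(paste -sd, "$f")
    goat-cli taxon search -t "$taxa" --assembly --country-list --karyotype
done
```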