GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

make a central data proxy #288

Open rbuels opened 5 years ago

rbuels commented 5 years ago

Let's make a central REST API running at https://jbrow.se/proxy that all JBrowses can use to sidestep CORS incompatibilities and to access data that is only available over FTP.

Implementation Notes

Architecture

Version 1 (the stupid thing/MVP)

Version 2 (the scalable thing)

API draft

GET /v1/{url}

Returns data exactly as if the browser had requested that file from the given URL. The URL does not need any additional escaping: https://jbrow.se/proxy/v1/http://terrible.org/some/file/someplace.bw is valid.
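A minimal sketch of how both ends of this draft endpoint could handle the URL. The helper names `toProxyUrl` and `targetFromRequest` are hypothetical and not part of any existing JBrowse code; only the `https://jbrow.se/proxy/v1/` prefix comes from the draft above. Because the target is not escaped, its query string becomes the query string of the proxy request, so the server has to work from the raw path plus query rather than from a parsed path.

```javascript
// Base URL taken from the API draft; everything else here is illustrative.
const PROXY_V1_BASE = 'https://jbrow.se/proxy/v1/'

// Client side: prefix the target URL as-is, with no additional escaping.
function toProxyUrl(targetUrl) {
  return PROXY_V1_BASE + targetUrl
}

// Server side: recover the target from the raw request path and query string,
// e.g. '/proxy/v1/http://terrible.org/some/file/someplace.bw'.
function targetFromRequest(rawPathAndQuery) {
  const prefix = '/proxy/v1/'
  if (!rawPathAndQuery.startsWith(prefix)) {
    throw new Error('not a v1 proxy request')
  }
  const target = rawPathAndQuery.slice(prefix.length)
  // Assumption: only http, https and ftp targets are accepted.
  if (!/^(https?|ftp):\/\//.test(target)) {
    throw new Error('unsupported target URL: ' + target)
  }
  return target
}
```

One caveat with the unescaped form: a `#fragment` in the target never reaches the server, and some web servers or intermediaries collapse the `//` after the scheme unless configured not to.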

rbuels commented 5 years ago

Second part of this is: make error handlers in JBrowse HTTP request code emit a warning and retry the request through the proxy if there seem to be CORS problems.

We would like to warn the user about this, so that people can fix their server CORS configurations while still being able to get to the data in the meantime.
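A rough sketch of the warn-and-retry idea, under some assumptions: the function name `fetchWithProxyFallback`, its injectable `fetchImpl` and `warn` parameters, and the warning text are all hypothetical, and this is not JBrowse's actual request code. Browsers do not tell scripts why a cross-origin request failed; `fetch` rejects with a `TypeError` for CORS failures and for ordinary network failures alike, so "seems to be a CORS problem" can only be a heuristic.

```javascript
// Proxy prefix from the API draft above.
const FALLBACK_PROXY_BASE = 'https://jbrow.se/proxy/v1/'

// Try the URL directly; if fetch rejects with a TypeError (which is how
// browsers report CORS and network failures), warn and retry via the proxy.
async function fetchWithProxyFallback(
  url,
  init = {},
  fetchImpl = globalThis.fetch,
  warn = console.warn,
) {
  try {
    return await fetchImpl(url, init)
  } catch (error) {
    if (!(error instanceof TypeError)) {
      throw error // e.g. an AbortError: not something the proxy can fix
    }
    warn(
      `Direct request to ${url} failed, possibly because of the server's ` +
        'CORS configuration. Retrying through the central proxy.',
    )
    return fetchImpl(FALLBACK_PROXY_BASE + url, init)
  }
}
```

Passing `init` through unchanged means a `range` header on the original request is also sent to the proxy.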

rbuels commented 5 years ago

It might even be possible for the central proxy to send automated (but not obviously automated) emails to server administrators bugging them to fix CORS.

cmdcolin commented 5 years ago

https://daniel.haxx.se/blog/2010/12/20/byte-ranges-for-ftp/

rbuels commented 5 years ago

Let's initially try to do the hacked-up FTP range request scheme described in that link, and see how it goes.
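For orientation, here is a small sketch of one way an HTTP byte range could be mapped onto FTP, which has no range request of its own. It assumes the usual workaround: set the start offset with the FTP `REST` command, start the transfer with `RETR`, read only as many bytes as needed, then stop the transfer. The function name `ftpPlanForRange` and the shape of its return value are hypothetical, and no FTP connection is made here.

```javascript
// Translate a single-range header such as 'bytes=0-10' into the FTP commands
// to send and the number of bytes to read from the data connection.
function ftpPlanForRange(rangeHeader, path) {
  const match = /^bytes=(\d+)-(\d*)$/.exec(rangeHeader)
  if (!match) {
    throw new Error('unsupported range: ' + rangeHeader)
  }
  const start = Number(match[1])
  const end = match[2] === '' ? undefined : Number(match[2])
  if (end !== undefined && end < start) {
    throw new Error('range end precedes start')
  }
  return {
    // TYPE I selects binary mode; REST sets the offset for the next RETR.
    commands: ['TYPE I', `REST ${start}`, `RETR ${path}`],
    // HTTP ranges are inclusive at both ends; open-ended ranges read to EOF.
    bytesToRead: end === undefined ? Infinity : end - start + 1,
  }
}
```

After `bytesToRead` bytes have arrived, the proxy would have to abort or close the data connection, since FTP offers no way to state the end offset up front.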

rbuels commented 5 years ago

We can have a look at what UCSC does for FTP: https://github.com/ucscGenomeBrowser/kent/blob/master/src/htslib/knetfile.c

cmdcolin commented 4 years ago

I wanted to test a file without transferring it to a CORS haven, so I tried both the open cors-anywhere proxy and a self-hosted cors-anywhere instance. In some cases it works and in others it fails, and the reasons remain unknown. Before engineering a large system around this, I think it would help to understand where these failure cases arise.

<html>
  <script>
    (async () => {
      const res = await fetch('https://cors-anywhere.herokuapp.com/http://jbrowse.org/code/JBrowse-1.16.6/docs/tutorial/data_files/volvox-sorted.bam', {headers: {range: 'bytes=0-10'}})
      // Read the 11 requested bytes (bytes 0-10) of the small volvox BAM
      const t = await res.arrayBuffer()
      console.log('volvox-arraybuffer',t)
      // Same ranged request for a large BAM hosted at NCBI, again via cors-anywhere
      const res2 = await fetch('https://cors-anywhere.herokuapp.com/https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/alignment/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam',{headers: {range: 'bytes=0-10'}})
      console.log('pacbio-ncbi-corsanywhere request',res2)
      const t2 = await res2.arrayBuffer()

      console.log('pacbio-ncbi-corsanywhere arraybuffer',t2)

    })()
</script><body>Hello</body></html>

Resulting output

volvox-arraybuffer ArrayBuffer(11) {}  <-- works for the volvox file

HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam:1 GET https://cors-anywhere.herokuapp.com/https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/alignment/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam net::ERR_CONTENT_DECODING_FAILED

test.html:16 Uncaught (in promise) TypeError: Failed to fetch
async function (async)
(anonymous) @ test.html:4
(anonymous) @ test.html:16

rbuels commented 4 years ago

I bet the CONTENT_DECODING_FAILED thing is caused by something in cors-anywhere incorrectly detecting the bam file as gzipped and trying to decompress it somewhere along the way.
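To illustrate why that kind of misdetection is plausible: BAM files are BGZF-compressed, and BGZF blocks are valid gzip members, so every BAM file starts with the gzip magic bytes `0x1f 0x8b`. Any layer that sniffs content and then labels an 11-byte slice with `Content-Encoding: gzip` would leave the browser trying to inflate a truncated stream, which Chrome reports as `ERR_CONTENT_DECODING_FAILED`. The helper below is a hypothetical diagnostic, not code from cors-anywhere, and it does not establish where in the chain the mislabeling happens.

```javascript
// True when the buffer starts with the two gzip magic bytes. A BAM (BGZF)
// file passes this check even though it should be delivered unmodified.
function looksLikeGzip(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b
}
```

In the test page above, this could be applied as `looksLikeGzip(new Uint8Array(t))` on the volvox bytes, alongside logging `res.headers.get('content-encoding')` for both responses to see whether the header differs between the working and failing cases.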

cmdcolin commented 4 years ago

These CORS issues will be worse when JBrowse is hosted on HTTPS, because it then cannot access HTTP resources, which many track hubs use.
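The HTTPS constraint can be expressed as a simple check. This is a simplified sketch with a hypothetical function name; real browsers apply more detailed mixed-content rules (for example, exemptions for localhost), so treat it as an approximation.

```javascript
// Approximate the browser's mixed-content rule for fetch/XHR: a page loaded
// over https cannot request a plain http resource.
function isBlockedAsMixedContent(pageUrl, resourceUrl) {
  return (
    new URL(pageUrl).protocol === 'https:' &&
    new URL(resourceUrl).protocol === 'http:'
  )
}
```

An HTTPS proxy sidesteps this, because the browser only ever talks to the proxy over HTTPS and the proxy makes the HTTP request on the server side.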

cmdcolin commented 3 years ago

Random tidbit

For some files cors-anywhere works fine

E.g.

CORS is disabled on this S3 bucket, so this fails: https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/alignments/chm13.draft_v1.0.hifi.bam

Adding the cors-anywhere proxy makes it work: https://cors-anywhere.herokuapp.com/https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/alignments/chm13.draft_v1.0.hifi.bam

cmdcolin commented 3 years ago

Of course, throwing a lot of big data through that is somewhat abusive to their free service, but cors-anywhere is open source and can be rehosted.

cmdcolin commented 3 years ago

The CORS proxy appears to be failing for something where it worked before, namely the UCSC API.

This URL currently produces an application error:

https://cors-anywhere.herokuapp.com/http://api.genome.ucsc.edu//getData/track?genome=hg19;track=geneHancerInteractionsDoubleElite;chrom=chr1;start=39482020;end=39484868

[Screenshot from 2021-01-14 08-58-54]

scottcain commented 10 months ago

I was hoping to test out cors-anywhere using the same host (https://cors-anywhere.herokuapp.com/). It sort of works, and I have more information. The first item is that the developer running this server has set it up such that it only unlocks temporarily, after you visit the site and push a button to request access (basically, to allow testing as a dev). I did that and then tried to fetch some NCList data:

https://cors-anywhere.herokuapp.com/http://jbrowse.informatics.jax.org/data/mouse/tracks/MGI_Genome_Features/{refseq}/trackData.json

But it fails. I tested fetching this with curl after adding the required headers and supplying an MGI chromosome name (Alliance uses "1" whereas MGI uses "chr1"):

curl -H 'X-Requested-With: XMLHttpRequest' -O https://cors-anywhere.herokuapp.com/http://jbrowse.informatics.jax.org/data/mouse/tracks/MGI_Genome_Features/chr1/trackData.json

which is successful. I double-checked that there is an alias file for mouse chromosomes, but it still fails. If I hard-code the "chr" into the URL, though, it works:

"uri": "https://cors-anywhere.herokuapp.com/http://jbrowse.informatics.jax.org/data/mouse/tracks/MGI_Genome_Features/chr{refseq}/trackData.json"

The reason for having to do this probably has to do with JBrowse retrying but getting rejected by the proxy. Also, since this uses the public proxy, it won't work in general, but it does work on my temporarily whitelisted computer.
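To make the difference between the two templates concrete, here is a small sketch of the `{refseq}` substitution. The `fillRefSeq` helper is hypothetical and stands in for whatever JBrowse does internally; the URLs are the ones quoted above.

```javascript
// Replace every {refseq} placeholder in a URI template with a reference name.
function fillRefSeq(uriTemplate, refName) {
  return uriTemplate.replace(/\{refseq\}/g, refName)
}

const base =
  'https://cors-anywhere.herokuapp.com/http://jbrowse.informatics.jax.org/data/mouse/tracks/MGI_Genome_Features/'

// With the Alliance name "1", only the hard-coded "chr" template produces
// the path that MGI actually serves (.../chr1/trackData.json).
const plainUrl = fillRefSeq(base + '{refseq}/trackData.json', '1')
const chrUrl = fillRefSeq(base + 'chr{refseq}/trackData.json', '1')
```

With the plain template, the first request goes to `.../1/trackData.json`, so success depends on the alias-based retry getting through the proxy; the hard-coded template gets the right path on the first attempt.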

This, at least, seems like enough to try getting a server of my own going that uses HTTPS. Note that the MGI JBrowse instance uses HTTP, so even if they enabled CORS, I wouldn't be able to use it, because it would throw a security error (since we use HTTPS).