databio / bedhost

API and UI for BEDbase
http://api.bedbase.org
BSD 2-Clause "Simplified" License

Query bigbed #48

Closed xuebingjie1990 closed 3 years ago

xuebingjie1990 commented 3 years ago

Region-based queries: Returns the queried regions with provided ID and optional query parameters (chr, start, end)

Right now the query parameters are required. I'm still working on the cases where they are missing:

  1. when start and end are not provided, it should return all regions on the provided chr
  2. when no query parameters are provided, it should return all regions

current output:

[["chr1",999812,1001072],["chr1",999812,1001072],["chr1",1001535,1002148],["chr1",1001535,1002148],["chr1",1004516,1005152],["chr1",1005091,1005361],["chr1",1013101,1013583],["chr1",1020981,1021617],["chr1",1031117,1031753],["chr1",1031332,1031968],["chr1",1031944,1032580],["chr1",1033192,1033587],["chr1",1034085,1034350],["chr1",1034156,1034792],["chr1",1038683,1038985],["chr1",1050455,1051091],["chr1",1058927,1059563],["chr1",1059665,1059871],["chr1",1063983,1064725],["chr1",1063983,1064725],["chr1",1065676,1066312],["chr1",1069717,1069887],["chr1",1079496,1079910],["chr1",1079863,1080026],["chr1",1115849,1116785],["chr1",1115849,1116785],["chr1",1122235,1122448],["chr1",1137406,1137609],["chr1",1144077,1144713],["chr1",1157845,1158400],["chr1",1169544,1170180],["chr1",1171602,1172227],["chr1",1171602,1172169],["chr1",1201276,1201815],["chr1",1206279,1206915],["chr1",1209052,1209225],["chr1",1213057,1213297],["chr1",1213119,1213755],["chr1",1215076,1215530],["chr1",1215076,1215530],["chr1",1231637,1232725],["chr1",1249825,1250461],["chr1",1250835,1251471],["chr1",1251043,1251679],["chr1",1273436,1274045],["chr1",1273983,1274238],["chr1",1305108,1305744],["chr1",1307800,1308245],["chr1",1308324,1308492],["chr1",1308574,1308977],["chr1",1324236,1324872],["chr1",1348639,1348952],["chr1",1349496,1349998],["chr1",1371938,1372308],["chr1",1374907,1375687],["chr1",1374907,1375687],["chr1",1398998,1399625],["chr1",1398998,1399625],["chr1",1406974,1407769],["chr1",1406974,1407769],["chr1",1417759,1418395],["chr1",1422313,1422924],["chr1",1422313,1422860],["chr1",1435552,1435965],["chr1",1437979,1438306],["chr1",1440791,1441427],["chr1",1462446,1463082],["chr1",1471424,1472087],["chr1",1471424,1472087],["chr1",1505080,1505362],["chr1",1505340,1505575],["chr1",1511660,1512455],["chr1",1511660,1512455],["chr1",1574374,1575135],["chr1",1574374,1575135],["chr1",1597208,1597430],["chr1",1599493,1600021],["chr1",1615229,1616382],["chr1",1615229,1616382],["chr1",1629097,1629361],[
"chr1",1629125,1629761],["chr1",1630309,1630762],["chr1",1630309,1630707],["chr1",1630566,1631202],["chr1",1658976,1659290],["chr1",1677608,1678398],["chr1",1677608,1678398],["chr1",1692658,1692842],["chr1",1724181,1724643],["chr1",1746336,1746588],["chr1",1778071,1779443],["chr1",1778071,1779443],["chr1",1778071,1779443],["chr1",1780219,1780686],["chr1",1782767,1782957],["chr1",1799553,1800189],["chr1",1869265,1869901],["chr1",1889919,1890593],["chr1",1889919,1890593],["chr1",1890771,1891366],["chr1",1907331,1907967],["chr1",1908715,1909217],["chr1",1918799,1919435]]
nsheff commented 3 years ago

Right now the query parameters are required. I'm still working on the cases where they are missing:

  1. when start and end are not provided, it should return all regions on the provided chr
  2. when no query parameters are provided, it should return all regions

I think it's safe to just require the query parameters -- so, skip your point 2... if they want the whole file they can hit the whole file and get it on their own. This is explicitly for subsets.

But I think the chrom-only approach is probably reasonable and not too hard to implement.

nsheff commented 3 years ago

For the output, I think it would be more convenient to put it in tabular form.

This:

chr1 999812 1001072
chr1 999812 1001072

Not this:

[["chr1",999812,1001072],["chr1",999812,1001072]]

Or maybe make that parameterizable, with the tabular form as the default.

in fact, shouldn't it come out the above way straight out of the software? I think there's no need to load the results into python, just return them directly. it will save us memory if you can just stream those results.
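A minimal sketch of the two output shapes discussed above, with tab-separated text as the default and the JSON list available on request. The function name `format_regions` and the `as_json` flag are hypothetical, not part of bedhost:

```python
import json
from typing import Iterable, Sequence


def format_regions(regions: Iterable[Sequence], as_json: bool = False) -> str:
    """Render (chrom, start, end) rows as tab-separated text, or as JSON on request."""
    rows = [list(region) for region in regions]
    if as_json:
        # Compact separators reproduce the nested-list output shown earlier.
        return json.dumps(rows, separators=(",", ":"))
    # One region per line, fields joined by tabs (plain BED-style text).
    return "".join("\t".join(str(field) for field in row) + "\n" for row in rows)


regions = [["chr1", 999812, 1001072], ["chr1", 999812, 1001072]]
print(format_regions(regions), end="")
```

If the tool already emits tab-separated lines, this conversion is unnecessary and the text can be passed through unchanged, which is the point made above about not loading results into Python.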

nsheff commented 3 years ago

@xuebingjie1990 Can you provide an example of the API format as well?

xuebingjie1990 commented 3 years ago

@xuebingjie1990 Can you provide an example of the API format as well?

http://0.0.0.0:8000/api/bed/78c0e4753d04b238fc07e4ebe5a02984/regions/?chr=chr1&start=1000000&end=2000000

xuebingjie1990 commented 3 years ago

But I think the chrom-only approach is probably reasonable and not too hard to implement.

since what we have is the bigBed format, to get all the regions of a given chrom, I think we need the chrom.sizes file to get the end coordinates. we can upload it to s3, or is there a way to get the content of the chrom.sizes file from refgenie?

another way is, instead of generating the bigBed files, maybe we should generate bigWig instead.

nsheff commented 3 years ago

since what we have is the bigBed format, to get all the regions of a given chrom, I think we need the chrom.sizes file to get the end coordinates. we can upload it to s3, or is there a way to get the content of the chrom.sizes file from refgenie?

bigBedToBed can't just take chrom? If not, I'd just say don't bother then. That does surprise me, though.

another way is, instead of generating the bigBed files, maybe we should generate bigWig instead.

That's a different data type, so I don't see what you mean. I don't think this makes sense; bigWig files don't store interval data.

nsheff commented 3 years ago

@xuebingjie1990 Can you provide an example of the API format as well?

http://0.0.0.0:8000/api/bed/78c0e4753d04b238fc07e4ebe5a02984/regions/?chr=chr1&start=1000000&end=2000000

Is there a reason to use query params here? I'd suggest these should be path params.

xuebingjie1990 commented 3 years ago

Is there a reason to use query params here? I'd suggest these should be path params.

I'll change the chr to path param, and keep start and end as query params.
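A rough sketch of what a client-side URL could look like with chr as a path parameter and start/end as optional query parameters. The route shape `/regions/{chrom}` is an assumption for illustration; the thread does not state the final route, and `regions_url` is a hypothetical helper:

```python
from typing import Optional
from urllib.parse import urlencode


def regions_url(base: str, bed_id: str, chrom: str,
                start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Build a regions URL: chrom in the path, start/end as optional query params."""
    # Assumed route shape; the real bedhost route may differ.
    url = f"{base.rstrip('/')}/api/bed/{bed_id}/regions/{chrom}"
    query = {key: value for key, value in (("start", start), ("end", end))
             if value is not None}
    return f"{url}?{urlencode(query)}" if query else url


print(regions_url("http://0.0.0.0:8000", "78c0e4753d04b238fc07e4ebe5a02984",
                  "chr1", 1000000, 2000000))
```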

xuebingjie1990 commented 3 years ago

since what we have is the bigBed format, to get all the regions of a given chrom, I think we need the chrom.sizes file to get the end coordinates. we can upload it to s3, or is there a way to get the content of the chrom.sizes file from refgenie?

bigBedToBed can't just take chrom? If not, I'd just say don't bother then. That does surprise me, though.

another way is, instead of generating the bigBed files, maybe we should generate bigWig instead.

That's a different data type, so I don't see what you mean. I don't think this makes sense; bigWig files don't store interval data.

I was talking about pyBigWig. i'll switch to bigBedToBed

nsheff commented 3 years ago

I was talking about pyBigWig. i'll switch to bigBedToBed

???

xuebingjie1990 commented 3 years ago

I was talking about pyBigWig. i'll switch to bigBedToBed

???

When querying entries from bigBed files using pyBigWig, the chr, start, and end are all required.

Since you suggest using bigBedToBed, I'll use it instead of pyBigWig. But I don't know how I can stream the result of bigBedToBed, since it requires an output path to save the results to.

nsheff commented 3 years ago

Doesn't "-" or something like that stream to stdout?

nsheff commented 3 years ago

It's stdout. The UCSC tools generally let you use stdout as the output file name to print to stdout. So, it's:

bigBedToBed file.bb stdout

nsheff commented 3 years ago

since what we have is the bigBed format, to get all the regions of a given chrom, I think we need the chrom.sizes file to get the end coordinates. we can upload it to s3, or is there a way to get the content of the chrom.sizes file from refgenie?

bigBedToBed can't just take chrom? If not, I'd just say don't bother then. That does surprise me, though.

I just confirmed you can just provide chrom -- -chrom and -start and -end are all optional.

You can just use -chrom=chr1 and leave off start/end to get everything on 1 chromosome.

Maybe this argues for keeping these as query params as you had, since they are all optional. that's fine with me, I guess
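A minimal sketch of how the optional flags could be assembled into a command line. The helper `bigbedtobed_cmd` is hypothetical; only the `-chrom`, `-start`, `-end` flags and the `stdout` output name come from the discussion above. Rejecting start/end without chrom is a choice made for this sketch, not something the tool is known to require:

```python
from typing import List, Optional


def bigbedtobed_cmd(bigbed_path: str, chrom: Optional[str] = None,
                    start: Optional[int] = None,
                    end: Optional[int] = None) -> List[str]:
    """Build a bigBedToBed argument list, adding only the flags that were given."""
    if chrom is None and (start is not None or end is not None):
        # Sketch-level policy: a coordinate range only makes sense on a chromosome.
        raise ValueError("start/end require chrom")
    cmd = ["bigBedToBed"]
    if chrom is not None:
        cmd.append(f"-chrom={chrom}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    # "stdout" as the output name makes the UCSC tool write to standard output.
    cmd += [bigbed_path, "stdout"]
    return cmd


print(bigbedtobed_cmd("file.bb", chrom="chr1"))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems when the chromosome name comes from user input.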

nsheff commented 3 years ago

Here's how you could return the result:

https://fastapi.tiangolo.com/advanced/custom-response/#using-streamingresponse-with-file-like-objects

I think you will need to use something like asyncio to read the stdout of the subprocess and return the results asynchronously.

https://docs.python.org/3/library/asyncio-subprocess.html

I don't have direct experience doing this
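A rough sketch of the asyncio idea above: an async generator that yields a subprocess's stdout line by line, so the whole result is never held in memory. The names `stream_stdout` and `collect` are hypothetical, and a small Python one-liner stands in for bigBedToBed so the sketch runs without the UCSC tools installed:

```python
import asyncio
import sys
from typing import AsyncIterator, List


async def stream_stdout(cmd: List[str]) -> AsyncIterator[bytes]:
    """Yield the subprocess's stdout one line at a time."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    assert proc.stdout is not None
    async for line in proc.stdout:
        yield line
    # Reap the process once its output is exhausted.
    await proc.wait()


async def collect(cmd: List[str]) -> List[bytes]:
    """Gather all lines; only used here to demonstrate the generator."""
    return [line async for line in stream_stdout(cmd)]


# Stand-in command (assumption): replace with the real bigBedToBed argument list.
demo_cmd = [sys.executable, "-c", "print('chr1\\t999812\\t1001072')"]
lines = asyncio.run(collect(demo_cmd))
print(lines)
```

In a FastAPI route, a generator like `stream_stdout(...)` could presumably be handed to `StreamingResponse` with `media_type="text/plain"` instead of being collected into a list; that wiring is not shown here and has not been checked against bedhost.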

xuebingjie1990 commented 3 years ago

bigBedToBed file.bb stdout

I figured it out. It's working now, but the output has \t and \n in it. I'm trying to format that.

"chr1\t778543\t779076\t.\t1000\t.\t43.63674\t-1.0\t2.67082\t523\nchr1\t778543\t779076\t.\t1000\t.\t224.16478999999998\t-1.0\t4.66339\t319\nchr1\t804614\t805250\t.\t872\t.\t83.88604000000001\t-
......

nsheff commented 3 years ago

might want to also pipe it through | cut -f1-3 just to be clean.

I think for the formatting thing, might just be a header type issue; return as plain text or something
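A small sketch illustrating both points: trimming each line to its first three columns (a Python stand-in for `cut -f1-3`), and why the `\t` and `\n` show up literally when the text is JSON-encoded rather than returned as plain text. The helper name `cut_first_three` is hypothetical:

```python
import json


def cut_first_three(bed_text: str) -> str:
    """Keep only chrom, start, end from each tab-separated line."""
    kept = []
    for line in bed_text.splitlines():
        if line:
            kept.append("\t".join(line.split("\t")[:3]))
    return "\n".join(kept) + ("\n" if kept else "")


raw = "chr1\t778543\t779076\t.\t1000\t.\t43.63674\t-1.0\t2.67082\t523\n"
trimmed = cut_first_three(raw)

# JSON-encoding a string escapes tabs and newlines, which matches the
# quoted output with visible \t and \n shown above.
print(json.dumps(trimmed))
# Written out as plain text, the same string has real tabs and line breaks.
print(trimmed, end="")
```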

nsheff commented 3 years ago

Alright, if this is working, I'd say release it to the dev server so we can see it in action. This will also fix the redirect issue and let you test the new track hubs.

xuebingjie1990 commented 3 years ago

Alright, if this is working, I'd say release it to the dev server so we can see it in action. This will also fix the redirect issue and let you test the new track hubs.

Yes, it's working. I tested it again locally. I'll merge it to dev now.