databio / bedbase

Aggregate, analyze, and serve genomic regions.
http://bedbase.org/

Identifying extra columns in BED files #54

Closed jwokaty closed 1 month ago

jwokaty commented 9 months ago

Hi,

I'm creating an R client for api.bedbase.org at https://github.com/jwokaty/BEDbaseR. I want to import the BED files into GRanges objects; however, I noticed that the BED files have a varying number of extra columns. Is there any way for me to know from the API what these columns are?

Also, when I look at bed/example, I see:

{
  "genome": {
    "alias": "hg38",
    "digest": ""
  },
  "expected_partitions": {
    "path": "output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
    "title": "Expected distribution over genomic partitions",
    "thumbnail_path": "output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.png"
  },
  "gc_content": null,
  "fiveutr_frequency": 2925,
  "intron_percentage": 0.4246,
  "pipestat_modified_time": "2023-10-19T19:15:01.945492",
  "cumulative_partitions": {
    "path": "output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_cumulative_partitions.pdf",
    "title": "Cumulative distribution over genomic partitions",
    "thumbnail_path": "output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_cumulative_partitions.png"
  },
...

Are files such as output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf available somewhere? I tried https://api.bedbase.org/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf but get {"detail":"Not Found"}. This is more of a curiosity at this point as I am mostly interested in importing into a GRanges object as I am still trying to understand the API.

Thanks for your help.

nsheff commented 9 months ago

> BED files have a varying number of extra columns. Is there any way for me to know from the API what these columns are?

No, the API doesn't know that. Is this important? Do you suggest we change something here? Why are you interested in knowing the columns?

In reality, I suppose we may not even know the columns, depending on where the BED file came from, but right now we're not tracking that. We could work on that, though.

> Are files such as output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf available somewhere?

Yes, the files are served on a separate S3-compatible server. To find the URLs for them, use the DRS endpoints. That's described here: https://api.bedbase.org/docs/guide

Here's how to do it for this specific example. In the bed/example response you'll see the identifier for that BED record: "record_identifier": "421d2128e183424fcc6a74269bae7934"

You'll also see that it has an object called expected_partitions. Join the prefix bed, the record identifier, and the object name with dots to make an object identifier: bed.421d2128e183424fcc6a74269bae7934.expected_partitions

You can pass this to the DRS endpoints to get the object metadata:

https://api.bedbase.org/objects/bed.421d2128e183424fcc6a74269bae7934.expected_partitions

This has the URLs where you can get the object itself:

{
  "id": "bed.421d2128e183424fcc6a74269bae7934.expected_partitions",
  "name": null,
  "self_uri": "drs://api.bedbase.org/bed.421d2128e183424fcc6a74269bae7934.expected_partitions",
  "size": "unknown",
  "created_time": "2023-10-17T18:53:05.653831",
  "updated_time": "2023-10-19T19:15:01.945492",
  "checksums": "bed.421d2128e183424fcc6a74269bae7934.expected_partitions",
  "access_methods": [
    {
      "type": "http",
      "access_url": {
        "url": "https://data2.bedbase.org/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
        "headers": null
      },
      "access_id": "http",
      "region": null
    },
    {
      "type": "s3",
      "access_url": {
        "url": "s3://data2.bedbase.org/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
        "headers": null
      },
      "access_id": "s3",
      "region": null
    },
    {
      "type": "local",
      "access_url": {
        "url": "/static/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
        "headers": null
      },
      "access_id": "local",
      "region": null
    }
  ],
  "description": null
}
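For anyone writing a client, the steps above could be sketched in Python roughly as follows. This is a minimal sketch, not BEDbase's own client code: the helper names (`make_object_id`, `drs_object_url`, `pick_access_url`) are hypothetical, and `sample_object` is a trimmed copy of the response shown above so the example runs without network access.

```python
def make_object_id(record_identifier: str, object_name: str, prefix: str = "bed") -> str:
    """Join prefix, record identifier, and object name with dots."""
    return f"{prefix}.{record_identifier}.{object_name}"


def drs_object_url(object_id: str, api_base: str = "https://api.bedbase.org") -> str:
    """Build the URL of the DRS metadata endpoint for an object identifier."""
    return f"{api_base}/objects/{object_id}"


def pick_access_url(drs_object: dict, access_type: str = "http"):
    """Return the URL of the first access method of the given type, or None."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == access_type:
            return method["access_url"]["url"]
    return None


# Trimmed copy of the DRS response shown above (only the fields used here).
sample_object = {
    "id": "bed.421d2128e183424fcc6a74269bae7934.expected_partitions",
    "access_methods": [
        {
            "type": "http",
            "access_url": {
                "url": "https://data2.bedbase.org/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
                "headers": None,
            },
        },
        {
            "type": "s3",
            "access_url": {
                "url": "s3://data2.bedbase.org/output/bedstat_output/421d2128e183424fcc6a74269bae7934/GSM6856752_S1_H3K27ac_peaks_expected_partitions.pdf",
                "headers": None,
            },
        },
    ],
}

object_id = make_object_id("421d2128e183424fcc6a74269bae7934", "expected_partitions")
print(drs_object_url(object_id))       # metadata endpoint to request
print(pick_access_url(sample_object))  # direct HTTP link to the PDF
```

In a real client, the dictionary passed to `pick_access_url` would come from requesting the URL returned by `drs_object_url` and parsing the JSON body.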

You can also get these PDFs from the links on the splash page: https://dev.bedbase.org/bed/421d2128e183424fcc6a74269bae7934 (these point to the same files).

jwokaty commented 9 months ago

Thanks for the explanation. I am still trying to understand the API as I develop the client. If the column information was available, I wanted to provide that to the user. I am not proposing any changes at this point.
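In the meantime, a client can at least count the columns locally before importing. The Python sketch below is only an illustration under stated assumptions: the file is tab-delimited, header lines start with `#`, `track`, or `browser`, and the helper name `count_bed_columns` is hypothetical. It reports how many columns each data line has; it cannot say what the extra columns mean.

```python
from collections import Counter
from typing import Iterable

# BED requires three columns: chrom, chromStart, chromEnd.
REQUIRED_BED_COLUMNS = 3


def count_bed_columns(lines: Iterable[str]) -> Counter:
    """Count how many data lines have each number of tab-separated columns."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and header lines (assumed prefixes).
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        counts[len(line.split("\t"))] += 1
    return counts


# Made-up lines for illustration, not taken from a BEDbase file.
sample_lines = [
    "track name=example",
    "chr1\t100\t200\tpeak1\t5.2",
    "chr1\t300\t400\tpeak2\t7.9",
]

counts = count_bed_columns(sample_lines)
for n_columns, n_lines in sorted(counts.items()):
    beyond_required = n_columns - REQUIRED_BED_COLUMNS
    print(f"{n_lines} lines with {n_columns} columns ({beyond_required} beyond the required 3)")
```

If the counter ends up with more than one key, the file has an inconsistent column count and probably needs manual inspection before import.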

jwokaty commented 5 months ago

I wanted to follow up on identifying the types of BED files. I see that there's been some development on api-dev.bedbase.org. Should I be developing my client against your development version?

khoroshevskyi commented 5 months ago

Hi, yes. I rewrote the BEDbase API and divided the endpoints into statistics, classification, files, plots, and raw metadata. All of these will be developed further. All endpoints now have schemas, so the API should be easier to understand. I would also appreciate your feedback on the new API: what do you think should be added or changed?

khoroshevskyi commented 1 month ago

I think this issue is solved: we added information about BED format and type, which can be found in the metadata endpoint (e.g. https://api.bedbase.org/v1/bed/bbad85f21962bb8d972444f7f9a3a932/metadata/classification) and on our UI page.

If you think this issue is not solved, please reopen it.
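A client could build the classification URL from a record identifier along these lines. This Python sketch assumes the URL pattern generalizes from the single example above; the helper names `classification_url` and `fetch_json` are hypothetical, and the response fields are not assumed here, so check the endpoint's schema before relying on any key.

```python
import json
from urllib.request import urlopen

# Base path taken from the example URL above.
API_BASE = "https://api.bedbase.org/v1"


def classification_url(bed_id: str, api_base: str = API_BASE) -> str:
    """Build the metadata/classification URL for a BED record identifier."""
    return f"{api_base}/bed/{bed_id}/metadata/classification"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Request a URL and parse the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = classification_url("bbad85f21962bb8d972444f7f9a3a932")
print(url)
# With network access, the classification record could then be retrieved with:
#   classification = fetch_json(url)
```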

nsheff commented 1 month ago

@jwokaty let us know if there's anything else you need here.