databio / bbconf

Configuration package for bedbase project
https://pypi.org/project/bbconf/
BSD 2-Clause "Simplified" License
1 stars 2 forks source link

name and sample_name in the schemas #23

Closed nsheff closed 8 months ago

nsheff commented 9 months ago

It appears that the bedhost-ui was expecting an attribute called 'name'.

also, the bedfile_schema.yaml says there's an attribute called 'name'.

pipestat is ingesting this schema...but then, in the actual items retrieved from pipestat, there is no 'name', instead there is sample_name.

So, this is causing a lot of problems.

donaldcampbelljr commented 9 months ago

In the refactored version of pipestat (current dev branch), pipestat now uses record_identifier instead of sample_name or project_name.

I'm a little confused that name is in the output_schema. I had assumed it would be synonymous with record_identifier when reporting results via pipestat. What is the actual record_identifier for these bedfiles? Previously sample_name was a reserved keyword that was not allowed in the output_schema which is probably why name was used instead.

Do you have an example error?

donaldcampbelljr commented 9 months ago

Currently, in the database, there are two columns, name and sample_name. name should be the result_identifier and sample_name is the record identifier.

Next steps:

donaldcampbelljr commented 9 months ago

I looked into this with a local database where I reported results using bbconf and then attempted to use the bedhost API to retrieve results.

I am able to report and retrieve results using only bbconf just fine. However, I encounter issues when using the bedhost API.

When bedhost uses bbconf to retrieve using pipestat, many of the endpoints use md5sum as the record identifier. Unfortunately, md5sum is actually a result_identifier per the output_schema.

Example: In bed_api.py Endpoint @router.get("/bed/{md5sum}/metadata", response_model=DBResponse)

    try:
        values = bbc.bed.retrieve(md5sum, attr_ids)

This should actually be:

    try:
        values = bbc.bed.retrieve(record_identifier="bedfile_identifier", result_identifier=md5sum)

Of course, I believe we want the record_identifier=md5sum

Question: Do we want to remove the md5sum result_identifier and have the digest live in the record_identifier? Or should we put the digest in both columns?

nsheff commented 9 months ago

I recall some confusion over this when I was running things as well. But wait: I think you're missing something here. Your second example is not doing what the first example was intended to do, is it? If so, that's a really confusing API... How will you specify which record to use? It's not the string "bedfile_identifier"...

I think the point is that the record identifier should be the md5sum. At least, that's what makes sense to me.

If the record identifier isn't currently the md5sum... then what is the record identifier?

donaldcampbelljr commented 9 months ago

"bedfile_identifier" was just a generic example as to how pipestat.retrieve currently works.

The record_identifier/sample_name is whatever string is currently in sample_name column within the production database.

donaldcampbelljr commented 9 months ago

Just another option:

We can also use pipestat.backend.select to select the row based on the md5sum, e.g.

bbc.bed.backend.select(filter_conditions=[("md5sum", "eq", "eeedeb58d744610e3cdb62fb4be989a8")])

This would return a list of all columns for this row where md5sum=eeedeb58d744610e3cdb62fb4be989a8". This same function can take thecolumns` input as a list of strings if we wish to retrieve only select columns.

nsheff commented 9 months ago

Interestingly, it seemed to work in my hands retriving using the md5sum, as the record identifier. Why did that seem to work for me?

If we want to leave it as a separate column, we would need to make sure column is indexed in the database.

donaldcampbelljr commented 9 months ago

After our discussion yesterday, we decided that the md5sum column will be removed and this value will live in the record_identifier column. The generic name column will be renamed to bedfile_name and bedset_name. I've pushed these changes to dev: https://github.com/databio/bbconf/commit/5d9ebff9a00ca13f6f2bdea6c790acec4c987fa4

nsheff commented 8 months ago

Ok, I think this is solved now with dev pipestat, right?

donaldcampbelljr commented 8 months ago

Yes.