Closed nsheff closed 8 months ago
In the refactored version of pipestat (current dev branch), pipestat now uses record_identifier
instead of sample_name
or project_name
.
I'm a little confused that name
is in the output_schema. I had assumed it would be synonymous with record_identifier
when reporting results via pipestat. What is the actual record_identifier
for these bedfiles? Previously sample_name
was a reserved keyword that was not allowed in the output_schema which is probably why name
was used instead.
Do you have an example error?
Currently, in the database, there are two columns, name
and sample_name
.
name
should be the result_identifier and sample_name
is the record identifier.
Next steps:
I looked into this with a local database where I reported results using bbconf
and then attempted to use the bedhost API to retrieve results.
I am able to report and retrieve results using only bbconf just fine. However, I encounter issues when using the bedhost API.
When bedhost uses bbconf to retrieve using pipestat, many of the endpoints use md5sum
as the record identifier. Unfortunately, md5sum
is actually a result_identifier per the output_schema.
Example:
In bed_api.py
Endpoint @router.get("/bed/{md5sum}/metadata", response_model=DBResponse)
try:
values = bbc.bed.retrieve(md5sum, attr_ids)
This should actually be:
try:
values = bbc.bed.retrieve(record_identifier="bedfile_identifier", result_identifier=md5sum)
Of course, I believe we want the record_identifier=md5sum
Question:
Do we want to remove the md5sum
result_identifier
and have the digest live in the record_identifier
? Or should we put the digest in both columns?
I recall some confusion over this when I was running things as well. But wait: I think you're missing something here. Your second example is not doing what the first example was intended to do, is it? If so, that's a really confusing API... How will you specify which record to use? It's not the string "bedfile_identifier"...
I think the point is that the record identifier should be the md5sum. At least, that's what makes sense to me.
If the record identifier isn't currently the md5sum... then what is the record identifier?
"bedfile_identifier" was just a generic example as to how pipestat.retrieve
currently works.
The record_identifier/sample_name
is whatever string is currently in sample_name
column within the production database.
Just another option:
We can also use pipestat.backend.select
to select the row based on the md5sum, e.g.
bbc.bed.backend.select(filter_conditions=[("md5sum", "eq", "eeedeb58d744610e3cdb62fb4be989a8")])
This would return a list of all columns for this row where md5sum=eeedeb58d744610e3cdb62fb4be989a8". This same function can take the
columns` input as a list of strings if we wish to retrieve only select columns.
Interestingly, it seemed to work in my hands retriving using the md5sum, as the record identifier. Why did that seem to work for me?
If we want to leave it as a separate column, we would need to make sure column is indexed in the database.
After our discussion yesterday, we decided that the md5sum
column will be removed and this value will live in the record_identifier
column. The generic name
column will be renamed to bedfile_name
and bedset_name
. I've pushed these changes to dev: https://github.com/databio/bbconf/commit/5d9ebff9a00ca13f6f2bdea6c790acec4c987fa4
Ok, I think this is solved now with dev pipestat, right?
Yes.
It appears that the bedhost-ui was expecting an attribute called 'name'.
also, the
bedfile_schema.yaml
says there's an attribute called 'name'.pipestat is ingesting this schema...but then, in the actual items retrieved from pipestat, there is no 'name', instead there is
sample_name
.So, this is causing a lot of problems.