Closed bendichter closed 2 years ago
i do agree that any metadata failure should be clearly communicated.
however, i would prefer not to leave it blank since age is one of those filters that's absolutely necessary in most of neuroscience and structuring it is hard. the ISO 8601 standard allows range, and i believe dandi allows it as well (will need to check). so if they stick age in ISO8601 as a duration i believe that should work.
see 1 and 2 here: https://en.wikipedia.org/wiki/ISO_8601#Time_intervals (i'll check if we support it in the CLI extractor).
more importantly, it would be really helpful if such issues were posted or slacked. we really do not mind any help requests.
age is one of those filters that's absolutely necessary in most of neuroscience
I would like to push back on this claim. In my experience, it really depends. I can see why some labs do not track precise age information for their subjects. I have never done a study where I used age as a parameter. If you are doing developmental neurobiology age is obviously important, but for sensory, motor, memory, etc. it might not be.
the ISO 8601 standard allows range.
Yes, the ISO 8601 standard does allow for time intervals. As you can see from the link you provided, the options they have are
None of these would not be appropriate for age, where we would need two durations, a lower bound and an upper bound. I do not see anything in the ISO8601 standard that could be used for a range of durations. I suppose we could use 1 or 2 for the date of birth, but we I think it would be a bit awkward to indicate a time range for date of birth without an explanation that this is for approximating age.
more importantly, it would be really helpful if such issues were posted or slacked. we really do not mind any help requests.
For non-programmers, it is difficult to post on Slack because 1) the are embarrassed and 2) they don't know what went wrong or what to post. The initial report that I got from this user was that they ran all the commands in the command window and no data showed up on DANDI. Upon meeting, there were several issues I discovered over multiple hours-long support sessions:
None of these issues were clear to the user, and working with them honestly they weren't clear to me either until I really dug in.
since you posted the other issues, let's focus this one on age.
start/duration
for it. we would have to build semantics, which i don't like, but possible.<duration>/<duration>
(e.g., P24H/P48H
). @djarecka - since you wrote the age parser, what did we do with ranges? i remember there were some dandisets with ranges.we should start moving the age parser to expect these formats or raise an exception. the only reason we wrote the rules was to move to this schema for existing files.
I believe we had some discussion about it but we didn't reach the conclusion. I believe the parses would give an error now...
I've checked again the ISO document and I can't find what we need, but I believe using <duration>/<duration>
would be the closest to the iso8601 spirit. A single hyphen should not be used for it, but double hyphens could be an option
Based on section 3.3.4 and 4.4.2 of this document btw. this is not the final version, but the final is paid...
Allowing <duration>/<duration>
would be an improvement. It's not official ISO 8601 but I agree it is in the spirit. However, even with this improvement there are still three additional use-cases we would not cover:
It really does not make sense for us to enforce that age must follow a formal system if that formal system does not handle these three use-cases. Doing so will cause a lot of frustration for many of our users and for me as I try to help them.
Here is a way we could change the NWB schema that could cover all of the use-cases:
DANDI would check if age
is valid ISO 8601 duration. If not, it would check for upper and lower bounds, both of which, if supplied, must be ISO 8601 duration. age.reference
is by default assumed to be "birth" but can be set to "fertilization." Other terms may be added for age.reference for studying species with different types of reproductive patterns as these cases are brought to our attention.
Here are some examples of how the system would be used:
A human over 90 would be:
An embryo of 2 months would be:
The fly use-case that started this thread would be:
An alternative approach here could be the <duration>/<duration>
format:
A C. Elegens might be:
The question then is what do we allow vs. not? In the human case, you would be missing an upper bound. Is that acceptable? Would be require C. Elegens to include a lower and upper bound?
@satra to answer your question about the nwb inspector, we do have a check for this. It raises the message:
2.7 check_subject_age - 'Subject' object at location '/general/subject' Message: Subject age, '24-48 Hrs', does not follow ISO 8601 duration format, e.g. 'P2Y' for 2 years or 'P23W' for 23 weeks.
However this is listed as a Best Practice Suggestion. I have created a draft PR to promote these checks to Critical here but again, I don't think we should make strict requirements like this until we really consider the different use-cases of our users.
i like changing the age structure. it would be nice if NWB changed age, genotype, and others to match dandi for example, but if we keep str we can structure it as
<duration>/reference
would be the default for precise age
<duration>~/reference
(the ~
is an ISO8601 extension for approximate)
<duration>/<duration>/reference
the first / would be required to indicate bounds and therefore two //s would be required when reference is provided. to avoid this, we could consider a different symbol for the reference separator, but it cannot be anything in the iso standard.
where reference is in ["Birth", "Gestation", ??"Culture"] what is the age of an organoid? when reference is not provided, birth is assumed, and the inspector could say so in it's report.
i would say for age range, either side could be optional to indicate, younger than or older than.
regarding reference, currently dandi uses BirthReference
or GestationalReference
.
I don't think we should make strict requirements like this until we really consider the different use-cases of our users.
use-cases are going to evolve, but making something capture current use-cases is a sufficient approach for strictness. we can adjust as new use cases come up instead of trying to be exhaustive. that's what versioning a schema gives us.
Instead of using lower bound, upper bound we can use a syntax with ../upper_bound
, lower_bound/..
as suggested by ISO in the extension, see part 4.4.
However there is a question what are the form we accept from the user. Previously was making some effort to guess what a user had in mind when using free text (e.g. 2 weeks 1day
), but when we want to cover more examples, the guessing will be harder if not impossible and we might need to enforce ISO8601-dandi syntax. What might be a good thing.
However there is a question what are the form we accept from the user.
i believe the plan here is not to interpret, so as of some version of our schema, we would not allow any age metadata interpretation, beyond what we are discussing here. so no free text. that would always generate critical dandi-related warnings, hopefully at the inspector level.
should I start changing the function? some previous dandisets will fail if we enforce ISO8601 and remove my "guessing" part
i think we should start a PR, but let's not remove the guessing part yet. let's first implement range support and tests based on this discussion.
:rocket: Issue was released in 0.46.1
:rocket:
I was just on a call with a user trying to upload data and she was blocked by a few things.
One of the problems was that she was expressing age as "24 to 48 hrs". This is valid in NWB. Our recommended best practice is to use ISO 8601 duration format, but that is not a requirement. DANDI failed with a message "failed to...". After some digging it became clear the message was "failed to extract metadata" and that age was the culprit. This was really bad user experience. She has been trying to upload this data for weeks, spending hours of her time on debugging and was clearly exasperated at this point. Even with me remote controlling into her computer it took 40 minutes to sort all of this out. In the end, she'll need to remove the age field and run her conversions again, which creates the same result as if dandi had simply ignored age values it cannot parse to begin with. That would be my preferred behavior: to permit upload and simply not extract age metadata if it cannot be parsed.
We have had several conversations with the NWB team about how best to handle cases like this in the standard when the exact age is not known. I think ultimately it might be useful to have lower_bound and upper_bound as optional attributes of the age dataset, both of which, if included, must be ISO 8601, so that age ranges can be parsed by DANDI. Until then, would it be possible to be a little more relaxed about age metadata?