airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Addition of Diagnosis TimePoint #749

Closed bcorrie closed 4 months ago

bcorrie commented 4 months ago

closes #735

bussec commented 4 months ago

@bcorrie Is there any special reason why the non-Subject objects contain an actual Event record, instead of a reference (ID) to a record located in an array in Subject?

bcorrie commented 4 months ago

This is a minimal change to make use of the Event object in the current structure that we have. No real semantic or structural changes, just syntax changes so that we are using the Event object everywhere an "event" occurred in our current specification.

With this specification, we still have an array of Diagnosis where each Diagnosis has an Event that captures the Event-like features of a diagnosis. We do not have a general array of Events in Subject.

To do this correctly, with an array of Events, I think is beyond the scope of v2.0. In talking with @schristley for the AKC, it makes more sense for this to wait until v2.1.

So for v2.0 I would suggest we either make some simple changes to introduce the Event concept in v2.0 similar to what we have now, or if we want to consider the redesign of this we should move this issue out of the v2.0 milestone and into v2.1

I am fine with either. The nice thing about doing this now is that we add Event information like TimePoint to Diagnosis, which is currently missing (that was the original issue that was driving this change). But there is a good argument for implementing Event once and doing it correctly - which I don't think fits in the scope of v2.0.

bussec commented 4 months ago

I agree that the "maximalist" version of this is a substantial change and that it will require time and effort to get it done correctly. However, if we take the meaning of a major version number seriously, then IMO it should go into v2.0 exactly for that reason and not into a later point release. If we keep moving more and more stuff to v2.1, we might be able to release v2.0 in time, but will not meet the goal of having only non-major changes afterwards. In the end we need to agree on what we want/need to complete before we are "done" with the schema and what our timeline is. I have added this topic to the agenda of our next call.

bcorrie commented 4 months ago

Yep, but at the same time, if we keep adding major new things we will never release 2.0.

I think the more we see things being pushed to 2.1 the less I am convinced that "having only non-major changes afterwards" might not be practical - unless you are willing to accept that the current WG will be around for another year - and the v2.0 release is now targeted at the 2025 meeting 8-)

Then I can stop pushing so hard to get things done for June!!!

schristley commented 4 months ago

I don't think making a schema change at the last minute is a good idea. The Event concept needs more discussion and how it fits into an overall enhancement of study design curation. In particular because it causes disruption, i.e. code has to change.

I agree that the "maximalist" version of this is a substantial change and that it will require time and effort to get it done correctly. However, if we take the meaning of a major version number seriously, then IMO it should go into v2.0 exactly for that reason and not into a later point release.

That's not how I understand the major version number. A change in the major version number implies new requirements on researchers for MiAIRR for AIRR-seq studies. We don't want to change the MiAIRR standard at the last moment without putting in the time and effort for discussion. Right now V2 is about adding single-cell and a few other things to MiAIRR, that seems sufficient for now.

If we keep moving more and more stuff to v2.1, we might be able to release v2.0 in time, but will not meet the goal of having only non-major changes afterwards. In the end we need to agree on what we want/need to complete before we are "done" with the schema and what our timeline is. I have added this topic to the agenda of our next call.

I'm not sure we will ever be "done". As a domain-specific standards group, we need to keep up with the experimental protocols and other changes in the domain. The enhancement of study design curation, and other stuff, that will come out of AKC and NFDI4Immuno will likely feed back into MiAIRR V3 though it will take a few years.

bcorrie commented 4 months ago

So we are back to the question what is in v2.0 and what is for later discussion.

Currently we have a minimalist change (https://github.com/airr-community/airr-standards/pull/749/files):

These are relatively minor changes that clean up several issues in the original spec.

There are no major changes here IMO, and I would lean towards doing this for v2.0 (with minor extra tweaks) and leave the "re-engineering" of the concept of events in a study to a later discussion/release.

schristley commented 4 months ago

I consider this a fairly major semantic change. Adding a time point seems reasonable, but inserting the Event object in the middle will create a lot of churn when we change things later.

leave the "re-engineering" of the concept of events in a study to a later discussion/release.

IMO it doesn't. If forces the Event object to be used now in all the code. Every data entry screen has to be written to capture an "Event" instead of just recording a time point. Every python script and analysis tool has to write code that traverses through that "Event" in order to access a time point. Every query has to be written to traverse through that "Event" object to operate on the time point. And if all of that gets done, then we will be stuck with this structure because changing it is too much work. I can already see you writing the post later about how it breaks backward compatibility ;-)

I'm not opposed to adding an Event object. I would just prefer it be added in a way that doesn't require mandatory usage now with all the associated data and code changes. Even more so if this is just considered preliminary and we expect the design to change later.

Another idea which will create less churn is to put the Event object inside TimePoint, that is flip the structure, so the new fields in Diagnosis are TimePoint. Here we are just reversing the relationship, instead of saying an Event occurs at TimePoint, we are saying TimePoint is associated with Event. This would allow usage of the time point without requiring an event object to always be in the middle.

bcorrie commented 4 months ago

I am also OK to move this entire change out of v2.0 - I see your point if it is just going to change in the next version why do work now that will be thrown away in the near future.

So I would be OK with moving this out of the v2.0 milestone completely.

bcorrie commented 4 months ago

Remember this all started because Diagnosis doesn't have a TimePoint. It hasn't had a time point for 6 years (since MiAIRR was introduced) and it hasn't appeared to be a problem to anyone up until now except to us, so I am OK for it NOT to have a time point for another year. Especially if this means we get the v2.0 release out the door. 8-)

bcorrie commented 4 months ago

Another idea which will create less churn is to put the Event object inside TimePoint, that is flip the structure, so the new fields in Diagnosis are TimePoint. Here we are just reversing the relationship, instead of saying an Event occurs at TimePoint, we are saying TimePoint is associated with Event. This would allow usage of the time point without requiring an event object to always be in the middle.

The easiest thing to do is just add a TimePoint to the Diagnosis object. This doesn't even break backwards compatibility, it just adds a field. We can revert the changes to collection_time_point_* and drop Event for now.

This actually fixes #735 without changing anything else.

bcorrie commented 4 months ago

@bussec any more thoughts. I am not going to do any more edits on this until we decide what we actually want for v2.0... 8-)

bussec commented 4 months ago

@bcorrie I am ok with only doing small changes until mid-2024 (whatever that version will be called). Right now I have no opinion whether a TimePoint should have Events or an Event should have a TimePoint, but I agree with @schristley that if we only want to do small changes now, the former one is clearly (plus addition of a TimePoint to Diagnosis) is probably the least invasive thing to do.

Having said this, I still think that the whole "description-of-what-happend-when" package (including issues like #486 ) needs to be addressed before we go on hiatus, as it substantially limits the usefulness of the AIRR Schema if we don't.

bcorrie commented 4 months ago

@bussec @schristley I have reverted the Event changes and added a TimePoint to Diagnosis. This resolves #735 and we have agreed to do the big "event" lift after 2.0. So this can be merged I think.

bussec commented 4 months ago

Resolved merge conflicts. @schristley are we good to go?

schristley commented 4 months ago

@bcorrie @bussec Minor point, but the test code creates a template repertoire object and validates it, so it will already test null for that field. My suggestion is that you put the example values into the "good" test data files. Otherwise it looks good to go.

bcorrie commented 4 months ago

Added tests for proper validation for R and python as per @schristley request above. Merging...