airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Move ADC custom fields to ADC extensions mechanism #627

Closed javh closed 2 years ago

javh commented 2 years ago

Following up on #602. Moved the ADC custom fields to the ADC extensions mechanism, per the decision in #589 and related issues.

Specifically:

bcorrie commented 2 years ago

@javh this doesn't work with the way the API query specification (adc-api.yaml) and the AIRR data model (airr-schema.yaml) are separated and used - see my comment here:

https://github.com/airr-community/airr-standards/pull/602#issuecomment-1209779855

The problem is that our repositories build the data model from the AIRR Schema data model. If a field isn't in the data model, there isn't a standard way to store, query, and use that field... The AIRR validation will not check the fields, etc.

I get very worried when these two diverge, in an ideal world the API spec would "just" reference the AIRR data model. From my perspective, there should be very few cases where the ADC API would require fields outside of the data model and that in all but very special cases, fields that the ADC requires should be part of the data model...

javh commented 2 years ago

@bcorrie. Gotcha. I understand (more or less). However, we don't have time/bandwidth to resolve this for v1.4, so let's address it in the next release. The ADC API will need to workaround it for now.

bcorrie commented 2 years ago

I agree we don't have bandwidth to make changes.

I would rather leave the spec as is and not move these fields from the AIRR Data Model. We have been building code against the current spec, if we move these fields now then as you say we have to make a work around - which I would rather not do.

So I would vote for not accepting this pull request

javh commented 2 years ago

Adding ADC custom fields to the spec is not what we agreed, so we'd have to come to a new agreement to include these fields in the release, which just isn't realistic right now.

I agree that an implementation workaround is annoying, but it is what it is. Besides, it's better than adding something that may be deprecated in the next release.

bcorrie commented 2 years ago

I would argue we did agree that we would add custom fields - sorry to be a pain - but I fundamentally believe that the ADC needs fields like these in the AIRR Standard data model. It is not sufficient for them to be part of the API specification. Just like annotation tools need fields beyond the MiAIRR Standard, the ADC needs additional fields. Having different specs for different layers of use/analysis is a bad thing in my opinion. Maybe we need to revisit that with the proliferation of objects in the spec, but now is not the time for that.

The reality is that those additions were agreed at the time they were made (through pull requests) as with most changes that were added to the spec.

https://github.com/airr-community/airr-standards/pull/493 https://github.com/airr-community/airr-standards/pull/574

The ADC fields have been part of the spec since Jun 11, 2021 and Apr 27, 2022 respectively and taking them out now is a significant last minute change IMO. I see that I changed the adc_annotation_cell_id after the last review from @bussec (although I did ask for another review), so I see how you can disagree with that field (so if necessary take it out - I give up). With that said, I feel strongly about removing the ADC fields in their entirety at this late stage. We have had almost two years (since Oct 15, 2020 - #472 ) to discuss the +/- of having them so removing them now is not appropriate...

javh commented 2 years ago

@bcorrie. The resolution to #589 is specifically for adc_annotation_cell_id to go into the ADC extensions. If we missed that detail approving PRs, that was a mistake. I'm not willing to reopen the discussion on #589 right now and I wager no one else is either.

We know your opinion. Your concerns are legitimate. Add the topic to the agenda for the next call. We do need to figure this out eventually, but I think a long term solution is non-trivial and needs to involve the CRWG.

Nothing is final until the release review. There are always last minute changes, cuts, and errors before the release. That's normal. If we ever want to tag the v1.4 release, then perfect cannot be the enemy of good. Let's focus on what is achievable this week.

As far as adc_publish_date and adc_update_date go, I'm personally fine with including them in the main spec for reasons I stated in #602. If we can reach consensus to include those two fields by Wednesday afternoon (August 17), then I'm good with that. Then I can wrap up the release Thurs-Sun. If the release doesn't happen this week, then it's going be at least another 6-8 weeks due to my calendar.

bussec commented 2 years ago

@javh @bcorrie My two cents:

  1. Let's not re-open #589 now, v1.4 is overdue, we need to get it out now.
  2. I would be fine with leaving adc_publish_date and adc_update_date in the main spec, as they are in my mind simply metadata about a study that refer to an event that happened in the ADC.
  3. adc_annotation_cell_id is a different beast, as it doesn't feel right to move it to the ADC extensions (because it an alternative identifier that has nothing to do with the ADC), but it keeping it in the main spec raises all the other issues discussed before. This is something that clearly will be on the slate again for v2.0 so I would opt for moving it to the ADC extensions for now.
scharch commented 2 years ago

The resolution to #589 is specifically for adc_annotation_cell_id to go into the ADC extensions. If we missed that detail approving PRs, that was a mistake.

bcorrie commented 2 years ago

I think we all agree that adc_annotation_cell_id is controversial - so for now, I am not concerned where it lands. I can accept that it jumped from the extension to the spec without enough review so we can move it back to the extensions for now if you want...

What I am concerned about is the statement above

Moved the ADC custom fields to the ADC extensions mechanism

Implying that ADC required fields don't belong in the spec 8-) But I think we are OK there as we have @javh saying:

As far as adc_publish_date and adc_update_date go, I'm personally fine with including them in the main spec for reasons I stated in #602. If we can reach consensus to include those two fields by Wednesday afternoon (August 17), then I'm good with that. Then I can wrap up the release Thurs-Sun. If the release doesn't happen this week, then it's going be at least another 6-8 weeks due to my calendar.

And @bussec saying:

  1. I would be fine with leaving adc_publish_date and adc_update_date in the main spec, as they are in my mind simply metadata about a study that refer to an event that happened in the ADC.

So if adc_publish_date and adc_update_date go back in the main spec and we wrestle with the other mess later, I am good with that...

javh commented 2 years ago

Implying that ADC required fields don't belong in the spec

Yeah, that's correct. ADC custom fields don't belong in the spec (whether the ADC requires them is irrelevant). Same for other tools that implement the spec. Ie, tools implement the spec, they don't define it. The ADC gets "most favored nation" status, but there's going to be some differences, especially as we wind down the standards work.

Regardless, seems like we are close to consensus on adc_publish_date and adc_update_date, albeit for different reasons. I, like @bussec, do not see these as ADC custom fields, but instead study fields that describe data publication. If they were ADC custom fields, then I would cut them, which is why I cut them after the conversation in #602. Doesn't really matter right now, because it looks like we can agree on the "what" without agreeing on the "why". Just important context because we may have to clarify the semantics for v2.0 (ie, are these user populated or ADC populated).

Let's give everyone until tomorrow to chime in. Then I'll make whatever changes are needed.

schristley commented 2 years ago
  1. I would be fine with leaving adc_publish_date and adc_update_date in the main spec, as they are in my mind simply metadata about a study that refer to an event that happened in the ADC.

MiAIRR was defined before the ADC. In my discussion with Jason, I suggested that for MiAIRR V2 that we might consider creating a new category of fields, maybe like a set 7 or an extended set 6, which is essentially additional minimal information that should be provided when a study is put into the ADC. These two date fields fall into that category. This would provide "justification" to have them in the schema. That's part of the larger discussion if MiAIRR should be extended beyond just SRA submission and actively promote ADC submission.

javh commented 2 years ago

Just to piggyback on @schristley's comment. I think it would be good to promote deposit into the ADC. We also discussed how to we might add ADC field validation to the R/python libraries via a flag that pulls in the extensions. Plus, how that's all going to work once the Standards get (effectively) frozen, but the ADC API still needs updates. Bunch of stuff to sort out later.

bcorrie commented 2 years ago

I have tabled this for discussion at CRWG tomorrow AM, but based on the above, assuming adc_publish_date and adc_update_date go back in the spec, I am comfortable recommending that CRWG accept this change. The rest we can sort out later...

javh commented 2 years ago

No, with no objections I've moved adc_published_date and adc_update_date into the spec and will go ahead and merge this

bcorrie commented 2 years ago

We discussed this at the CRWG meeting today, and agreed that this was an acceptable way forward.

There is definitely interest from the CRWG in further discussion on how we can ensure that the needs of the AIRR Data Commons are met in the AIRR Standards.