Review of AIRR-Standards documentation

ustervbo commented 9 months ago

I reviewed the documentation and I have some questions, comments and suggestions on various sections of the document.

Random comments

Who is the target audience for the document? I am not a computer scientist and I read the document with the thought, 'How can I make our existing data MiAIRR compliant? How can I ensure that I gather the proper information in the future?' Here, I sometimes fall short. Not because I need to invest some time to understand, but because some parts simply seem inaccessible.

Maybe we should standardize the level names in the data model. The section 'MiAIRR-to-NCBI Implementation' uses slightly different terms for the levels. For instance, 'diagnosis & intervention' appears in the bullet list in that section, but elsewhere only in the table in 'MiAIRR Data Elements', where it is written 'diagnosis and intervention'. 'MiAIRR-to-NCBI Implementation' has 'processed sequences with basic analysis results', which is more detailed than the 'processed AIRR sequences' used elsewhere (although 'basic analysis results' is non-descriptive). In the Nat. Comm schematic, there is no 'intervention' and the 6th level is called 'Processed Sequences with Annotations'.

The Repertoire Schema is UTF8, while the Rearrangement Schema is ASCII or UTF-8.
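To illustrate why the two encoding statements may not actually conflict: ASCII is a strict subset of UTF-8, so any ASCII-only file is also a valid UTF-8 file. The following is a minimal sketch of that point; the helper name `decodes_as_utf8` and the sample header text are made up for illustration and are not part of any AIRR tooling.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ASCII-only content: decoding as ASCII or as UTF-8 gives the same text.
ascii_bytes = "sequence_id\tv_call".encode("ascii")
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# A Latin-1 encoded character is NOT valid UTF-8, so the reverse does not hold
# for arbitrary 8-bit encodings.
latin1_bytes = "é".encode("latin-1")

print(decodes_as_utf8(ascii_bytes), same_text, decodes_as_utf8(latin1_bytes))
```

Under this reading, "ASCII or UTF-8" and "UTF-8" describe the same set of acceptable files, which would support stating UTF-8 for both schemas.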

Sometimes we say OpenAPI V2, sometimes OpenAPI V2 and V3. (Actually, I think it's 1 all)

My understanding of the statement "The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file." in 'Repertoire Schema > File Structure' is that we may not know the schema ID. Is this a problem? What is the purpose of the optional INFO field if it does not carry relevant information? If I understand the API correctly - and there is no guarantee that I am anywhere close - the schema is always returned, so the version number may be important.
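To make the concern concrete, here is a minimal sketch of what a reader of such a file has to do when the Info object is optional. It assumes a JSON file whose top level holds an `Info` object and a `Repertoire` list; the function name `schema_version` and the version value `"1.3"` are hypothetical and only serve the illustration.

```python
import json
from typing import Optional


def schema_version(document: dict) -> Optional[str]:
    """Return the schema version from the optional Info object, or None.

    Hypothetical helper: assumes top-level key "Info" with a "version" entry.
    """
    info = document.get("Info")
    if not isinstance(info, dict):
        return None  # Info is optional, so the version may simply be unknown
    version = info.get("version")
    return str(version) if version is not None else None


# Illustrative file contents; the version number is an assumption.
with_info = json.loads(
    '{"Info": {"title": "example", "version": "1.3"}, "Repertoire": []}'
)
without_info = json.loads('{"Repertoire": []}')

print(schema_version(with_info))
print(schema_version(without_info))
```

In the second case the consumer cannot tell which schema version the file follows, which is the situation the question above is about.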

study_description and study_contact in the Study-schema are missing in AIRR_Minimal_Standard_Data_Elements.

genotype in the Subject-schema is not really explained (and does not exist in AIRR_Minimal_Standard_Data_Elements).

Section specific comments

Section: MiAIRR Data Elements
- We say we have 6 high levels
  - Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
- The table has [1-5]/[Level name] because 'processed AIRR sequences' is missing
- Suggestion
  - Separate the table into sections, similar to 'Repertoire Fields' in 'Repertoire Schema'.
  - Take inspiration from 'MiAIRR-to-NCBI Implementation' and use bullet points for the levels and description
  - Explain why 'processed AIRR sequences' is missing - at least acknowledge its absence
Section: MiAIRR-to-NCBI Implementation
- "The current version (1.0) of the standard has been recently published [Rubelt_2017] and was passed by the general assembly at the annual AIRR Community meeting in December 2017." - Is this still true? The current version is 1.0?
- This section seems to be the only place where diagnosis & intervention is mentioned
Section: MiAIRR-to-NCBI Specification
- This sentence is nonsensical: "In terms of standard compliance, it is currently REQUIRED [1] to deposit information for MiAIRR data sets 5 and 6 in general-purpose sequence repositories for which an AIRR-accepted specification on information mapping MUST exist."
- Suggestion: Start with the ubiquitous 'we have six levels' and list them. It is dull reading, but ensures full understanding as a reference document
- The document generally speaks of data sets (which may be true, because of physical distribution) but elsewhere we talk of levels
- Table in 'Element mapping'
  - 'diagnosis & treatment' should be corrected to 'diagnosis and intervention'
- Document is generally difficult to read, but maybe it works extremely well as a reference
- It seems a little out of place in 'Study Reporting', and totally misplaced between 'MiAIRR Data Elements' and 'Requirement Levels of AIRR Schema Fields'. Maybe it should just live in 'Data Submission and Query'?
Section: Requirement Levels of AIRR Schema Fields
- I don't have the finer details of RFC 2119 in mind - I like the glossary in 'MiAIRR-to-NCBI Specification'
- The sentence 'Importantly, fields are not elevated to this level based on' lacks a counterpart. When are fields elevated to essential?
- This sentence makes no sense to me: "However, IF information matching the semantic definition of the field is provided, this field MUST be used for reporting."
- Subsection: Compliance with the MiAIRR Data Standard
  - This should be the first point: Data sets are considered MiAIRR-compliant ONLY IF all essential and important fields are reported.
  - This is not super important and should be last: Compliance to the MiAIRR Data Standard is currently a binary state, i.e., data either is or is not compliant, there are not “grades” of compliance. However, additional requirements for specific use cases might be defined in the future.
  - Who is this sentence for: Note that important fields with NULL-LIKE values MUST NOT be dropped from a data set.
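For readers like me, a small worked example might make the compliance rules quoted above easier to follow. The sketch below is only my reading of them, under stated assumptions: the assignment of `study_id` to 'essential' and `study_description` to 'important' is illustrative rather than taken from the standard, the rule that essential fields must not be null is an assumption, and `compliance_problems` and `is_compliant` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Illustrative requirement levels only; not the standard's actual assignments.
ESSENTIAL = {"study_id"}
IMPORTANT = {"study_description"}


def compliance_problems(record: Dict[str, Any]) -> List[str]:
    """List the reasons a record would fail the binary compliance check."""
    problems = []
    for field in sorted(ESSENTIAL):
        # Assumption: an essential field must be present and not null.
        if record.get(field) is None:
            problems.append(f"essential field missing or null: {field}")
    for field in sorted(IMPORTANT):
        # An important field may hold a null-like value, but the key itself
        # must not be dropped from the data set.
        if field not in record:
            problems.append(f"important field dropped: {field}")
    return problems


def is_compliant(record: Dict[str, Any]) -> bool:
    """Compliance is binary: any problem means the record is not compliant."""
    return not compliance_problems(record)


print(is_compliant({"study_id": "S1", "study_description": None}))
print(is_compliant({"study_id": "S1"}))
```

The first record keeps the important field with a null value and passes; the second drops the field entirely and fails, which seems to be what the 'MUST NOT be dropped' sentence is trying to say.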
Section: Metadata Annotation Guidelines
- Where are we in the six levels? How does this section connect to the rest?
- 'Clarification of Terms' - As for Requirement Levels of AIRR Schema Fields, a repetition of the definition might be appropriate
Section: AIRR Data Representations
- FAIR Principles: I have no idea of grammar in this case, but I like to start each entry with a capital letter
- AIRR Data Model: There are some inconsistencies with 'MiAIRR Data Elements' and 'MiAIRR-to-NCBI Implementation'
- I somehow need a connection to the six levels. I imagine that CellProcessing and NucleicAcidProcessing data model objects belong to 'Sample Processing and Sequencing'?
- I think this is the first place where 'Processed Sequences with Annotations' is actually filled with information
- A positive note: I like the discussion/explanation in 'Relationship between Schema Objects'
Section: Repertoire Schema
- A positive note: I like the explanation of Repertoire.
- What exactly are the types 'SubjectGenotype' and 'SequencingData'?
- The subsection 'Raw Sequence Data Fields' has no content.
Section: Rearrangement Schema
- 'A Rearrangement is a sequence which describes' - should it be 'A Rearrangement is a sequence that describes'? (Add backticks)
- The description of the category 'Alignment Annotations' could point to the CIGAR section.

javh commented 9 months ago

From the call:

Start with restructure in #730, then fix above as part of that PR.
Clarify that Rearrangement Schema is also UTF8.

scharch commented 9 months ago

Related: I notice that the software standard page references "The AIRR Data Representation Working Group," which was decommissioned/folded into Standards ...pre-pandemic?? For that matter, we are tentatively planning to close up shop on the Software WG post-Porto, though the details have yet to be worked out...

ustervbo commented 4 months ago

Work on the documentation has started over at @eriicdesousa's fork: https://github.com/eriicdesousa/airr-standards

airr-community / airr-standards

Review of AIRR-Standards documentation #745
