airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Germline lang #581

Closed williamdlees closed 2 years ago

williamdlees commented 2 years ago

Added R and Python support for GermlineSet and GenotypeSet

williamdlees commented 2 years ago

Some notes on this: I have updated the news items and so on but have left the version number as 1.4 in both cases because it looks as though there are unreleased changes.

The Python 2 unit tests are failing because json.loads() always returns strings as unicode. The repertoire tests use yaml; if they used json, they would have the same problem. I could fix this by coercing all the unicode to strings just after the json loader, or by changing the schema validator to recognise unicode as a valid string. Or perhaps we should drop support for Python 2 now? Please advise.
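For illustration, the validator change I have in mind would be something like this (just a sketch; the function and variable names are illustrative):

# Sketch only: accept unicode as a valid "string" on Python 2, where the json
# loader returns unicode rather than str. Names here are illustrative.
import sys

if sys.version_info[0] == 2:
    string_types = (str, unicode)  # 'unicode' only exists on Python 2
else:
    string_types = (str,)

def is_schema_string(value):
    # True if 'value' satisfies a "type: string" field in the schema
    return isinstance(value, string_types)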

schristley commented 2 years ago

Hi @williamdlees , thank you very much for this. Unfortunately, some of the code will require some rework. I'm sorry I didn't relate this to you earlier (I would have if I had known you were writing this code), but we decided in an earlier discussion to avoid the route of creating more read/write functions that are specific to each object (e.g. load_repertoire, load_germline_set) and instead have generic read_airr/write_airr functions for the AIRR Data File. As is typical, we put off doing the work until it was actually needed, which is clearly now...

If you look at the AIRR Data File schema, you can see that it can 1) hold multiple AIRR objects in a file, 2) be read/written with standard json/yaml, and 3) be validated by walking through the keys for each object type that is present. Hopefully that also explains a bit why I changed the file format for germline sets. We would still keep the old repertoire functions for backward compatibility. And, to be thorough, we'd rework the doc a bit to describe the AIRR Data File, and all the relevant schema objects could just point to it.
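For illustration, the rough shape of what I mean (a sketch, not actual library code; the per-type validators are hypothetical):

# Sketch: an AIRR Data File is plain yaml/json with one key per object type,
# so a generic reader can load it with a standard parser and validation can
# walk the top-level keys. Names here are illustrative.
import yaml

def read_airr_data_file(path, validators=None):
    with open(path) as handle:
        data = yaml.safe_load(handle)
    if validators:
        for key, objects in data.items():
            if key in validators:          # e.g. 'Repertoire', 'GermlineSet'
                validators[key](objects)
    return data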

Also, an AIRR Data File or a set of AIRR Data Files (as well as AIRR TSV files) could be provided together as an "AIRR data set", and there would be a manifest #426 to tie the files together, but we are still working on the manifest definition.

I'm unfortunately in "grant hell" at the moment, otherwise I'd take a pass at changes, but I encourage you to give it a pass if you can. Sorry again for all the rework involved but hopefully much of what you've written can be kept, just restructured.

williamdlees commented 2 years ago

Thanks Scott. I don't have any issue with the overall direction, but I don't feel confident about taking on the restructuring myself, because I can't see clearly enough from the thread you reference what the high-level functions would be, or how they might work if, for example, one just wanted to extract one element or type of element from a compound file, list what's in it, and so on. Perhaps it is best that I leave this to you and others, and let you draw on the functions and test cases I wrote for germline sets? Also, I hope the Python support for compound objects will be useful. I am happy to contribute as the restructuring moves forward, but I don't think I'm the right person to take the lead.

Sorry, I should have let you know I was planning to work on this. I was pressed to add support by the SW WG. I guess they can use my branch for the time being.

schristley commented 2 years ago

Hi @williamdlees , is it correct that GenotypeSet is supposed to be a top-level object? The earlier work only had GenotypeSet embedded within a Subject, which implies that it does not stand alone. However, in this PR you have a genotype test file with a top-level GenotypeSet, which implies they can be loaded and modified by themselves, unattached to a Subject. I think that makes sense, but wanted to verify with you.

schristley commented 2 years ago

Hey @javh , I just added the generic load_airr/write_airr functions that we discussed earlier. Feel free to take a look, but the changes are pretty straightforward. Actually, did we want to rename load to read?

My main question is how do we want to handle deprecation? Specifically, the load_repertoire, validate_repertoire, and write_repertoire functions should be deprecated, and users should switch to the generic functions. One idea is to just have a deprecation message print out when the functions are called, but is there a more pythonic way?
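For example, one option is the standard warnings route. A sketch only (signature abbreviated; read_airr here stands for the generic reader discussed above, whatever we end up naming it):

import warnings

def load_repertoire(filename, **kwargs):
    # deprecated thin wrapper around the generic reader added in this PR
    warnings.warn(
        "load_repertoire is deprecated, use read_airr instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return read_airr(filename, **kwargs)

One caveat: DeprecationWarning is hidden by default outside of __main__ and test runners, so a plain message to stderr might be the way to go if we want it always visible.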

williamdlees commented 2 years ago

Yes, GermlineSet could either be a top-level object, for example in the output from an inference tool, or it could be embedded in a Subject record.

schristley commented 2 years ago

Hello @williamdlees , I've decided to disengage from the AIRR Community. I'm sorry that this affects your work. I've been able to modify much of the python code for the new design, but unfortunately none of the R code has been modified. I've created a checklist in the top comment indicating what I consider the main tasks, what I've been able to accomplish, and what's left to do. Hopefully somebody in the AIRR Community with R skills can come forward and finish the work.

javh commented 2 years ago

From the call:

javh commented 2 years ago

I'm (finally) working through this and I have a question. Looking at the Repertoire example yaml file, Repertoire contains a list of Repertoire objects. Is this correct? Are we still doing that? I thought RepertoireGroup/Set was going to replace that.

I see:

Repertoire:
  - repertoire_id: 1841923116114776551-242ac11c-0001-012
    study:
      study_id: PRJNA300878
...
  - repertoire_id: 1602908186092376551-242ac11c-0001-012
    study:
      study_id: PRJNA300878
schristley commented 2 years ago

I'm (finally) working through this and I have a question. Looking at the Repertoire example yaml file, Repertoire contains a list of Repertoire objects. Is this correct? Are we still doing that?

Yes, don't think there was discussion on changing that. It's still the simple way to list all the repertoires in a data set.

I thought RepertoireGroup/Set was going to replace that.

RepertoireGroup only holds the repertoire_id so the complete repertoire is provided elsewhere, i.e. in the list above.

javh commented 2 years ago

Okay, got it. One more question then: there are no names for the list of repertoires returned by the read function, it's just an unnamed list. To get the IDs you have to do something like

sapply(x$Repertoire, `[[`, "repertoire_id")

Is this intended? And consistent between python/R (I'm starting with the R package).

schristley commented 2 years ago

Okay, got it. One more question then: there are no names for the list of repertoires returned by the read function, it's just an unnamed list. To get the IDs you have to do something like

sapply(x$Repertoire, `[[`, "repertoire_id")

Is this intended? And consistent between python/R (I'm starting with the R package).

Yeah, to handle the situation where IDs haven't been assigned yet, thus a list instead of a dictionary. The python doc shows how to use a comprehension to generate a dictionary if you know/expect them to have IDs.
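For reference, the comprehension in question is the same pattern as in the workflow sketch further down the thread:

# key the unnamed list by repertoire_id once you know the IDs exist
rep = { x['repertoire_id'] : x for x in data['Repertoire'] }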

javh commented 2 years ago

Okay. We could make that an argument. Eg, names=FALSE which when set to TRUE will extract the IDs.

Last question for at least the next 5 minutes... :) The reader functions return a list of top level objects in the file, so you have to do:

x <- read_repertoire(file)
x$Repertoire

To get the actual Repertoire data. I get why you would want to do this with the generic functions, as you might have multiple top-level objects in a file (especially the metadata file with Repertoire in it). But, I find it somewhat confusing that a function named read_repertoire does not return the Repertoire object explicitly (and specifically). Thoughts?

schristley commented 2 years ago

To get the actual Repertoire data. I get why you would want to do this with the generic functions, as you might have multiple top-level objects in a file (especially the metadata file with Repertoire in it). But, I find it somewhat confusing for a function named read_repertoire not to return the Repertoire object explicitly (and specifically). Thoughts?

Yeah, that's why we discussed changing the function names to generic load_airr/write_airr but need to decide how and if to deprecate the read_repertoire and other functions.

javh commented 2 years ago

Ah, okay, all these check boxes are starting to make sense to me. v1.3.1 didn't have read_repertoire, so we don't need to worry about deprecation. Let me chew on it a little, but I think we can get away without any deprecation by keeping the v1.3.1 function names read_airr and write_airr, but add a format argument and set the defaults for them differently. Maybe we deprecated the _alignment functions because they are never leaving experimental.

Then we'd really just have read/write rearrangement (tsv) and read/write for AIRR Data Model data.

schristley commented 2 years ago

Ah, okay, all these check boxes are starting to make sense to me. v1.3.1 didn't have read_repertoire, so we don't need to worry about deprecation. Let me chew on it a little, but I think we can get away without any deprecation by keeping the v1.3.1 function names read_airr and write_airr, but add a format argument and set the defaults for them differently. Maybe we deprecated the _alignment functions because they are never leaving experimental.

R and python were inconsistent with the v1.3.0 release; only python had the repertoire functions. The R functions are new and haven't made it into a release yet. For python, maybe it's worthwhile for those repertoire functions to print out something like "read_repertoire is deprecated, please use read_airr"? Then they can be completely deleted in the next release.

Then we'd really just have read/write rearrangement (tsv) and read/write for AIRR Data Model data.

and validate! I think our validation now works with all schema objects.

javh commented 2 years ago

and validate! I think our validation now works with all schema objects.

Yep! Seems to be working well. I'm going to make some tweaks so the schema is autodetected (by name) rather than passing in the schema explicitly. This will have the consequence that we cannot put multiple objects with the same name in a file. I suppose we could implement support for duplicate objects, but I'm not sure we should.

javh commented 2 years ago

The R package is nearly done. I've reworked/simplified the interface, updated the vignette, and cleaned up the man pages. There are a few things I still have to do, but if someone could sanity check the new interface that would be great. Easiest thing would be to skim the docs. I fixed the RTD build, rebuilt the R docs, and posted them here: https://docs.airr-community.org/en/germline_lang/packages/airr-R/overview.html

Outstanding issues:

  1. The big one is that the example/test GermlineSet and GenotypeSet files aren't compliant (the ones in lang/R/inst/extdata). Does anyone have a new/correct version of these I can drop in? The warnings I see are:

Warning: GermlineSet object is missing AIRR mandatory field(s): germline_set_id, author, lab_name, lab_address, release_version, release_description, release_date, germline_set_name, germline_set_ref, species, locus, allele_descriptions

Warning: GenotypeSet object is missing AIRR mandatory field(s): receptor_genotype_set_id

  2. The Tree schema can't be loaded with load_schema. Probably because of additionalProperties, but I haven't had a chance to figure it out yet. So, schema checks are temporarily failing because I commented out the link to Tree in DataFile until I figure it out.

  3. Add a couple of unit tests for repertoire/germline. Wasn't planning to do much with this.

  4. There are several items related to DataFile in the check list. Not sure I want to get into this for v1.4. How necessary are these?

  5. I'll look at the python package once the R package is done.

williamdlees commented 2 years ago

Jason,

Re. the example genotype and germline set, it says it’s missing all the mandatory fields in the outer layer. Perhaps it doesn’t like the way they are wrapped?

{
  "GermlineSet": {
    "germline_set_id": "OGRDB:G00007",
    "author": "William Lees",

It looks to me as though it is asking for the outer { and the GermlineSet key to be removed (which is weird, because I could swear it wanted them before!)

javh edit: removed extra email text.

javh commented 2 years ago

Thanks, @williamdlees. That shouldn't matter - it works fine for Repertoire. If the example files look current to you, then it must be a bug in the validate function. I'll poke at it.

javh commented 2 years ago

The issue was that validation was expecting a list of objects under the object name, like how the Repertoire file is set up, so it was parsing each property in GermlineSet as a separate GermlineSet object. (Note, I combined the GermlineSet and GenotypeSet examples into a single JSON file in what's shown below.)

> library(airr)
> f1 <- system.file("extdata", "repertoire-example.yaml", package="airr")
> f2 <- system.file("extdata", "germline-example.json", package="airr")
> repertoire <- read_airr(f1)
> germline <- read_airr(f2)
> glimpse(repertoire)
List of 1
 $ Repertoire:List of 3
  ..$ :List of 5
  .. ..$ repertoire_id  : chr "1841923116114776551-242ac11c-0001-012"
  .. ..$ study          :List of 13
  .. ..$ subject        :List of 15
  .. ..$ sample         :List of 1
  .. ..$ data_processing:List of 1
  ..$ :List of 5
  .. ..$ repertoire_id  : chr "1602908186092376551-242ac11c-0001-012"
  .. ..$ study          :List of 13
  .. ..$ subject        :List of 15
  .. ..$ sample         :List of 1
  .. ..$ data_processing:List of 1
  ..$ :List of 5
  .. ..$ repertoire_id  : chr "2366080924918616551-242ac11c-0001-012"
  .. ..$ study          :List of 13
  .. ..$ subject        :List of 15
  .. ..$ sample         :List of 1
  .. ..$ data_processing:List of 1
> glimpse(germline)
List of 2
 $ GermlineSet:List of 17
  ..$ germline_set_id      : chr "OGRDB:G00007"
  ..$ author               : chr "William Lees"
  ..$ lab_name             : chr ""
  ..$ lab_address          : chr "Birkbeck College, University of London, Malet Street, London"
  ..$ acknowledgements     : list()
  ..$ release_version      : int 1
  ..$ release_description  : chr ""
  ..$ release_date         : chr "2021-11-24"
  ..$ germline_set_name    : chr "CAST IGH"
  ..$ germline_set_ref     : chr "OGRDB:G00007.1"
  ..$ pub_ids              : chr ""
  ..$ species              : chr "Mouse"
  ..$ species_subgroup     : chr "CAST_EiJ"
  ..$ species_subgroup_type: chr "strain"
  ..$ locus                : chr "IGH"
  ..$ allele_descriptions  :List of 2
  .. ..$ :List of 39
  .. ..$ :List of 39
  ..$ notes                : chr ""
 $ GenotypeSet:List of 2
  ..$ receptor_genotype_set_id: chr "1"
  ..$ genotype_class_list     :List of 1
  .. ..$ :List of 6

I added a hack to validate_airr to check for empty top-level names (like the Repertoire example). Then it treats the value as a list of objects of that type. I can't say I like that much, as it's not robust. If you want to name each object, say using the repertoire_id, then validation will break.
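For concreteness, on the python side the hack boils down to something like this (a sketch of the python-side analogue; the R side checks for empty names() instead):

# Sketch: if the value under a top-level key is a plain list, treat it as a
# list of objects of that type; otherwise treat it as a single object.
def as_object_list(value):
    if isinstance(value, list):
        return value      # e.g. the Repertoire example: an unnamed list
    return [value]        # e.g. the GermlineSet example: a single object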

Suggestions?

javh commented 2 years ago

R package is done except for:

  1. Above question about object lists vs single entries.
  2. I dropped the DataFile schema caching, because of the weirdness with Tree. Need to discuss this. I'm inclined to skip for v1.4.
  3. JSON and YAML objects are not exactly identical, because simple character arrays are loaded as lists from the JSON parser and vectors from the YAML parser. Not sure if/how to fix this.
  4. Updating the NEWS.
javh commented 2 years ago

Scratch the R package JSON primitive type vector issue. I figured it out.

javh commented 2 years ago

Python package is now done except for the same stuff as the R package:

  1. I used the same hack to distinguish records containing lists of objects from individual objects. Yay or nay?
  2. I dropped DataFile support. Again, something to discuss. There's already a lot going on in this PR, so I'm going to make a separate one for DataFile regardless of what we decide.
  3. NEWS. Will update these after changes are agreed upon and approved.

The GermlineSet test files need a sanity check. I had to make the following changes to pass tests:

  1. Make species a dictionary/ontology (it was a single string before). This passes in the R package, but failed in the python package. Presumably because the R package isn't checking that 'object' properties are analogous to a dict (haven't had a chance to look at this yet).
  2. Add the allele_description_ref property. This is set nullable: false but is not listed in required. Thus, if the property is missing, it fails validation in the python package but passes in the R package. Not sure what the correct behavior is supposed to be.
  3. Rename notes to curation.
  4. Remove escape characters from text blocks (eg, \r\n) as this caused the python JSON parser to choke.
schristley commented 2 years ago
  1. I used the same hack to distinguish records containing lists of objects from individual objects. Yay or nay?

I might have missed some messages, but why do you need a hack? Distinguish list from individual objects, why is this needed?

javh commented 2 years ago

I might have missed some messages, but why do you need a hack? Distinguish list from individual objects, why is this needed?

The Repertoire example is a list of Repertoire objects embedded under the title "Repertoire". The GermlineSet example is a single GermlineSet. Seems like each object can be either an object or a list of objects.

schristley commented 2 years ago

I might have missed some messages, but why do you need a hack? Distinguish list from individual objects, why is this needed?

The Repertoire example is a list of Repertoire objects embedded under the title "Repertoire". The GermlineSet example is a single GermlineSet. Seems like each object can be either an object or a list of objects.

Hmm, I designed the DataFile to match how the API works, and both always have a list of objects, never a single object, so I'm not sure where the either/or came from. Are you sure that example isn't just wrong and just needs [] added to be a list?

javh commented 2 years ago

Are you sure that example isn't just wrong and just needs [] added to be a list?

I'm definitely not sure the examples aren't wrong, about either the list or other things. The python examples did have internal brackets, but the same example in lang/R did not. Because I did R first, I "fixed" the python ones...

Is that always supposed to be true for all objects in the schema? Is every object supposed to be a list of objects?

schristley commented 2 years ago

Are you sure that example isn't just wrong and just needs [] added to be a list?

I'm definitely not sure the examples aren't wrong, about either the list or other things. The python examples did have internal brackets, but the same example in lang/R did not. Because I did R first, I "fixed" the python ones...

oh, yeah, another pain with having duplicated tests across languages...

Is that always supposed to be true for all objects in the schema? Is every object supposed to be a list of objects?

In the DataFile and in the API, yes, it's always a list, to support multiple objects. I suppose it's possible to define a schema for both a singular object and a list using oneOf, but it doesn't seem worth the complexity; easier to just make it a list always.

schristley commented 2 years ago

Are you sure that example isn't just wrong and just needs [] added to be a list?

I'm definitely not sure the examples aren't wrong, about either the list or other things. The python examples did have internal brackets, but the same example in lang/R did not. Because I did R first, I "fixed" the python ones...

Sorry, the R example might have led you astray. I'm pretty sure that I got all of the core python code working for the correct schema definitions, with correct examples. If there's disagreement between R and python then I'd rely on python and modify R to match.

javh commented 2 years ago

Neither study nor subject are a list in the Repertoire example. Would Info, DataFile and Ontology objects also be lists? It doesn't seem universal. If it's supposed to be, then we can fix it, but it seems weird to me.

schristley commented 2 years ago

Neither study nor subject are a list in the Repertoire example. Would Info, DataFile and Ontology objects also be lists? It doesn't seem universal. If it's supposed to be, then we can fix it, but it seems weird to me.

No, it's not universal. DataFile isn't supposed to be a normal-form version of the data, so Study and Subject are still embedded in Repertoire. Objects like Info are special as they aren't really data but metadata about the file. The best way I think about DataFile is that it should match the APIs. So if you take the response from an ADC query, you can pipe it to a file and you immediately have a valid format AIRR data file, no additional parsing or conversion required.
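i.e., conceptually something like this (a sketch; the repository base URL below is a placeholder):

# Sketch: the JSON body of an ADC repertoire query response, written to disk
# as-is, should already be a valid AIRR Data File.
import json
import requests

response = requests.post(
    "https://adc-repository.example.org/airr/v1/repertoire",  # placeholder URL
    json={"filters": {}},
)
with open("repertoires.airr.json", "w") as handle:
    json.dump(response.json(), handle, indent=2)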

So what's in DataFile are objects that we might consider "top-level objects", i.e. ones with their own IDs with an explicit ADC query end point. However, DataFile is a bit forward looking to V2, with for example, DataProcessing shown as a top-level object. Also, the experimental objects like Clone and others haven't been tested much in the APIs, so they might change in the future.

williamdlees commented 2 years ago

Regarding the high-level objects, it's certainly ok with me to have these always held as a list in the file, and I will alter the ogrdb code if that's our decision.

javh edit: remove email headers

javh commented 2 years ago

Hrm. Sounds like there's a need for both possible structures then.

Also, I'm still a bit confused about the point of DataFile, but I think I'll save it for a separate PR, just so we can get this one wrapped up.

javh commented 2 years ago

From the call:

@javh tasks:

@williamdlees tasks:

Everyone else:

After that, we're good to merge.

schristley commented 2 years ago

Regarding the high-level objects, it's certainly ok with me to have these always held as a list in the file, and I will alter the ogrdb code if that's our decision. javh edit: remove email headers

@williamdlees Yes, please, can we do this? It sounds like @javh is working himself into a pretzel writing hacks to support a bunch of complexity that we don't even need in the first place. Just add the [] so it can be treated like the other objects (Repertoire, RepertoireGroup, etc). Besides, we need the list so we can store more than one GermlineSet in a file. That's why the documentation for the GermlineSet file format says The file should correspond to a list of GermlineSet objects, using GermlineSet as the key to the list.

DataFile is just a formalization of the Repertoire file format with support for additional object types.

javh commented 2 years ago

Yeah, I'm pretzeling a bit, but I think it'll be good long term. I think the ideal solution is to do both:

  1. Set the expectation that the top level should be an iterable for almost all objects. Right now (and pre-hack), the validator is dependent upon a successful type check and only supports an unnamed list type, so if you were to assign repertoire_id as the key for each individual Repertoire then validation wouldn't work. This is even murkier in R because a list and dict are the same type in R.
  2. Use the same solution designed for the above to figure out if the record is a single object instead of a list or dict of objects. Ie, get solution (2) for free from solution (1).

Re: DataFile, I think I get it now after the call, but I'll make a separate PR.

williamdlees commented 2 years ago

@javh The ‘good’ test data in Python is good, subject to any mistakes I have made in implementing the spec. It matches the current output from OGRDB.

Are you saying that you are finding the backslash character followed by an r or an n in the text block, or the actual ASCII carriage return and line feed characters? I can easily get rid of the former, but I wonder if you have an example? Would like to make sure I nail it where you are seeing it…

javh edit: Remove email headers

javh commented 2 years ago

@williamdlees, the backslashes. I didn't fix lang/R/tests/data-tests/bad_germline_set.json. You can see the problem on line 345:

From: "Imported to OGRDB with the following notes:\r\nwatson_et_al: CAST_EiJ_IGHV8-2"

To: "Imported to OGRDB with the following notes: watson_et_al: CAST_EiJ_IGHV8-2"

I really should fix the python library so it can handle these characters, but it'd be nice to start from clean data first if feasible (like the list issue).

williamdlees commented 2 years ago

@javh sorry, I should have spotted that. Looked in the Python test case and assumed the other was the same.

I want to fix it at my end too, don’t want those characters floating around. Is an ASCII line feed ok?

javh commented 2 years ago

@williamdlees , oh, the "good" files should be the same in R/python. I edited them to pass the validation.

If by ASCII line feed you mean "0x0A", then it should be fine.

williamdlees commented 2 years ago

@javh Thanks. Will work up a fix within ogrdb.

schristley commented 2 years ago

Right now (and pre-hack), the validator is dependent upon a successful type check and only supports an unnamed list type, so if you were to assign repertoire_id as the key for each individual Repertoire then validation wouldn't work. This is even murkier in R because a list and dict are the same type in R.

I don't understand, why would assigning repertoire_id's cause validation to not work?

javh commented 2 years ago

I don't understand, why would assigning repertoire_id's cause validation to not work?

Before schema validation, the code does this (abbreviated):

p = 'Repertoire'
rep = data[p]
if not isinstance(rep, list):
    valid = False
    sys.stderr.write('%s contains %s but the data is not an array.\n' % (filename, p))
# All is well

Which is fine for a list. But, if it is keyed with repertoire_id, it will fail:

p = 'Repertoire'
rep = { x['repertoire_id'] : x for x in data[p] }
if not isinstance(rep, list):
    valid = False
    sys.stderr.write('%s contains %s but the data is not an array.\n' % (filename, p))
# All is not so well. Ugh. Help. I'm dying. Ahhhhhhhh.... hhhhh... hh.
schristley commented 2 years ago

I don't understand, why would assigning repertoire_id's cause validation to not work?

Before schema validation, the code does this (abbreviated):

oh, I see, I was never thinking about supporting the user constructed dictionary. Okay that seems reasonable, but can't they just do something like validate( rep.values() ) or something simple like that?

schristley commented 2 years ago

and if it helps, here's some pseudo code (minus some error checking) on how I was thinking/designing the workflow

# read AIRR data file
data = airr.read_airr(file, validate=True)

# do some stuff with repertoires and germline sets, maybe update some fields
rep = { x['repertoire_id'] : x for x in data['Repertoire'] }
gs = { x['germline_set_id'] : x for x in data['GermlineSet'] }

# oh and add some cells
new_cell = airr_cell_template()
data['Cell'] = [ new_cell ]

# validate and write
airr.validate_airr(data)
airr.write_airr(file, data)
javh commented 2 years ago

Thanks, @schristley.

oh, I see, I was never thinking about supporting the user constructed dictionary. Okay that seems reasonable, but can't they just do something like validate( rep.values() ) or something simple like that?

Yeah, they could, but I think it would be nice to avoid that. And maybe allow people to save the keys to the json file if they assigned them. Also, just checking for a dict internally won't distinguish a single-record dict (with keys=properties) from a dict of records (keys=repertoire_id). That's why I was thinking of adding a definition field upon read to explicitly declare the type of rep.values(), instead of just doing a list vs dict type check (as a poor man's alternative to loading the data into an appropriately defined class).

But, I'll mess with it and see if it's worth the trouble. I think it's more important to get v1.4 out than it is to get this perfect right now.

That workflow makes sense to me. It's how the R package is structured. But, right now, the python package validate functions operate on files instead of objects in memory. I'll fix that so it's the same as R and users can validate their objects in memory without having to write them to a file. Just forgot about it until a couple days ago and haven't gotten back to this yet. :)

schristley commented 2 years ago

Thanks, @schristley.

oh, I see, I was never thinking about supporting the user constructed dictionary. Okay that seems reasonable, but can't they just do something like validate( rep.values() ) or something simple like that?

Yeah, they could, but I think it would be nice to avoid that.

Okay, it still feels kinda unnecessary to me but I'll wait and see. Maybe instead of validate( rep.values() ), I should have said airr.validate_object(data['Repertoire']) or such, right, because python is by-reference, the rep and data variables point to the same objects.

That workflow makes sense to me. It's how the R package is structured. But, right now, the python package validate functions operate on files instead of objects in memory.

That's not true. There's the validate_object function in the schema which works for all the schema objects we test. I almost feel it should work with DataFile without any change.

But, I'll mess with it and see if it's worth the trouble. I think it's more important to get v1.4 out than it is to get this perfect right now.

Don't spend time on it then, work on the other stuff, give me a chance to submit a PR for DataFile for v1.4 then, it's not that hard. I actually thought I had written the code, but maybe not...

javh commented 2 years ago

Okay, it still feels kinda unnecessary to me but I'll wait and see. Maybe instead of validate( rep.values() ), I should have said airr.validate_object(data['Repertoire']) or such, right, because python is by-reference, the rep and data variables point to the same objects.

I think you were right the first time. data['Repertoire'] if it's a list. data['Repertoire'].values() if it's a dict.

That's not true. There's the validate_object function in the schema which works for all the schema objects we test. I almost feel it should work with DataFile without any change.

I meant the interface functions. validate_airr in python expects a file path. validate_airr in R expects an AIRR Data Model object (list). Easy fix.
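The fix is roughly this shape (a sketch only; the schema lookup internals are illustrative):

# Sketch: let the python validate_airr accept either a file path or an
# in-memory AIRR Data Model dict, matching the R behaviour.
def validate_airr(data):
    if isinstance(data, str):
        data = read_airr(data)      # load from a file path first
    valid = True
    for key, value in data.items():
        schema = Schema(key)        # schema looked up by object name (illustrative)
        objects = value if isinstance(value, list) else [value]
        for obj in objects:
            valid = schema.validate_object(obj) and valid
    return valid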

Don't spend time on it then, work on the other stuff, give me a chance to submit a PR for DataFile for v1.4 then, it's not that hard. I actually thought I had written the code, but maybe not...

There was some code referring to DataFile explicitly, but I generalized it. Again, in part, because I started with the R package. If DataFile works how I think it does, then it should work as is and no changes are needed. If DataFile isn't what I think it is, then I'd rather not sort it out for v1.4. Regardless, the DataFile schema can't be parsed in R right now, because Tree can't be parsed. We should do a separate PR for DataFile though - just not for v1.4.

Maybe we can have a quick call next week? I think it'll be easier to talk through than type through. We might be getting a bit lost in the weeds. I don't think any of this is complicated to implement.

bussec commented 2 years ago

@javh: Apologies, but I would like to ask for some clarification, as I could not find the examples you shared during the call on Monday. So just to be sure we are on the same page:

Originally, we were expecting top-level entities (e.g., Repertoire, GermlineSet) to all be arrays and look like this:

{
  "TopLevelEntity": [
    { 
      "object_id": "72b78ce1-49c2-46f2-80fa-ae877a72b35a",
      "object_property_foo": "bar"
    },
    { 
      "object_id": "d338d96e-b085-4b73-89ae-7e6ac70dd77e",
      "object_property_foo": "bar"
    }
  ]
}

Then we realized some top-level entities are instead objects:

{
  "TopLevelEntity": {
    "object_id": "72b78ce1-49c2-46f2-80fa-ae877a72b35a",
    "object_property_foo": "bar"
  }
}

To avoid additional case distinctions in the validation code, we therefore decided that every top-level entity should be a list, even if it contains just a single item.

Then we realized that it would be nice if we could access the individual items in the list via a key (usually their object_id), which would then turn the top-level entity into an object itself:

{
  "Repertoire": {
    "72b78ce1-49c2-46f2-80fa-ae877a72b35a": {
      "object_id": "72b78ce1-49c2-46f2-80fa-ae877a72b35a",
      "object_property_foo": "bar"
    },
    "7d338d96e-b085-4b73-89ae-7e6ac70dd77e": {
      "object_id": "d338d96e-b085-4b73-89ae-7e6ac70dd77e",
      "object_property_foo": "bar"
    }
  }
}

And finally, we were discussing the addition of a definition property, which would allow the determination of the applicable schema, so that this is not encoded in the name of the top-level entity anymore:

{
  "MyLittleUnicornsRepertoire": {
    "definition": "Repertoire",
    "72b78ce1-49c2-46f2-80fa-ae877a72b35a": {
      "object_id": "72b78ce1-49c2-46f2-80fa-ae877a72b35a",
      "object_property_foo": "bar"
    },
    "7d338d96e-b085-4b73-89ae-7e6ac70dd77e": {
      "object_id": "d338d96e-b085-4b73-89ae-7e6ac70dd77e",
      "object_property_foo": "bar"
    }
  }
}

Did I capture all of this correctly or is there any aspect that I am missing?