Uniqueness of _id fields in airr_schema.yaml

bcorrie commented 5 years ago

@schristley I was looking at several of the _id fields in the schema, and I note in the descriptions we do not mention uniqueness criteria for many (any?) of them. I think this is a problem, isn't it??? Am I missing something?

If I go to the rearrangement level, we have several _ids (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id). We don't advise or specify at what level something like a clone_id is unique... Or even a repertoire_id or data_processing_id. MiAIRR specifies that study IDs should be unique (typically an INSDC study related identifier) with subject_ids and sample_ids unique within studies.

It is not well defined what the relationship between _ids is from this level down (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id)

One can probably infer (if you know the AIRR spec well) that repertoire_id should be unique at least within a study, maybe a subject. The reality is that the repertoire_id should be unique at the repository level (as they are the IDs returned by the /repertoire API endpoint), but that isn't actually stated in the spec unless I am missing something...

data_processing_id should be unique within a repertoire_id at least. It feels like data_processing_id should be unique at the repository level as well, so you can easily identify a set of rearrangements that have been processed with the same data_processing without having to do a combined repertoire_id x data_processing_id query, but again, nothing is explicitly stated in the spec.

pair_id, clone_id, and cell_id should probably be unique at least within a repertoire_id/data_processing_id pair. If data_processing_id is unique within the repository, then it is sufficient to say unique within the data_processing_id.

Finally, rearrangement_id should be unique to the repository as well, that is it is the internal identifier for the repository for a single rearrangement entry. This is the only one that states anything about uniqueness at the moment.

Should we review this?

schristley commented 5 years ago

descriptions we do not mention uniqueness criteria for many (any?) of them

The global uniqueness for repertoire_id and uniqueness of data_processing_id within a repertoire is in the documentation for the repertoire schema (look under Linking Data)

http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html

but you are right that most of the other _ids aren't well specified.

schristley commented 5 years ago

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

javh commented 5 years ago

Yeah, they don't seem well documented. For pair_id, clone_id, and cell_id do you mean "unique" in the sense of "a uniquely identifiable clone_id/cell_id/pair_id represents all rows assigned to the same clonal cluster/cell/receptor"? By definition, they won't be unique in the same sense as sequence_id which is a 1-to-1 relationship with id-to-rows, as they are 1-to-many.

I.e., if we change the wording of cell_id from:

Identifier defining the cell of origin for the query sequence.

To:

Identifier uniquely defining the cell of origin for the query sequence.

Does that address the concern? Do we need to specify within the same rearrangement_id, repertoire_id, or file?
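To make the 1-to-1 versus 1-to-many distinction concrete, here is a minimal Python sketch. The rows and identifier values are made up for illustration; only the field names come from the schema. It groups rows by each identifier, so sequence_id maps to exactly one row while cell_id and clone_id each map to several.

```python
from collections import defaultdict

# Hypothetical rearrangement rows; field names follow the AIRR schema,
# values are invented for illustration.
rows = [
    {"sequence_id": "s1", "cell_id": "c1", "clone_id": "k1"},
    {"sequence_id": "s2", "cell_id": "c1", "clone_id": "k1"},
    {"sequence_id": "s3", "cell_id": "c2", "clone_id": "k1"},
]


def rows_per_id(rows, field):
    """Map each value of `field` to the list of sequence_ids carrying it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["sequence_id"])
    return dict(groups)


by_sequence = rows_per_id(rows, "sequence_id")  # one row per id
by_cell = rows_per_id(rows, "cell_id")          # several rows per id
by_clone = rows_per_id(rows, "clone_id")        # several rows per id
```

In this sense a cell_id "uniquely defines" a cell without being unique per row.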

bussec commented 5 years ago

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

It is not stated explicitly for MiAIRR in general, but the NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute, which is a UID (see here).

schristley commented 5 years ago

NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute

Okay, right, so technically it is unique within a data repository and it could (potentially) be globally unique if those repositories have IDs that don't conflict.

bcorrie commented 5 years ago

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

It is not stated explicitly for MiAIRR in general, but the NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute, which is a UID (see here).

We have this statement:

1 / study   Study   string  Free text   Unique ID assigned by study registry    PRJNA001    study_id

in: https://github.com/airr-community/airr-standards/blob/metadata-docs/AIRR_Minimal_Standard_Data_Elements.tsv

Assuming "study registry" is an INSDC repository, then I think we have uniqueness don't we?

bcorrie commented 5 years ago

The global uniqueness for repertoire_id and uniqueness of data_processing_id within a repertoire is in the documentation for the repertoire schema (look under Linking Data)

OK, I had missed that... I think that this should probably be mentioned in the "description" of those fields in the spec, no? I have added some of this to the repertoire_id "description" in the spec file. This is quite an important link between the two API entry points, so I think it should be clear...

schristley commented 5 years ago

added some of this to the repertoire_id "description" in the spec file

Yeah, that's fine for now. In #219, I mention to @bussec about some fields having really long descriptions and that kinda makes the table look not so great, repertoire_id is kinda on the edge of being obnoxious, but we can probably trim it down to be more concise. I think some of this stuff needs to be put in a Definitions Clarification in the docs, like was done with the Rearrangement schema, versus trying to cram it all into the description.

http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html#repertoire-fields

bcorrie commented 5 years ago

data_processing_id should be unique within a repertoire_id at least. It feels like data_processing_id should be unique at the repository level as well, so you can easily identify a set of rearrangements that have been processed with the same data_processing without having to do a combined repertoire_id x data_processing_id query, but again, nothing is explicitly stated in the spec.

So what about this case? From a repository optimization perspective, it would be VERY useful to have data_processing_id be unique at the same level as repertoire_id (unique within a repository). When one queries at the rearrangement level for the set of rearrangements it would be nice to be able to query directly for just the rearrangements for a repertoire processed with a specific tool (a specific data_processing_id).

In fact, I would argue that the rearrangement query that would be most common would be queries at the data_processing_id level, and one would rarely be searching rearrangement data for a specific repertoire_id as a single set of data with different data_processing applied (e.g. MiXCR and igblast with the annotations not separated by data_processing_id). It is more likely you would be asking for a single data_processing_id from within each repertoire that you are interested in. For example, I think common rearrangement query scenarios would be, for a specific set of repertoire_ids that I am interested in:

I want all of the "primary" processed rearrangements for each repertoire. That is, give me all of the rearrangements from all of my repertoires of interest where the "data_processing_id" is the "primary_annotation" for each repertoire. If there is only one data_processing, then that is by default the "primary_annotation"
I want all of the "MiXCR" annotated data. That is, give me all of the rearrangements from all of my repertoires of interest where the "data_processing_id" is the data that has been annotated by MiXCR. That is data_processing.software_versions contains "MiXCR".

These are all building lists of data_processing_ids to search on, and you almost always want to be using a single data_processing object from a repertoire (correct me if I am wrong). The main time you wouldn't want to have one data_processing_id is if you were comparing between data_processing_ids within a single repertoire (comparing the results of MiXCR vs igblast). Even in this case, you would want to split the rearrangement data between the data_processing_ids so you could separate the MiXCR and igblast data for comparison.
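The two scenarios above can be sketched in Python as follows. This is only an illustration under assumptions: the repertoire objects are trimmed to the fields mentioned in this thread, the identifiers and version strings are invented, and the helper names `primary_ids` and `ids_by_software` are hypothetical.

```python
# Hypothetical repertoire metadata, trimmed to the fields discussed above.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "data_processing": [
            {"data_processing_id": "dp-1", "primary_annotation": True,
             "software_versions": "igblast 1.14"},
            {"data_processing_id": "dp-2", "primary_annotation": False,
             "software_versions": "MiXCR 3.0"},
        ],
    },
    {
        "repertoire_id": "rep-2",
        "data_processing": [
            # Only one data_processing, so it is treated as primary by default.
            {"data_processing_id": "dp-3", "software_versions": "MiXCR 3.0"},
        ],
    },
]


def primary_ids(repertoires):
    """(repertoire_id, data_processing_id) pairs for the primary annotation."""
    pairs = []
    for rep in repertoires:
        dps = rep["data_processing"]
        for dp in dps:
            if dp.get("primary_annotation") or len(dps) == 1:
                pairs.append((rep["repertoire_id"], dp["data_processing_id"]))
    return pairs


def ids_by_software(repertoires, name):
    """Pairs whose software_versions string mentions `name` (case-insensitive)."""
    return [
        (rep["repertoire_id"], dp["data_processing_id"])
        for rep in repertoires
        for dp in rep["data_processing"]
        if name.lower() in (dp.get("software_versions") or "").lower()
    ]


primary = primary_ids(repertoires)
mixcr = ids_by_software(repertoires, "MiXCR")
```

Both helpers return pairs rather than bare data_processing_ids, since the docs only promise that data_processing_id is unique within a Repertoire.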

In the cases where there is only one data_processing object, we state that one should use a repertoire_id rather than a data_processing_id. I think this could get quite cumbersome, as then you are generating queries that have a mix of repertoire_id (if there is only one data_processing object in the repertoire) and data_processing_id (if there is more than one data_processing object in the repertoire).

In most cases it seems to me that using data_processing_id rather than repertoire_id will be the rearrangement query of choice. If that is true, we want to optimize our searches at least as well for data_processing_id as we do for repertoire_id. Having data_processing_id be unique at the repository level would help enormously with this...

schristley commented 5 years ago

...

man, that's a lot of words, do I really need to read all that? Did you just have a shot of espresso? ;-D

I'm not against data_processing_id being unique within repository, I guess I'm also okay with it being globally unique like repertoire_id but neither seem to be needed for the common query scenarios that you mention.

Just remember that a data_processing_id won't necessarily get you the rearrangements for all the repertoires in a study, it will only get you the repertoires that were processed the same. There is nothing preventing users from processing repertoires within a study differently. So from that perspective, you will likely need repertoire_id to include/exclude the proper repertoires, and yeah use repertoire_id and data_processing_id as a combo key.

From an implementation perspective, the data_processing_id ends up being unique within VDJServer. This is because we store the analysis provenance as an individual object in the database so it gets a uuid. But, for example, VDJServer processes B and T cells differently, so in a combined study like the Florian study, the B cell rearrangements have a different data_processing_id from the T cell rearrangements.

bcorrie commented 5 years ago

...

man, that's a lot of words, do I really need to read all that? Did you just have a shot of espresso? ;-D

Yeah, sorry, I was challenged to try and capture the problem clearly 8-)

bcorrie commented 5 years ago

From an implementation perspective, the data_processing_id ends up being unique within VDJServer.

Same for iReceptor, and this seems to be really useful, and was one of the drivers for my question. In addition, it looks like the iReceptor Gateway will be extracting data_processing_id from Repertoires and generating rearrangement queries using data_processing_id and NOT repertoire_id. Given the above, it seems to me that there are good reasons to make it a "unique within repository" id and not too many against...

bcorrie commented 5 years ago

But, for example, VDJServer processes B and T cells differently, so in a combined study like the Florian study, the B cell rearrangements have a different data_processing_id from the T cell rearrangements.

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

Would it be possible for you to generate an example /airr/v1/repertoire response for a single repertoire that would have this structure? I think we understand what this would look like, but having a concrete example for us to work with from a Gateway presentation layer would be very helpful!!!

As far as I have seen, the repertoire responses on the docs pages only have a single data_processing object for each repertoire.

schristley commented 5 years ago

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

No.

having a concrete example

look at the florian example data:

https://github.com/airr-community/airr-standards/blob/master/lang/python/examples/florian.airr.yaml

or

the test data set as I've enhanced it somewhat:

https://github.com/airr-community/adc-api-tests/blob/master/datasets/florian/florian.airr.yaml

bcorrie commented 5 years ago

Just remember that a data_processing_id won't necessarily get you the rearrangements for all the repertoires in a study, it will only get you the repertoires that were processed the same. There is nothing preventing users from processing repertoires within a study differently. So from that perspective, you will likely need repertoire_id to include/exclude the proper repertoires, and yeah use repertoire_id and data_processing_id as a combo key.

I think my main point in my rambling above was that it seemed to me that one would almost never do a search at the rearrangement level for a repertoire_id EXCEPT in the case where there was only one data_processing object.

The reason for this is that if any Repertoire has more than one data_processing object, when looking for rearrangements for that Repertoire you are almost always going to want to be explicit about which rearrangements you are retrieving (how they were processed and therefore which data_processing_id), otherwise the rearrangements returned will be very confusing! In my examples above where a Repertoire has more than one data_processing object, you would almost always want either the rearrangements from the "primary_annotation" or the rearrangements that have been processed in a specific way (e.g. by an explicit tool such as MiXCR).

If you have to search by data_processing_id for some rearrangements from some Repertoires, then it makes sense to be consistent and always search for data_processing_id even when there is only one data_processing object.

bcorrie commented 5 years ago

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

No.

OK... Too bad in a way, as we are looking for a concrete example where this would occur in a study...

Currently, as far as I know, all of our data (meaning IPA and VDJServer) has Repertoires with single sample and single data_processing objects. This is easy... The iReceptor Gateway has to handle the situation when a Repertoire can have either an array of sample objects or an array of data processing objects (or both), and it is very unclear to us when this would occur, how this should be presented to the user, and how queries about the rearrangements in such a Repertoire should be generated.

schristley commented 5 years ago

I think my main point in my rambling above was that it seemed to me that one would almost never do a search at the rearrangement level for a repertoire_id EXCEPT in the case where there was only one data_processing object.

Incorrect, you will almost ALWAYS want to use a repertoire_id AND a data_processing_id to get the rearrangements that you want. It's only in the special case when the repertoire has just a single data_processing that you can leave data_processing_id out.

The reason for this is that if any Repertoire has more than one data_processing object...

You are latching onto the scenario of multiple data_processing objects, I agree with all your points about that scenario. But in that scenario, you seem to be indicating that the repertoire_id is not relevant, and that's incorrect. So here is a contrived example:

Given a study that has 10 repertoires. 5 healthy control repertoires and 5 cancer repertoires. They all have a single data_processing object.

Now a user comes along, they do a query for all healthy repertoires, they get those 5 out of 10 repertoires from that study (plus presumably repertoires from other studies).

Now if you do a query on the rearrangements using ONLY the data_processing_id, you will get rearrangements for all 10 repertoires, which is wrong. The only way to get the correct rearrangements is to query on those 5 repertoire_ids AND the data_processing_id.

So the repertoire_id is always needed when querying the rearrangements, that's how the API was designed!

This is regardless of whether the data_processing_id is unique or not. The uniqueness doesn't guarantee that you get the proper repertoires.
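The contrived example can be replayed in a few lines of Python. Everything here is hypothetical: the identifiers, the flat `disease_state` field (a simplification of the real metadata), and the shape of the ADC API style filter at the end, which is an assumption about the query syntax rather than something stated in this thread.

```python
# Contrived study: 10 repertoires, 5 healthy and 5 cancer, all sharing one
# data_processing_id. All identifiers and values are hypothetical.
repertoires = [
    {"repertoire_id": f"rep-{i}",
     "disease_state": "healthy" if i < 5 else "cancer",
     "data_processing_id": "dp-1"}
    for i in range(10)
]

# Two rearrangements per repertoire, each tagged with both identifiers.
rearrangements = [
    {"rearrangement_id": f"{rep['repertoire_id']}-r{n}",
     "repertoire_id": rep["repertoire_id"],
     "data_processing_id": rep["data_processing_id"]}
    for rep in repertoires
    for n in range(2)
]

healthy_ids = {r["repertoire_id"] for r in repertoires
               if r["disease_state"] == "healthy"}

# Querying on data_processing_id alone pulls in every repertoire.
by_dp_only = [r for r in rearrangements if r["data_processing_id"] == "dp-1"]

# Querying on repertoire_id AND data_processing_id returns only the healthy set.
by_pair = [r for r in rearrangements
           if r["repertoire_id"] in healthy_ids
           and r["data_processing_id"] == "dp-1"]

# The same intent expressed as an ADC API style filter (shape assumed here).
adc_filter = {
    "filters": {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "repertoire_id",
                                     "value": sorted(healthy_ids)}},
            {"op": "=", "content": {"field": "data_processing_id",
                                    "value": "dp-1"}},
        ],
    }
}
```

Making data_processing_id unique across the repository would not change the result of `by_dp_only`; the repertoire_id condition is what narrows the set.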

schristley commented 5 years ago

So the repertoire_id is always needed when querying the rearrangements, that's how the API was designed!

That's assuming the standard workflow where you query metadata first to get a list of repertoires, then query rearrangements. Of course, you can also go the other way and query rearrangements first to get a list of repertoires, then lookup their metadata, like if doing a straight CDR3 search.

schristley commented 5 years ago

The iReceptor Gateway has to handle the situation when a Repertoire can have either an array of sample objects or an array of data processing objects (or both), and it is very unclear to us when this would occur, how this should be presented to the user, and how queries about the rearrangements in such a Repertoire should be generated.

The array of sample objects is useful for display/query purposes on the repertoire metadata, but becomes irrelevant when querying rearrangements because those samples all collapse into a single repertoire_id.

The array of data processing objects is relevant, and needs to be handled because in general, when you query a bunch of studies, they are all going to have different data processing. So how are the users going to decide which ones they want?? This gets to one of the fundamental questions we've been debating in iR+, if everything is processed differently...

bcorrie commented 5 years ago

Given a study that has 10 repertoires. 5 healthy control repertoires and 5 cancer repertoires. They all have a single data_processing object.

Do you mean that there is:

one data_processing object (and therefore one data_processing_id) in the entire study
all 10 repertoires have a single data_processing object
all 10 repertoires refer to the same data_processing object by referring to the same data_processing_id

In this case, all the rearrangements in this study also have the same data_processing_id.

Correct??? 8-)

schristley commented 5 years ago

Do you mean...

Yes to all. I kept it simple. Did you understand my point?

Now if you do a query on the rearrangements using ONLY the data_processing_id, you will get rearrangements for all 10 repertoires, which is wrong. The only way to get the correct rearrangements is to query on those 5 repertoire_ids AND the data_processing_id.

Unless you are going to be pedantic and say "you don't need AND data_processing_id in that case because there is only one" then I would say yes yes that isn't the point I was trying to get across.

bcorrie commented 5 years ago

Do you mean...

Yes to all. I kept it simple. Did you understand my point?

Yes, but I think this is where my confusion originally stemmed from and is similar to the reason why I was suggesting that we should change it so data_processing_id be unique to the repository. The uniqueness criteria of these _id fields are still very fuzzy.

Your example above, as I described it, uses a single data_processing_id to be referred to by several independent repertoires, which requires a data_processing_id that is unique across the repository. The current spec/docs do not allow for this. It doesn't stop you using the same data_processing_id for multiple repertoires, but it doesn't enforce the fact that they are the same nor does it restrict another repertoire from reusing the same data_processing_id for a completely different data_processing process (http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html):

The data_processing_id is only unique within a Repertoire so repertoire_id should first be used to get the appropriate Repertoire object and then data_processing_id used to acquire the appropriate DataProcessing.

With our current definition of requiring a data_processing_id to be unique within a repertoire, your example above works because it is the repertoire_id, data_processing_id pair that is unique. The fact that the data_processing_id is the same across them all doesn't really have an impact. If this is the case, the argument for making it unique across the repository is probably not that important...

I think what I was looking for in suggesting uniqueness for data_processing_id was a unique repository wide identifier for each repertoire_id, data_processing_id pair. I was looking for a single _id that I could use to get all of the rearrangements for a specific repertoire and a specific data processing as applied to that repertoire VERY efficiently. As you say, that is not what a data_processing_id is!

In hindsight, I think it best to leave that optimization to being an internal repository optimization if desired/required. A repository can implement having unique data_processing_ids (I think VDJServer does/will). A specific researcher could build a single data_processing object and reuse it. And a repository could create an internal compound index on repertoire_id, data_processing_id to optimize rearrangement lookups.
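As a rough illustration of the compound index idea, here is an in-memory Python sketch. It is not how any actual repository implements this; the records and the helper `build_pair_index` are hypothetical, and a real repository would use its database's own compound index.

```python
from collections import defaultdict

# Hypothetical rearrangement records held by a repository.
rearrangements = [
    {"rearrangement_id": "r1", "repertoire_id": "rep-1", "data_processing_id": "1"},
    {"rearrangement_id": "r2", "repertoire_id": "rep-1", "data_processing_id": "2"},
    {"rearrangement_id": "r3", "repertoire_id": "rep-2", "data_processing_id": "1"},
]


def build_pair_index(records):
    """Index rearrangement_ids by the (repertoire_id, data_processing_id) pair."""
    index = defaultdict(list)
    for rec in records:
        key = (rec["repertoire_id"], rec["data_processing_id"])
        index[key].append(rec["rearrangement_id"])
    return index


pair_index = build_pair_index(rearrangements)

# data_processing_id "1" is reused by two repertoires, yet each pair is unambiguous.
rep1_dp1 = pair_index[("rep-1", "1")]
rep2_dp1 = pair_index[("rep-2", "1")]
```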

I don't think the spec and the API are the places to enforce any of these. Maybe we don't need to change how data_processing_id is defined.

bcorrie commented 5 years ago

The array of sample objects is useful for display/query purposes on the repertoire metadata, but becomes irrelevant when querying rearrangements because those samples all collapse into a single repertoire_id.

Can you give me a concrete example of how this would be used? I don't follow the use case of when you would have multiple samples in a Repertoire and how one would map rearrangements to that repertoire... I understand the use case of multiple data_processing objects in a Repertoire, but not the multiple sample objects in a Repertoire.

schristley commented 5 years ago

Yes, but I think this is where my confusion originally stemmed from and is similar to the reason why I was suggesting that we should change it so data_processing_id be unique to the repository.

Okay, good, I was having difficulty coming up with a clear example to explain that it was immaterial whether data_processing_id was unique to the repository or not.

schristley commented 5 years ago

because it is the repertoire_id, data_processing_id pair that is unique.

Correct, and that is the case for the other _ids as well: clone_id, cell_id and pair_id.

In some sense, rearrangement_id could be like those as well but because we have an explicit API entrypoint for it, it needs to be unique at the repository level.

schristley commented 5 years ago

Can you give me a concrete example of how this would be used?

It's a contrived example though not completely crazy. Let's say a study with one subject where the patient goes through a treatment. Initially a single blood draw which is sequenced and becomes a single pre-treatment sample. So at this point, we have a single repertoire with a single sample.

Some time later the patient is treated, and at that time another blood draw is taken, but also a tissue sample is taken, both are sequenced. In particular, the tissue sample has a disease_state_sample: cancer, while the two blood samples have disease_state_sample: null because the "histopathologic evaluation" indicates the blood is normal.

Now the researcher wants to analyze all three samples together, say to extract common clones, so creates a single repertoire object with three samples. Very concisely the repertoire looks like this:

repertoire:
  repertoire_id: some-id
  sample:
    - sample_id: blood pre-treatment
    - sample_id: blood post-treatment
    - sample_id: tissue post-treatment
      disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true

The study is published, the data is made public. Now somebody comes along and does a query for repertoire with cancer samples, something like this:

{"filters": {"op": "=", "content": {"field": "sample.disease_state_sample", "value": "cancer"}}}

So I hope you agree that this repertoire will show up in the query results.

Now if that person looks at the repertoire, the UI will show them it has three samples, and they look at them in detail and say oh, its two blood samples and one tissue sample combined together for analysis. Then they make some decision on whether they want to use the rearrangements from that repertoire or not. If they do, then they query the rearrangement entrypoint with the repertoire_id and the data_processing_id.
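The workflow just described can be sketched in Python. The repertoire is the one from the YAML above written as a dict (with data_processing_id as a string and the blood samples' disease_state_sample as None, per the description); the helper names are hypothetical and the matching logic is only an approximation of what a repository would do.

```python
# The repertoire from the example above, written as a Python dict.
repertoire = {
    "repertoire_id": "some-id",
    "sample": [
        {"sample_id": "blood pre-treatment", "disease_state_sample": None},
        {"sample_id": "blood post-treatment", "disease_state_sample": None},
        {"sample_id": "tissue post-treatment", "disease_state_sample": "cancer"},
    ],
    "data_processing": [
        {"data_processing_id": "1", "primary_annotation": True},
    ],
}


def matches_sample_field(rep, field, value):
    """True if any sample in the repertoire has `field` equal to `value`."""
    return any(s.get(field) == value for s in rep["sample"])


def rearrangement_keys(rep):
    """(repertoire_id, data_processing_id) pairs to use at the rearrangement level."""
    return [(rep["repertoire_id"], dp["data_processing_id"])
            for dp in rep["data_processing"]]


is_hit = matches_sample_field(repertoire, "disease_state_sample", "cancer")
keys = rearrangement_keys(repertoire)
```

Note that the resulting keys carry no sample_id: once the repertoire matches, all three samples are addressed together.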

Is this an example you are looking for?

bcorrie commented 5 years ago

Yes, that is great... thanks... We are trying to determine what level of data should appear on what we used to call our "samples" page and is now called our "repertoire" page. Currently, we display samples on the repertoire page... This is fine at the moment, because all of our data has one sample per repertoire and one data processing per repertoire.

But there are a bunch of different ways you could handle that in the general schema case...

one row per repertoire
one row per sample
one row per data_processing
one row per data_processing/sample pair

Once we get that sorted, we then need to figure out what to do to get the rearrangements for the entities that you decide you are interested in from the above list. Essentially, we need to generate a query with repertoire_ids and data_processing_ids. I think I can see how you would do most combinations above, but...

Let's say I am a researcher and I want data from blood samples where the disease state of the sample is cancer and I want post treatment data only. So I only want the rearrangements from one of the samples in the Repertoire. In this example, I don't see a way to do that by querying the rearrangements API entry point... Even a repertoire_id/data_processing_id pair does not allow me to differentiate the rearrangements between the samples, so I can't get just the rearrangements from the blood post treatment sample...

schristley commented 5 years ago

So I only want the rearrangements from one of the samples in the Repertoire. In this example, I don't see a way to do that by querying the rearrangements API entry point... Even a repertoire_id/data_processing_id pair does not allow me to differentiate the rearrangements between the samples, so I can't get just the rearrangements from the blood post treatment sample

Correct, in general that is not possible. That's all the gory details from #181 (if you re-read that, change the old "rearrangement set" terminology to "data_processing_id")

bcorrie commented 5 years ago

Nooooooooooo, not #181

Maybe this is why I have been so confused... In jumping to the end of #181 we discuss having a sample_processing_id, and for each rearrangement I suggested having "... (three identifiers, RepertoireID, SampleProcessingID, and SoftwareProcessingID)" This seemed to have pretty general consensus.

In our current spec we have repertoire_id (RepertoireID) and data_processing_id (SoftwareProcessingID). What happened to SampleProcessingID? At the end of the issue you mentioned having a sample_processing_id that would be sufficient for many cases, but we don't have that in our current spec? I think this needs to be added, no?

schristley commented 5 years ago

What happened to SampleProcessingID?

That was put in to make you happy because you kept demanding to be able to do this mapping. I'm not interested but feel free to spec/doc/code it out.

As you say:

all of our data has one sample per repertoire and one data processing per repertoire

so how often do you intend to use it?

schristley commented 5 years ago

pair_id, clone_id, and cell_id should probably be unique at least within a repertoire_id/data_processing_id pair

so we got distracted by data_processing_id and didn't discuss these.

These are a bit challenging because they are biological ids versus computer data relations, i.e. they represent biological physical things: cells, paired chains and clones.

From a computer perspective, they need to be unique within a repertoire_id just so that they can be stored together under a single repertoire data structure without conflict. However, it is important to note that that uniqueness is not a biological uniqueness. It's perfectly reasonable that the same (say) cell_id is used across multiple repertoires to indicate that it is exactly the same biological cell.
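A small Python sketch of the two views, using invented rows (the identifiers and loci are hypothetical): cell_ids are stored per repertoire, while the same cell_id appearing under two repertoires is read as the same biological cell.

```python
from collections import defaultdict

# Hypothetical rows: cell-A appears in both repertoires.
rows = [
    {"repertoire_id": "rep-1", "cell_id": "cell-A", "locus": "IGH"},
    {"repertoire_id": "rep-1", "cell_id": "cell-A", "locus": "IGK"},
    {"repertoire_id": "rep-2", "cell_id": "cell-A", "locus": "IGH"},
    {"repertoire_id": "rep-2", "cell_id": "cell-B", "locus": "IGH"},
]


def cells_by_repertoire(rows):
    """Storage view: cell_ids grouped under the repertoire that holds them."""
    stored = defaultdict(set)
    for row in rows:
        stored[row["repertoire_id"]].add(row["cell_id"])
    return dict(stored)


def repertoires_sharing_cell(rows):
    """Biological view: cell_ids that occur in more than one repertoire."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["cell_id"]].add(row["repertoire_id"])
    return {cell: reps for cell, reps in seen.items() if len(reps) > 1}


stored = cells_by_repertoire(rows)
shared = repertoires_sharing_cell(rows)
```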

schristley commented 5 years ago

@bcorrie I think this is just a documentation issue now

bcorrie commented 5 years ago

@schristley agreed... I think the following have uniqueness documented in the YAML spec:

sample_processing_id
repertoire_id

The following don't say anything about uniqueness...

data_processing_id - assume this should be unique within the repertoire???
cell_id ???
pair_id ???
clone_id ???

The biological ones I am going to leave to someone else to suggest... 8-)

schristley commented 5 years ago

@bcorrie We can add some info about data_processing_id. The biological ones are in flux, in our recent discussion, it sounds like clone_id might be globally unique, also cell_id might be as well based upon the single cell discussions.

bcorrie commented 5 years ago

Just to confirm, when you say globally unique, you don't mean that each clone_id has to have a unique global identifier (a UUID), but rather that it can be differentiated globally using other information (e.g. clone_id+repository). I think that is true for repertoire_id as well, correct?

If that is the case, should we describe this as unique within the repository?

bcorrie commented 5 years ago

Based on discussion at the CRWG meeting, my suggestion would be to have _id fields that need uniqueness to restrict that uniqueness to the repository (in the case of a repository and in the context of an API call) or to the file (in the case of the file format in the context of processing the data) at least for V1. Global uniqueness I think requires more discussion.

Given the above, I use repertoire_id to make some statements below, please feel free to challenge them... 8-)

Statement 1: For a repertoire JSON file each Repertoire in the file should have a unique repertoire_id.

Statement 2: For an AIRR TSV file, if two rearrangements in a file have the same repertoire_id, then they belong to the same Repertoire.

Statement 3: In order to have a file based AIRR compliant data set, it is necessary to have at least one JSON file for the Repertoire metadata and one AIRR TSV file for the rearrangement data. In these two files, repertoire_id links a rearrangement from the AIRR TSV file to a specific repertoire in the repertoire JSON file.

Statement 4: We currently have no mechanism to link repertoire and rearrangement data files, it is left as an exercise to the user to track which repertoire files relate to which rearrangement files.

Does anyone disagree with the above 4 statements as a baseline?
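Statements 1 to 3 could be checked with something like the following Python sketch. It assumes the repertoire JSON file holds its objects in a list under a top-level "Repertoire" key and that the TSV has a repertoire_id column; the file contents are invented and held in memory, and the function names are hypothetical.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical, heavily trimmed file contents held in memory for illustration.
repertoire_json = json.dumps({"Repertoire": [
    {"repertoire_id": "rep-1"},
    {"repertoire_id": "rep-2"},
]})
rearrangement_tsv = (
    "sequence_id\trepertoire_id\n"
    "s1\trep-1\n"
    "s2\trep-1\n"
    "s3\trep-3\n"
)


def duplicate_repertoire_ids(json_text):
    """Statement 1: report repertoire_ids that occur more than once in the file."""
    reps = json.loads(json_text)["Repertoire"]
    counts = Counter(rep["repertoire_id"] for rep in reps)
    return sorted(rid for rid, n in counts.items() if n > 1)


def link_rearrangements(json_text, tsv_text):
    """Statements 2 and 3: group rows by repertoire_id and flag unlinked ids."""
    known = {rep["repertoire_id"] for rep in json.loads(json_text)["Repertoire"]}
    linked, unlinked = {}, set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        rid = row["repertoire_id"]
        if rid in known:
            linked.setdefault(rid, []).append(row["sequence_id"])
        else:
            unlinked.add(rid)
    return linked, unlinked


duplicates = duplicate_repertoire_ids(repertoire_json)
linked, unlinked = link_rearrangements(repertoire_json, rearrangement_tsv)
```

Which JSON file goes with which TSV file is still supplied by the caller here, which is exactly the gap Statement 4 describes.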

schristley commented 5 years ago

Just to confirm, when you say globally unique, you don't mean that each clone_id has to have a unique global identifier (a UUID), but rather that it can be differentiated globally using other information (e.g. clone_id+repository). I think that is true for repertoire_id as well, correct?

@bcorrie I mean neither exactly. I mean it is a unique global identifier, but it doesn't have to be a UUID per se. It can be defined like you suggest (id+repository).

If we don't make repertoire_id (and clone_id ?) globally unique, it's going to introduce a large number of problems for end users. They won't be able to combine data from multiple repositories without doing a bunch of extra work like:

verify the repertoire_ids are unique
If they are not unique, come up with new unique ids for them and assign them so that tools don't get confused about which is which.
Rewrite all the rearrangement data with the new unique repertoire ids.
because those unique ids are different, you cannot use them to go back to the repository, so maintain a mapping file from your unique ids back to their original ids in the repository.
Likewise you also cannot report your new ids in publications, etc., so you need to map back to the original if you are going to publish them.
and so on...

And it's fairly easy for the repository to make their ids unique, so I think that little extra work on the repository side has big benefits for users.
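A rough sketch of the extra work listed above, assuming each data set is just a list of rearrangement rows carrying a repertoire_id. The function name, the source labels, and the "<source>:<id>" renaming scheme are all made up for illustration; a real workflow would also have to rewrite the repertoire metadata.

```python
def merge_with_unique_ids(
    datasets: dict[str, list[dict]],
) -> tuple[list[dict], dict[str, tuple[str, str]]]:
    """Merge rearrangement rows from several sources.

    Colliding repertoire_ids are rewritten as "<source>:<original id>", and a
    mapping from each id in the merged data back to (source, original id) is
    returned. Assumption: no original id already looks like "<source>:<id>".
    """
    # Verify uniqueness: find ids that occur in more than one source.
    owners: dict[str, set[str]] = {}
    for source, rows in datasets.items():
        for row in rows:
            owners.setdefault(row["repertoire_id"], set()).add(source)
    colliding = {rid for rid, sources in owners.items() if len(sources) > 1}

    merged: list[dict] = []
    mapping: dict[str, tuple[str, str]] = {}
    for source, rows in datasets.items():
        for row in rows:
            original = row["repertoire_id"]
            # Assign a new id where needed and rewrite the rearrangement row.
            new_id = f"{source}:{original}" if original in colliding else original
            merged.append({**row, "repertoire_id": new_id})
            # Keep the way back to the original id for queries and publications.
            mapping[new_id] = (source, original)
    return merged, mapping


# Hypothetical data sets from two sources that both use repertoire_id "1".
datasets = {
    "adc": [{"repertoire_id": "1"}],
    "collab": [{"repertoire_id": "1"}, {"repertoire_id": "s2"}],
}
merged, mapping = merge_with_unique_ids(datasets)
print([row["repertoire_id"] for row in merged])
print(mapping)
```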

bcorrie commented 5 years ago

@bcorrie I mean neither exactly. I mean it is a unique global identifier, but it doesn't have to be a UUID per se. It can be defined like you suggest (id+repository).

It just isn't very clear to me how we do this without either enforcing a UUID (something that is truly globally unique) or coming up with a concise definition of how one makes something like "repertoire_id" globally unique (which seems really hard in the general case). This is hard because repertoire JSON files and their associated rearrangement TSV files might come from many sources. Something like id+repository only works in the case of a limited context like that of the AIRR Data Commons. Not sure if you were thinking to extend this concept beyond the AIRR Data Commons repositories, but I think we need to think of this from a file format data sharing perspective and not just the ADC API query perspective.

A concrete example... Let's say I get three data sets (all AIRR compliant) from three sources on which I want to perform an analysis:

I get one data set that I found in an AIRR Data Commons repository and I download it using the ADC API (as a repertoire JSON and rearrangement TSV download)
I get one data set from researcher R1, who is a collaborator of mine. She generated that data with a set of AIRR compliant tools, and the data is therefore in a repertoire JSON file and a set of AIRR TSV files, one per repertoire. She shares the data files with me and I have downloaded those files.
I have my own data that I produced, and being AIRR compliant in dealing with my own data I store the repertoire metadata in a repertoire JSON file and an AIRR TSV file. I have a single JSON file and a single AIRR TSV file with all the data in a single file.

In each of the above cases it is absolutely necessary to have repertoire_ids be distinct within each of the three data sets above (this gets to my definition above). I think we all agree on this level of uniqueness for things like repertoire_id. The question is, are we suggesting that every repertoire_id in the above three data sets needs to be unique, making it possible to compare the data sets directly without processing?

Certainly, this analysis would be easier to do if each repertoire_id was globally unique. At the same time, it seems overly onerous to have to enforce this in each of the above cases??? Global uniqueness of repertoire_ids would mean that a tool that produces AIRR compliant repertoire JSON and rearrangement TSV files would need to generate globally unique repertoire_ids. It also means that if I (as a researcher) use AIRR JSON and TSV to store my AIRR-seq data I also have to come up with a globally unique repertoire_id for my data to be AIRR compliant.

This seems very onerous at the tool level and/or at the individual researcher level (not sure if you were suggesting we go that far). At the same time, without enforcing this at all levels, it doesn't help much if the ADC data has global uniqueness, as the other two data sets don't. So we still have to do the processing you were talking about to compare data. If we can't ensure global uniqueness, then it feels to me like we are better off to recommend "data set" uniqueness (e.g. all data in a repository, all data stored for a researcher) for things like repertoire_id and be clear to researchers that there is some work to do if you want to compare data across AIRR JSON/TSV data sets.

bcorrie commented 5 years ago

verify the repertoire_ids are unique

If they are not unique, come up with new unique ids for them and assign them so that tools don't get confused about which is which.

Rewrite all the rearrangement data with the new unique repertoire ids.

These feel like standard data processing that needs to be performed when you are federating data. I agree that uniqueness would make this easier, but guaranteeing global uniqueness is difficult, in particular if you consider comparing AIRR JSON/TSV files from outside the AIRR Data Commons to those accessed by the ADC API.

bcorrie commented 5 years ago

because those unique ids are different, you cannot use them to go back to the repository, so maintain a mapping file from your unique ids back to their original ids in the repository.

Likewise you also cannot report your new ids in publications, etc., so you need to map back to the original if you are going to publish them.

These feel to me like data/analysis provenance, data publication (DOIs), and scientific reproducibility issues - which are extremely important, but seem to me should be deferred to a later version of the specification. How to do this properly I think is quite complex and needs a fair bit of consideration.

bcorrie commented 5 years ago

And it's fairly easy for the repository to make their ids unique, so I think that little extra work on the repository side has big benefits for users.

I agree that this might be straightforward (not 100% sure about that even), but I am not convinced that it solves the general problem where you get data from an ADC repository and AIRR compliant data from another non ADC API source and want to compare the data. 8-)

schristley commented 5 years ago

extend this concept beyond the AIRR Data Commons repositories

I'm primarily concerned with the AIRR Data Commons, the ADC API and AIRR compliant repositories. This is what we have control over, and my comments about being "fairly easy" are in that context. You are correct that trying to define globally unique for all possible scenarios is difficult, and enforcing that doubly so. However, we can enforce it for the AIRR Data Commons because it is a "gated" community. A repository has to conform to some specifications before it can be stamped compliant. We have (supposedly) a registry of all these repositories, so we can check and enforce. We can coordinate so that everybody has unique ids.

I consider the situation as not much different from the current INSDC databases, both NCBI and EBI allow submission of data and they also share data between them, so they've clearly worked out a scheme to not generate conflicting id numbers.

When it comes to a researcher using tools on their data, they have control over that. They can pick simple repertoire_id's knowing the data won't be mixed with other data. If they find later that's a problem, they can reassign them to avoid conflict.

The issue is the researcher has no control over the AIRR data commons. If a repository returns identifiers that conflict with another repository, I feel that is a flaw in the AIRR Data Commons, it's not the fault of the researcher, and it seems onerous that the researcher is forced to check and fix that flaw themselves.

It's worth remembering that repertoire_id was added by CRWG (for the ADC API). MiAIRR didn't need it, the DataRep AIRR TSV initially considered it a database field that was irrelevant to file-based analysis. Even now, repertoire_id isn't really necessary for file-based command line tools. It is still primarily a field returned by the ADC API for linking metadata and rearrangement annotations.

@bcorrie Now maybe it's better to say "unique among all AIRR Data Commons repositories" instead of saying "globally unique"?

bcorrie commented 4 years ago

It's worth remembering that repertoire_id was added by CRWG (for the ADC API). MiAIRR didn't need it, the DataRep AIRR TSV initially considered it a database field that was irrelevant to file-based analysis. Even now, repertoire_id isn't really necessary for file-based command line tools. It is still primarily a field returned by the ADC API for linking metadata and rearrangement annotations.

That is true, but there used to be a rearrangement_set_id (v1.2.1) - certainly in the DataRep AIRR TSV. This grouped a set of rearrangements that needed to be grouped together in a file, typically grouping a set of rearrangements that belonged to a specific biological sample that had gone through a specific set of sample processing and data processing steps. This was required.

We now have three ID fields that replace that single ID so that you can choose all of the rearrangements that belong to either a specific repertoire, a specific sample processing regime, or a specific data processing regime. So something like repertoire_id was always required, and it needed to uniquely identify all rearrangement in a set in a file.

So at least the AIRR TSV file format needs some sort of ID like this. We just have three of them now (repertoire_id being one of them) rather than one so we can differentiate on several levels. repertoire_id, combined with one or both of sample_processing_id or data_processing_id replace the functionality of rearrangement_set_id from an AIRR TSV file perspective, no? (See https://github.com/airr-community/airr-standards/issues/246#issuecomment-531033147)

@bcorrie Now maybe it's better to say "unique among all AIRR Data Commons repositories" instead of saying "globally unique"?

Isn't the ADC API primarily an API for querying a single repository? I don't think of it as an API for the entire AIRR Data Commons, I think of it as an API for querying a single repository in the AIRR Data Commons. It seems to me that any uniqueness criteria for a call against an API for a specific repository/service should either be globally unique (which I would prefer not - at least not yet - as I think this requires more thought) or unique within that repository only. The most straightforward thing to do is to make the repository responsible for making sure repertoire_id is unique within that repository, at least for V2.0 of the release.

If we require uniqueness within the AIRR Data Commons, I think we have to provide either a documented mechanism on how to implement that uniqueness or an actual mechanism (a web service) to acquire a unique ID that is known to be unique in the AIRR Data Commons. Given that repositories that implement the API might come and go I think this is quite challenging.

schristley commented 4 years ago

Isn't the ADC API primarily an API for querying a single repository? I don't think of it as an API for the entire AIRR Data Commons, I think of it as an API for querying a single repository in the AIRR Data Commons.

There is nothing in the OpenAPI spec that allows a uniqueness criterion to be specified on a field, so what we are talking about is beyond the API. It is part of the extra stuff a repository needs in order to be AIRR-compliant and part of the AIRR Data Commons. Just because a repository implemented the ADC API, that doesn't magically make them AIRR-compliant and part of the AIRR Data Commons. The ADC API is only one thing, and there are other "community norms" that a repository must conform to in order to be part of the AIRR Data Commons. In particular, look at Recommendation 8 which includes the clause:

a system for assigning unique identifiers that ensures coordination among repositories/registries, for example, the system used by the OBO Foundry to coordinate ontology term identifiers across orthogonal ontologies.

If we require uniqueness within the AIRR Data Commons, I think we have to provide either a documented mechanism on how to implement that uniqueness or an actual mechanism (a web service) to acquire a unique ID that is known to be unique in the AIRR Data Commons. Given that repositories that implement the API might come and go I think this is quite challenging.

Right now, we have a documented mechanism, we can make that more precise. I'm still not sure why you think it is so challenging. Either you are over-complicating it or are being too expansive. The simplest technique (which is what is documented) is to use a repository unique prefix code, like "ipa" or "vdjs" or something, then attach that to repository unique number or code, so "vdjs-1", "vdjs-2" and so on.

Register your data repository with the AIRR Community. Suggest a repository unique prefix.
As we have a "registry" of repositories in the AIRR Data Commons, compare the prefixes to ensure they are unique. Record the new repository with its assigned prefix.
If we ever find repertoires with identical ids, we should work with each data repository to resolve the conflict.
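The registration steps above could be sketched roughly as follows. This is only an illustration: PrefixRegistry is a hypothetical in-memory class, not an existing AIRR Community service, and the repository names are placeholders. Only the "vdjs-1" style of id comes from the scheme described above.

```python
class PrefixRegistry:
    """Hypothetical in-memory registry of repository prefixes."""

    def __init__(self) -> None:
        self._prefixes: dict[str, str] = {}  # prefix -> repository name

    def register(self, prefix: str, repository: str) -> None:
        """Record a suggested prefix, refusing one that is already assigned."""
        if prefix in self._prefixes:
            raise ValueError(
                f"prefix {prefix!r} is already assigned to {self._prefixes[prefix]!r}"
            )
        self._prefixes[prefix] = repository

    def make_id(self, prefix: str, local_number: int) -> str:
        """Build an id such as 'vdjs-1' from a registered prefix and a local number."""
        if prefix not in self._prefixes:
            raise KeyError(f"prefix {prefix!r} is not registered")
        return f"{prefix}-{local_number}"


registry = PrefixRegistry()
registry.register("vdjs", "repository one")  # placeholder repository name
registry.register("ipa", "repository two")   # placeholder repository name
print(registry.make_id("vdjs", 1))  # vdjs-1
print(registry.make_id("vdjs", 2))  # vdjs-2
```

Uniqueness within a repository is still the repository's job; the registry only guarantees that two repositories never share a prefix.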

schristley commented 4 years ago

A related issue came up in #320 with new schema objects. While having an identifier such as repertoire_id allows the data to be linked, that repertoire_id doesn't indicate which ADC repository the data came from (assuming it came from one). The same will be true for other identifiers like rearrangement_id, cell_id, clone_id, etc.

For example, if a user gets a file containing Cell data, but is missing the rearrangement data, in theory the user could query the data repository with the rearrangement_ids in the Cell data to get it, but the user needs to know which data repository to query, which isn't provided in our schema.

bcorrie commented 4 years ago

Right now, we have a documented mechanism, we can make that more precise. I'm still not sure why you think it is so challenging. Either you are over-complicating it or are being too expansive. The simplest technique (which is what is documented) is to use a repository unique prefix code, like "ipa" or "vdjs" or something, then attach that to repository unique number or code, so "vdjs-1", "vdjs-2" and so on.

I think the problem is we are combining multiple roles for repertoire_id. If we need to differentiate such things in an API response, I would prefer to have a separate field in the ADC API response rather than conflate the repertoire_id to capture two different concepts. In the model you are suggesting you are combining the bioinformatic concept of Repertoire with the technology concept Repository. This seems very messy to me...

The ADC API could just as easily have a separate field in the response that provided this information that looked something like this:

"Repertoire": [
  {"repertoire_id":"4357957907784536551-242ac11c-0001-012","repository_id":"vdjs1", ...}
]

and

  "Rearrangement":
  [
    {
      "rearrangement_id":"5d6fba725dca5569326aa104",
      "repertoire_id":"1841923116114776551-242ac11c-0001-012",
      "repository_id":"vdjs1",
      "... remaining fields":"snipped for space"
    }
  ]

I don't think we want repositories and API responses changing fields in the specification, in particular changing fields that might be provided by a researcher.

For example, think of this from a DataRep perspective. I, as a researcher, want to use AIRR Repertoire JSON and Rearrangement TSV files to document a study (much like you have done for the Florian study). I want to use standards to document my study in an AIRR compliant way, in particular so I can use AIRR compliant tools to process my data. I manually choose repertoire_id names that are meaningful to me as a researcher. They are unique in my study, and allow me to map rearrangements in my Rearrangement TSV files to my repertoire metadata in my Repertoire JSON file.

Using the AIRR formats in this use case scenario doesn't require any change from a researcher. In fact, they can go from this simple use case all the way to loading the data into an ADC repository and operating on federated data transparently, without any of the Repertoire metadata needing to change. The only change required to be able to work on federated data globally is the addition of another field.
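As a sketch of what "the addition of another field" might look like on the client side: the snippet below leaves repertoire_id untouched and keys federated repertoires on the pair (repository_id, repertoire_id). Note that repository_id is only the field proposed in this comment, not something in the current schema, and the federate function and its input shape are hypothetical.

```python
def federate(responses: dict[str, list[dict]]) -> dict[tuple[str, str], dict]:
    """Index repertoires from several repositories by (repository_id, repertoire_id).

    `responses` maps a repository_id (the proposed extra field) to the list of
    Repertoire objects that repository returned. repertoire_id is never rewritten.
    """
    index: dict[tuple[str, str], dict] = {}
    for repository_id, repertoires in responses.items():
        for repertoire in repertoires:
            key = (repository_id, repertoire["repertoire_id"])
            index[key] = {**repertoire, "repository_id": repository_id}
    return index


# Hypothetical responses in which two repositories reuse the same repertoire_id.
responses = {
    "vdjs1": [{"repertoire_id": "1"}],
    "other": [{"repertoire_id": "1"}, {"repertoire_id": "2"}],
}
index = federate(responses)
print(sorted(index))
```

The researcher's own repertoire_id values survive unchanged, and the conflict between the two repositories is resolved by the extra field alone.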

In fact, if we really wanted to do this right, we would have a DOI for each AIRR Repository (make that a condition of being AIRR compliant) and then we could have:

"Repertoire": [
  {
      "repertoire_id":"4357957907784536551-242ac11c-0001-012",
      "repository_doi":"https://doi.org/10.25504/FAIRsharing.ekdqe5", ...
  }
]

schristley commented 4 years ago

In fact, if we really wanted to do this right, we would have a DOI for each AIRR Repository (make that a condition of being AIRR compliant) and then we could have:

Yeah, after #320 I've started thinking this route too. Though my thought was to provide a DOI for the repertoire versus a DOI for the repository

"Repertoire": [
  {
      "repertoire_id":"4357957907784536551-242ac11c-0001-012",
      "repertoire_doi":"https://vdjserver.org/airr/v1/4357957907784536551-242ac11c-0001-012", ...
  }
]

I'm not sure which is better. The important thing is that the fully qualified URL is available or can be constructed (we would need to document exactly how to do that).
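For what it's worth, constructing the fully qualified URL could be as simple as the sketch below. It assumes the pattern from the example above (a repository base URL followed by the repertoire_id), which is not a documented rule, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote


def qualified_repertoire_url(base_url: str, repertoire_id: str) -> str:
    """Join a repository base URL and a repertoire_id into one URL."""
    return f"{base_url.rstrip('/')}/{quote(repertoire_id, safe='')}"


def split_qualified_url(url: str) -> tuple[str, str]:
    """Recover (base URL, repertoire_id) from a URL built by the function above."""
    base, _, encoded_id = url.rpartition("/")
    return base, unquote(encoded_id)


url = qualified_repertoire_url(
    "https://vdjserver.org/airr/v1/", "4357957907784536551-242ac11c-0001-012"
)
print(url)
print(split_qualified_url(url))
```

If the URL can always be constructed like this, then storing only repertoire_id plus one repository-level base URL would be enough, rather than a full URL per record.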

Regardless, we still haven't resolved the issue that repertoires downloaded from two different repositories may have repertoire_ids that conflict.

We should discuss this in the CRWG meeting tomorrow and see if we can come to a solution.

bcorrie commented 4 years ago

Yeah, after #320 I've started thinking this route too. Though my thought was to provide a DOI for the repertoire versus a DOI for the repository

That is a lot of DOIs 8-)

bcorrie commented 4 years ago

Regardless, we still haven't resolved the issue that repertoires downloaded from two different repositories may have repertoire_ids that conflict.

My thought is that it is OK for repertoire_ids to conflict if we have another field for the AIRR Data Commons that makes a repertoire unique "globally" (at least unique in the ADC). repertoire_id is part of the informatic data model (a "DataRep" thing) and is something you need to make a study describable using the AIRR Standards. In this case, you don't need something globally unique.

If you are working at the AIRR Data Commons level and federating data from all over the place, then repertoire_doi (or whatever we call it) is the ADC thing that is necessary for the ADC to work. My concern is overloading one field to serve both purposes...

schristley commented 4 years ago

That is a lot of DOIs 8-)

haha true! Though I was meaning DOI in the general context of a digital object identifier and not the doi.org service...

So actually then repertoire_doi is semantically confusing as repertoire_id is the actual digital object identifier.

I think the problem is we are combining multiple roles for repertoire_id.

That wasn't my intent. I was just suggesting a scheme to construct a global identifier, similar to how SRA and ENA co-exist. Both accept raw sequence data, SRA prefixes its identifiers with SRP while ENA prefixes with ERP.

My thought is that it is OK for repertoire_ids to conflict if we have another field for the AIRR Data Commons that makes a repertoire unique "globally" (at least unique in the ADC). repertoire_id is part of the informatic data model (a "DataRep" thing) and is something you need to make a study describable using the AIRR Standards. In this case, you don't need something globally unique.

If you are working at the AIRR Data Commons level and federating data from all over the place, then repertoire_doi (or whatever we call it) is the ADC thing that is necessary for the ADC to work. My concern is overloading one field to serve both purposes...

Just to be clear, repertoire_id was devised by CRWG, not DataRep, not MiAIRR. So no, it's not a "DataRep thing". I understand how it seems that way now, and maybe you are right that it's been "taken over" by DataRep and used for a different purpose, but it was initially created as an "ADC thing that is necessary for the ADC to work." But it's not published yet and made into the standard, so CRWG can still decide what its purpose is and make any changes.

Now did we make an initial mistake with repertoire_id by considering it to be just a simple identifier versus a fully qualified doi? Probably. Maybe you are right and we need two separate fields for two separate purposes. That I'm not so sure about; why not just make repertoire_id a fully qualified name? I never really liked that though because that just seems like a waste of space, especially when talking about rearrangements, but I still think global uniqueness is extremely useful. Even more so, the CRWG recognized that as important, as it put key provisions into the recommendations document for unique identifiers (specifically 8 and 9).

Now IEDB takes the two field approach.

"Epitope ID": 16878
"Epitope IRI": "http://www.iedb.org/epitope/16878"

but their data size is much smaller and they are a centralized database. We need to think a little more carefully about our distributed system as well as the data size.

If every rearrangement record has those two fields, that seems less than ideal.
