airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Location of germline_database key(s) in AIRR schema #183

Closed bussec closed 4 years ago

bussec commented 5 years ago

The current AIRR schema contains two germline_database keys, one in the Alignment another one in the Rearrangement object. However, this create a lot of redundancy, as usually all the Rearrangements in a Rearrangement_set would use the same reference. Therefore:

  1. Is there any scenerio in which we would need a germline_database key below the rearrangement_set level?
  2. Do we need more that a single reference, e.g. on for software processing and one to annotate a subject's individual germline reference (as suggested by @bcorrie)?

Note that this topic has been touched upon before in a number of other discussions like:

schristley commented 5 years ago
  1. Is there any scenerio in which we would need a germline_database key below the rearrangement_set level?

This would seem to imply that some sequences in a repertoire were processed with one germline DB while other sequences in the same repertoire were processed with a different germline DB, thus you would need to know at the rearrangement level which germline was used. This is technically possible but strikes me as odd or uncommon. One scenario this might happen is if T cell and B cell sequences are mixed together in the repertoire e.g. with RNA-seq or with single cell. RNA-seq is likely outside of our scope but I don't have a good intuition about single cell.

@bussec How do you see single cell in this context? Is a common case to sequence all lymphocytes so you get B cell and T cell together in the sequencing run, or does the single cell protocol in general expect that you pick one or the other?

  1. Do we need more that a single reference, e.g. on for software processing and one to annotate a subject's individual germline reference (as suggested by @bcorrie)?

Maybe not. For VDJServer, the holy grail is to process each repertoire with a subject specific germline to get the most accurate annotation. This might involve other germline databases during software processing, but ultimately I think a single germline DB would be used in the final rearrangement annotation. Any other germline DBs used during software processing can be recorded in the software processing description but I don't think they need to be explicitly listed.

schristley commented 5 years ago
  1. Is there any scenerio in which we would need a germline_database key below the rearrangement_set level?

One scenario is looking at how links will be followed. If the germline_database key is held in the Repertoire somewhere, then that requires valid metadata to be available to follow the links.

From a rearrangement, follow the id into the repertoire to follow the id into the germline database.

Rearrangement -> Repertoire -> GermlineSet

Now if Repertoire metadata is always available, then no problem. The question is do we want to support situations when metadata isn't available, or when people are too lazy to create metadata and just want to process files? Then the current links in Rearrangement allows that to work.

Rearrangement -> GermlineSet

I suppose we can support both, but then we create two decision paths that code has to check to determine how and which link to follow, though we can hide this in the reference library with a get_germline() function.

javh commented 5 years ago

The germline_database key in Alignment and Rearrangement are the same key, because Alignment and Rearrangement are alternative representations at the same level. If that makes any sense... They aren't in a hierarchy and the same solution should apply to both. Alignment is for summarizing multiple (ranked) V(D)J annotation hits and Rearrangement is for representing only the "best" one.

IIRC, the reason that germline_database was set as a per-sequence annotation was because the meta-data schema wasn't complete and this information was required.

I do think germline_database should be at the rearrangement_set (or Repertoire) level. I'm not overly worried about the case of having multiple germline databases within a single rearrangement_set, because I think that can be solved by how narrowly you define the germline_database. Ie, something like "IMGT 2019-03-19" covers both B and T cells, whereas "Bob's IGHV" would not. I don't recall us ever defining how granular the germline_database definition needs to be?

bussec commented 5 years ago

@bussec How do you see single cell in this context? Is a common case to sequence all lymphocytes so you get B cell and T cell together in the sequencing run, or does the single cell protocol in general expect that you pick one or the other?

Most studies will focus on either B or T cells. However, especially once you start using methods other than multiplex PCR (e.g. 5' RACE or whole RNA-seq), there is actually no technical reason not to look at B and T cells in the same experiment.

However, I am not sure whether I completely understand the problem with mixed B/T cell data sets in regard to germline databases: The calls e.g. for IGHV and TRBV can come from the same reference library as they represent different loci and are therefore no mutually exclusive (at least this is how we handle it).

schristley commented 5 years ago

However, I am not sure whether I completely understand the problem with mixed B/T cell data sets in regard to germline databases

The issue comes in during data processing. I'll pick on IgBlast, if you combine the IG and TR germline genes together into a single BLAST db and run IgBlast on a set of mixed sequences, you have the chance of some BCR being aligned to TR gene and some TCR being aligned to IG genes. Now I'm not sure how prevalent this is, or whether the mis-alignments are significant enough to affect downstream analysis, but on VDJServer for example we don't mix the germline genes to avoid these errors.

bussec commented 5 years ago

Ok, this is a good point. In our pipeline we have a common IG/TR reference file and the low number of "cross-overs" we observe turned out to be contamination of the library or errors in MID calling. In those cases it was usually an advantage to have the non-target references in the blastDB, as you will get high confidence calls against the true germline (which you later filter out based on the locus) instead of having to work with e-value/%ID thresholds that are more problematic (especially when working with human B cells). However, I can only speak about a situation which typically gives you 250+ bp of V segment information, it could be different if you are only looking at 40-50 bp.

bcorrie commented 5 years ago

However, I am not sure whether I completely understand the problem with mixed B/T cell data sets in regard to germline databases

The issue comes in during data processing. I'll pick on IgBlast, if you combine the IG and TR germline genes together into a single BLAST db and run IgBlast on a set of mixed sequences, you have the chance of some BCR being aligned to TR gene and some TCR being aligned to IG genes. Now I'm not sure how prevalent this is, or whether the mis-alignments are significant enough to affect downstream analysis, but on VDJServer for example we don't mix the germline genes to avoid these errors.

We have seen similar, using MiXCR, making annotations of IG on TCR datasets because presumably a germline including both was used and the tool was not told to only consider TCR. I think in MiXCR you can use a germline that has both but tell it to only use one or the other, but you need to know that you need to do that (so it is easy to make mistakes). We would need to check with Emily our curator on this for the details...

bcorrie commented 5 years ago

Given the reference above, and looking at this once again, wouldn't it make sense that when you talk about DataProcessing, and where there is a field called software_versions with an example of IgBLAST 1.6 that there should also be a field called germline_database with an example something like IMGT 2019-03-19?

That is, we are saying that the repertoire related to this specific DataProcessing object was produce by igblast version 1.6 using the IMGT Germline DB from 2019-03-19?

Wouldn't you almost always (maybe ALWAYS) need to use a Germline DB of some kind when using an alignment tool, even if it was an internal Germline DB of the tool (for example, maybe the Adaptive case would fit this).

schristley commented 5 years ago

Wouldn't you almost always (maybe ALWAYS) need to use a Germline DB of some kind when using an alignment tool, even if it was an internal Germline DB of the tool (for example, maybe the Adaptive case would fit this).

For an alignment tool, probably yes, but there are tools that generate the germline set de novo from the sequences. Even so, they may use a reference databases, but if they find novel alleles, they have to give them temporary gene symbols.

IIRC, the reason that germline_database was set as a per-sequence annotation was because the meta-data schema wasn't complete and this information was required.

Yeah that's my recollection too. It still leaves open the question of do we eliminate it from Rearrangement or do we leave it there as a back-stop in case there is no repertoire metadata? I think it's okay to leave it, but we probably want to state somewhere that analysis tools should preferentially get this from the repertoire metadata.

It also sounds like germline_database should be attached to DataProcessing object.

schristley commented 4 years ago

germline_database has been added to the DataProcessing object, change is on the metadata-docs branch.

bcorrie commented 4 years ago

@schristley @bussec should this not be added to the MiAIRR standard TSV file as well? I think we want/need this, no?

I have done this (added to level 5) and just made a commit (#613dafe), but now we have two fields with the same name in MiAIRR, which is bad... There is also one at level 6 at the rearrangement level.

Re-asking the question, should we get rid of the rearrangement (level 6) version?

bcorrie commented 4 years ago

Also, is there anything coming out of the Germline DB group that would help us formalize how we can reference a Germline database (or maybe more accurately a "Germline Set")? @schristley I know you were working on this: https://github.com/airr-community/germline-sets

Is this a discussion for another day and we close this issue without sorting it out? Or is this not related?

schristley commented 4 years ago

formalize how we can reference...

Not to my knowledge. There was expression of interest at the last AIRR meeting but nobody has led the effort.

Is this a discussion for another day and we close this issue without sorting it out? Or is this not related?

I consider it an independent question/issue, though related to this and to the schema #157

schristley commented 4 years ago

Re-asking the question, should we get rid of the rearrangement (level 6) version?

That would be my initial suggestion. They aren't really two fields, they are the same field that exists in two places.

bussec commented 4 years ago

Re-asking the question, should we get rid of the rearrangement (level 6) version?

We probably should. IMO this is kind of a retrograde artifact of the GenBank implementation, which requires linking of the germline_database via the /db_xref qualifier.

bcorrie commented 4 years ago

We discussed this at the MinStd meeting this morning, and no one objected to this change. I have removed germline_database from level 6 data, it now is present only in the level 5 data (commit: f21b686)

bcorrie commented 4 years ago

I also removed germline_database from the Rearrangement object in the airr_schema.yaml (aaff67e).

Question: I assume this should also be removed from the Alignment object? I am not too familiar with its purpose...

schristley commented 4 years ago

Question: I assume this should also be removed from the Alignment object?

Yes

bcorrie commented 4 years ago

Done (21ef01e)

bcorrie commented 4 years ago

Travis checks are failing because the AIRR spec in specs is different than the python AIRR spec. How is that change typically managed? Manually?

javh commented 4 years ago

Removing germline_database from Rearrangement and Alignment is a backwards-incompatible change. I agree we should do it, but... I think we'll need to leave the field in these objects until we are ready to do an official release of the metadata spec.

I don't think we ever discussed a formal deprecation process. We should probably have one of those...

Edit: Yeah, specs need to be propagated manually to both the python and R packages.

bcorrie commented 4 years ago

@javh OK to make these changes on metadata-docs branch I assume, then merge of metadata-docs into master is the official release of the new spec (with backwards incompatible change noted)?

schristley commented 4 years ago

formal deprecation process

Yeah that's a good point. I was thinking it was being removed just from the MiAIRR TSV file to avoid the duplicate field but removing it from the actual schema does introduce other issues. As the fields are optional, and it isn't "incorrect" to have a germline_database field in the rearrangements (for example, what if you don't have repertoire metadata?) Maybe we should leave it in?

How about adding a deprecated attribute to the x-airr extensions?

schristley commented 4 years ago

How about adding a deprecated attribute to the x-airr extensions?

sigh, then the validate functions should be updated to check and warning...

javh commented 4 years ago

Is there an official OpenAPI deprecation handling method? A deprecated tag sounds like a good idea to me.

We need to think about timing too... It's a standard, so it can't be changing rapidly on people. We may even have to leave it there until MiAIRR 2.0. I dunno. I think it's a discussion.

schristley commented 4 years ago

official OpenAPI deprecation handling method?

It sounds like no for V2.0 but yes for V3.0

https://stackoverflow.com/questions/49870109/how-to-annotate-a-field-as-deprecated-in-openapi-swagger-2-0

javh commented 4 years ago

sigh, then the validate functions should be updated to check and warning...

That would be the courteous thing to do. Check for the deprecated tag and output a message that says the field won't be validated in a future release.

javh commented 4 years ago

It sounds like no for V2.0 but yes for V3.0

Okay, you're going to have to remind me... Do we need to stick to OpenAPI V2.0?

schristley commented 4 years ago

Do we need to stick to OpenAPI V2.0?

we should to move to V3.0 but it likely needs to be after a 1.3 release for the ADC API. The issue is that various middleware supports V2.0 but not V3.0. As V3.0 is still fairly new, we need to wait until more tools support.