airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

mechanism to record raw GEX sequencing files attached to a repertoire? #567

Open schristley opened 2 years ago

schristley commented 2 years ago

At least in the 10x case, getting VDJ is often accompanied by gene expression data. As we are defining a format for the processed GEX data #409 , should we also provide a simple mechanism to record the raw sequence files for GEX? This would be useful when the study comes from an archive like SRA.

bcorrie commented 6 months ago

@schristley would you still consider this v2.0?

schristley commented 6 months ago

@bcorrie Have you been using CellExpression? How have you been annotating the sequence files used for GEX versus the sequence files used for VDJ? For VDJ, we have sequencing_data_id which is "Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should be identified in the CURIE prefix"

bcorrie commented 6 months ago

For study: PRJCA002413 - https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA002413 The run IDs are here: https://ngdc.cncb.ac.cn/gsa/browse/CRA002497

For subject ERS1: Our sequencing_data_id looks like this: "CRR126563, CRR126564, ..." Our sequencing_data_files look like this: "CRR126563_f1.fastq.gz, ..."

image

The AIRR Spec doesn't have CURIE prefixes for the INSDC repositories, as we never went down the path of defining them. And indeed this is from the Chinese equivalent (GSA), so even if we had INSDC curie prefixes they would not be appropriate. So we are just storing the IDs not the CURIEs.

I am hoping that all such studies are curated this well - but this is one we started from the Chinese GSA and did everything ourselves, so we did actually download everything and run it through the 10X pipeline I believe 8-)

bcorrie commented 6 months ago

Hmm, what we don't seem to have captured is the details on how the cellranger pipeline was run. It tells which version was used (cellranger 6.0.1) but doesn't have much detail on how it was run on the CRR files to get the normal 10X files. Our data_processing_protocols says the following, which is starting from where the 10X multi pipeline ends (with the matrix, barcodes, etc files):

Processing by iReceptor: 

To create AIRR Cell files, script 10x-vdj2cell.sh was run on b-cells and t-cells.

To create AIRR GEX files for b-cells and t-cells:  

gunzip $COUNT_MATRIX_DIR/*.gz
tail -n +4 $COUNT_MATRIX_DIR/matrix.mtx > $COUNT_MATRIX_DIR/$SUBJECT-matrix-trimmed.mtx
python3 10x-vdj2GEX.py $VDJ_B_DIR/cell_barcodes.json $COUNT_MATRIX_DIR/barcodes.tsv $COUNT_MATRIX_DIR/features.tsv $COUNT_MATRIX_DIR/$SUBJECT-matrix-trimmed.mtx > $OUT_DIR/$SUBJECT-vdj_b_gex.json
bcorrie commented 6 months ago

Other than that, it seems like it would be pretty easy to reproduce the data we have in the repository...

schristley commented 6 months ago

For study: PRJCA002413 - https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA002413

Ok, it took me a bit to figure out what was going on, but what I'm able to gather is that extra repertoire objects were created to hold the RNA-seq data, like this one:

https://covid19-1.ireceptor.org/airr/v1/repertoire/PRJCA002413-ERS1-TR-CELL

So I guess there must be some special encoding that iReceptor uses to know a repertoire is RNA-seq versus AIRR-seq? The gateway shows 75 repertoires while vdjserver thinks there are 135 repertoires.

Our sequencing_data_id looks like this: "CRR126563, CRR126564, ..."

Ok, well this is not the greatest because those fields are singular value and you are storing arrays in them, but that's better than nothing.

I guess this is the result of the "one repertoire, one sample processing" design? Even though there are 4 separate runs, instead of having 4 sample processing records, they are merged into a single one.

schristley commented 6 months ago

This is getting off topic for the issue, but I've not looked at the 10X studies in iReceptor too closely. I know you've said that you split up the loci into separate repertoires, but if you wanted to put those chains back together, how do you know which repertoires to use? My human brain can guess based upon looking at the IDs, but I don't see how to do it in an automated way except for some rule that is custom to how they were curated.

schristley commented 6 months ago

Getting back to my original question, I was thinking additional custom fields were added to the repertoire to point to the GEX but it looks like additional repertoire objects were created instead. So how do you link up the repertoires that go together?

The only thing I can guess by looking at the data is with the sample_id? But that will not work in general because the same sample can be processed in different ways, for example, do cell sorting to split CD4 vs CD8 on the same ERS1 sample, then do sequencing on both. According to iReceptor curation, that would give 4 repertoires for the four combinations of CD4/CD8 and TRA/TRB. You don't want to mistakenly match up CD4 TRA with CD8 TRB.

schristley commented 6 months ago

@schristley would you still consider this v2.0?

Mostly likely not. I don't see a simple extension of SequencingData or SequencingRun that would work for the majority of cases. The GEX sequencing could be a completely separate SequencingRun with its own values. Furthermore, as SampleProcessing was designed specifically for AIRR-seq, confounding it by also using it for RNA-seq, is not a good idea.

With the AKC project, it will need to extend "sample processing" to encompass the different types of assays from IEDB and others. That can be brought back into AIRR in a more cohesive way, so moving it to that milestone.

bcorrie commented 6 months ago

So I guess there must be some special encoding that iReceptor uses to know a repertoire is RNA-seq versus AIRR-seq? The gateway shows 75 repertoires while vdjserver thinks there are 135 repertoires.

15 subjects 5 (IGH, IGK, IGL, TRA, TRB) = 75 (Rearrangements) 15 subjects 2 (B cell, T cell) = 30 (Clones) 15 subjects * 2 (B cell, T cell) = 30 (Cells)

If you look at the Metadata page for the study it looks like this on the Gateway:

image
bcorrie commented 6 months ago

Our sequencing_data_id looks like this: "CRR126563, CRR126564, ..."

Ok, well this is not the greatest because those fields are singular value and you are storing arrays in them, but that's better than nothing. I guess this is the result of the "one repertoire, one sample processing" design? Even though there are 4 separate runs, instead of having 4 sample processing records, they are merged into a single one.

Yes, but I think we all recognize that the sample_processing/data_processing model is not perfect. We feel that this is better than splitting those up into separate sample processing when it is really the sequence processing that is different (it is the same sample and sample processing over multiple runs). There is no effective difference between these other than the file name and sequence run ID.

bcorrie commented 6 months ago

his is getting off topic for the issue, but I've not looked at the 10X studies in iReceptor too closely. I know you've said that you split up the loci into separate repertoires, but if you wanted to put those chains back together, how do you know which repertoires to use? My human brain can guess based upon looking at the IDs, but I don't see how to do it in an automated way except for some rule that is custom to how they were curated.

The chains are linked by cell_id - the repertoire they are in does not matter.

bcorrie commented 6 months ago

So I guess there must be some special encoding that iReceptor uses to know a repertoire is RNA-seq versus AIRR-seq? The gateway shows 75 repertoires while vdjserver thinks there are 135 repertoires.

The iReceptor Gateway knows that the repertoire contains AIRR-seq data because it searches the repertoire and it finds Rearrangements. If it finds Clones it has Clones, and if it finds Cells it has RNA-seq. The Gateway is just reporting what it finds in the repertoires that meet the search criteria (study_id = PRJCA002413). In this case it found 135 repertoires, it searched them all and it found 75 with Rearrangements, 30 with Clones, and 30 with Cells.

This should work fine if a repertoire has more than one of the above. That is if you have a single repertoire with different sample processing and/or data processing (e.g. one sample/data processing contains Cells and one contains Rearrangements in the same repertoire). I say should because there is no data of this type in the ADC at the moment - so this hasn't been tested...

schristley commented 6 months ago

So I guess there must be some special encoding that iReceptor uses to know a repertoire is RNA-seq versus AIRR-seq? The gateway shows 75 repertoires while vdjserver thinks there are 135 repertoires.

The iReceptor Gateway knows that the repertoire contains AIRR-seq data because it searches the repertoire and it finds Rearrangements. If it finds Clones it has Clones, and if it finds Cells it has RNA-seq. The Gateway is just reporting what it finds in the repertoires that meet the search criteria (study_id = PRJCA002413). In this case it found 135 repertoires, it searched them all and it found 75 with Rearrangements, 30 with Clones, and 30 with Cells.

ohh, I'm afraid to look... ;-) Yeah, the Study object is different from repertoire to repertoire... My guess is keywords_study is being used to differentiate. That's a "gotcha" we'll need to keep in mind when data is integrated into the AKC.

bcorrie commented 6 months ago

So I guess there must be some special encoding that iReceptor uses to know a repertoire is RNA-seq versus AIRR-seq? The gateway shows 75 repertoires while vdjserver thinks there are 135 repertoires.

The iReceptor Gateway knows that the repertoire contains AIRR-seq data because it searches the repertoire and it finds Rearrangements. If it finds Clones it has Clones, and if it finds Cells it has RNA-seq. The Gateway is just reporting what it finds in the repertoires that meet the search criteria (study_id = PRJCA002413). In this case it found 135 repertoires, it searched them all and it found 75 with Rearrangements, 30 with Clones, and 30 with Cells.

ohh, I'm afraid to look... ;-) Yeah, the Study object is different from repertoire to repertoire... My guess is keywords_study is being used to differentiate. That's a "gotcha" we'll need to keep in mind when data is integrated into the AKC.

When we originally curated the data we thought we might need to use that, but we don't. We actually search the repertoire to see what data it has. More accurately, we search for the repertoire_id on each endpoint (/rearrangement, /clone, /cell) and if it returns data then that repertoire_id is reported of having data of that type. That is why I am pretty confident that the Gateway would work if a single Repertoire had multiple sample/data processing (e.g. a Rearrangement and Cell) as the repertoire_id is searched for the two endpoints it would find data in both cases. Which is exactly what you want. We do not rely on any metadata to limit/direct the searches across the endpoints.

The "nice" (???) thing about our un-normalized storage of the data is that it is "not too hard" (???) to normalize the data based on any "structure" you want. So this is indeed what the AKC will need to do. If the AKC wants to have a specific rigorous definition of a Repertoire (e.g. all data from a single subject across all tissues at a single time point) I believe that could be reconstructed from the data as we have curated it in the ADC. Similar if the AKC wanted to have a "sampled-from" relationship between a subject and all samples from that subject, I believe that could be reconstructed (such a relationship can not currently be captured in the ADC). So you can construct (normalize?) the data for a more complex relationship model if you want.

This is essentially what the Gateway does. On the Metadata page, it presents a Repertoire view of the data. If a repertoire has multiple objects in one of its sub-arrays (sample processing or data processing objects), it will combine them into the report on the repertoire (sometimes maybe in confusing ways - e.g. pcr_target - but they are combined in the report).

If you click on the Stats for a repertoire, it presents more of a study view (with the Stats for the Repertoire of interest). If you wanted to represent the study 100% correctly, based on our curation, you would need to get all of the repertoires associated with a study and combine fields that had arrays in them (keywords_study, subject, sample_processing, data_processing, diagnosis, pcr_target) to get an accurate representation of what is in the aggregate object of interest (e.g. Study). One would hope that the non-array fields would all be identical for the higher level objects - all the subjects in the study would have the same study data and all the samples from the same subject would have the same subject data.

Back in the old days, we had a hybrid Study/Repertoire presentation where you could drill down, but we changed that back in v2 of the Gateway in preference to the Repertoire view.

schristley commented 6 months ago

his is getting off topic for the issue, but I've not looked at the 10X studies in iReceptor too closely. I know you've said that you split up the loci into separate repertoires, but if you wanted to put those chains back together, how do you know which repertoires to use? My human brain can guess based upon looking at the IDs, but I don't see how to do it in an automated way except for some rule that is custom to how they were curated.

The chains are linked by cell_id - the repertoire they are in does not matter.

@bcorrie Ok, that might be inefficient for some data sets but I can see it working. So this implies that cell_id is globally unique? If you are matching cell_id across repertoires, those will have different repertoire_id ...

Have we defined the uniqueness level of cell_id? There is a comment near CellExpression with "cell_id is not guaranteed to be unique outside the data processing for a single repertoire.", but I'm not sure I understand what that means. The description of cell_id in Cell is vague about uniqueness.

bcorrie commented 6 months ago

We are in the process of cleaning up these descriptions, using the changes:

https://github.com/airr-community/airr-standards/pull/705/commits/bb32bef80bac190fa2c98c14fe057d21e9b86399

based on the changes made here to CelLReactivity.

https://github.com/airr-community/airr-standards/pull/705/commits/7a4dfcd119cf6a8b649607792d8c722ab9c5659d

I think this is better???