airr-knowledge / issues

Issues and project management for the AKC

Generate/document list of 'fields' for IEDB #4

Closed bpeters42 closed 4 months ago

bpeters42 commented 9 months ago

This is for the IEDB contribution to #3 .

So I think we will have a spreadsheet with something like the following columns: a) field category, b) specific field, c) present in T cell assay, d) documentation in T cell assay, e) present in B cell assay, f) documentation in B cell assay. Does that sound right?
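A minimal sketch of what a spreadsheet with those six columns could look like when generated from two field lists. The field names and documentation strings are hypothetical placeholders, not values from a real IEDB export, and `build_rows` and `to_csv` are illustrative helpers rather than existing tooling.

```python
import csv
import io

# The six columns proposed above (a through f).
COLUMNS = [
    "Field category",
    "Specific field",
    "Present in T cell assay",
    "Documentation in T cell assay",
    "Present in B cell assay",
    "Documentation in B cell assay",
]


def build_rows(tcell, bcell):
    """Build one row per field found in either export.

    tcell and bcell map (category, field) -> documentation string.
    """
    rows = []
    for key in sorted(set(tcell) | set(bcell)):
        category, field = key
        rows.append({
            COLUMNS[0]: category,
            COLUMNS[1]: field,
            COLUMNS[2]: "yes" if key in tcell else "no",
            COLUMNS[3]: tcell.get(key, ""),
            COLUMNS[4]: "yes" if key in bcell else "no",
            COLUMNS[5]: bcell.get(key, ""),
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with the six-column header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical example data, for illustration only.
tcell = {
    ("Epitope", "Name"): "Name of the epitope",
    ("MHC Restriction", "Name"): "Restricting MHC molecule",
}
bcell = {
    ("Epitope", "Name"): "Name of the epitope",
}

rows = build_rows(tcell, bcell)
print(to_csv(rows))
```

Keying each row on the (category, field) pair keeps shared fields on a single line, so the T cell and B cell documentation can be compared side by side.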

We need this by 9/11, ideally earlier, to align with the other projects.

rvita commented 9 months ago

Attached is a sheet comparing fields present in the receptors export to the same fields in the T cell assay export (see the first tab). If this is what you are looking for, let me know and I will do the same for BCRs. I do not know what "d) documentation in T cell assay" means; can you elaborate?

tcell_fields.xlsx

bpeters42 commented 9 months ago

Unfortunately this is not what I meant (or maybe only partially).

I was referring to a sheet with all B cell assay and T cell assay export fields, with the fields aligned where they are shared. That includes e.g. 'immunization'. We had made such a sheet when we did the export redesign. I think it is the best documentation we have for all fields that is understandable for outside users.

The TCR/BCR fields should also be included, and can be in a separate sheet like you have them now.

Overall the goal is to align the fields we capture with what is in the other repositories, and come up with a 'unified common data model'.

Happy to do a call if it still doesn't make any sense.

rvita commented 9 months ago

OK, attached is the sheet, initially generated by Jason, that contains all the db table names and details in columns A, B, C, D and the manually renamed export names in columns R, S, T. However, once Kelly got these on the test site, we likely deviated from this initial design based on testing feedback. This sheet does include the tables CHAIN_NEW, DISTINCT_CHAIN, DISTINCT_RECEPTOR, CURATED_RECEPTOR, DISTINCT_RECEPTOR_RECEPTOR_GRP, and RECEPTOR_GROUP, but we did not work on any of those export names when we made this sheet; they were intentionally skipped as a future task. I'll ask Jason to regenerate this sheet with the current names in the exports, as a starting point.

iedb_field_descriptions_RV.xlsx

bpeters42 commented 9 months ago

We worked on this sheet, cleaned it up, made it consistent across curation / external, and between assay types. That turned into a sheet that had the preferred names of each field to be used in the export along with the documentation that we are now including. I hope I am not just dreaming that. If you can't recall, I will start digging.

bpeters42 commented 9 months ago

Brian just posted the current version of the AIRR data format in the Google Drive: https://drive.google.com/drive/folders/1Xcmx_KYCSKFai1GTyG-qjsG_uTieViyC

Again: I am certain we had done an alignment before (for a previous version).

rvita commented 9 months ago

Bjoern, I've asked Jason & Kelly about a mapping like you describe, and they also only remember the previously attached Excel sheet as a mapping from db table to export name. We also have this ticket about updating the receptor export to use AIRR labels that Lonneke & I worked on: https://gitlab.lji.org/iedb/external/development/database-export-redesign/-/issues/69

At that time, we came up with this sheet, but it only includes receptor export type data, not immunization, etc.

iedb2airr_v2.xlsx

I'll start by looking at the AIRR data format and see what I can put together using all 3 sheets.

bpeters42 commented 9 months ago

I found it. Every IEDB export you run includes a 'help' tab. That comes from the spreadsheet that I was trying to talk about but apparently couldn't communicate. It contains the field category ('section'), specific field ('Header'), examples, definitions and documentation.

For AIRR, we need the 'T cell assay' and 'B cell assay' exports of those fields. We can just copy them in here, but in the source spreadsheet they are aligned, and the sections that are shared between B cell and T cell are identified.

jamesaoverton commented 9 months ago

I ran exports for some arbitrary TCell and BCell searches from IEDB.org, copied the 'Help' sheets, transposed them and merged them in this Google Sheet: https://docs.google.com/spreadsheets/d/1RaknOd97GSomgqqttazGZPQXmQl1pNPkd6KVmGM7kBU/edit#gid=1632698505

I see 160 TCell fields and 131 BCell fields. The BCell fields are a proper subset of the TCell fields, because the TCell export adds fields for some "in vitro" details, effector cells, antigen presenting cells, and MHC restriction.
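A subset claim like this can be checked mechanically once both header lists are available. The sketch below assumes each field is identified by a (section, header) pair; the example values are hypothetical and `compare_fields` is an illustrative helper, not part of any IEDB tooling. If the claim holds for the real exports, the T-cell-only list would contain 160 - 131 = 29 fields.

```python
def compare_fields(tcell_fields, bcell_fields):
    """Compare two collections of (section, header) pairs."""
    tcell, bcell = set(tcell_fields), set(bcell_fields)
    return {
        "shared": sorted(tcell & bcell),
        "tcell_only": sorted(tcell - bcell),
        "bcell_only": sorted(bcell - tcell),
        # On sets, '<' tests for a proper subset:
        # every B cell field is a T cell field, and the sets differ.
        "bcell_is_proper_subset": bcell < tcell,
    }


# Hypothetical example fields, for illustration only.
tcell_fields = [
    ("Epitope", "Name"),
    ("Host", "Name"),
    ("MHC Restriction", "Name"),
    ("Effector Cell", "Cell Type"),
]
bcell_fields = [
    ("Epitope", "Name"),
    ("Host", "Name"),
]

report = compare_fields(tcell_fields, bcell_fields)
print(report["bcell_is_proper_subset"])
print(report["tcell_only"])
```

An empty `bcell_only` list together with a non-empty `tcell_only` list is what a proper subset looks like in this report.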

There are examples, documentation strings, and help links for all the fields.

If this is the right thing, feel free to move it into the shared Google Drive, and let me know what the next steps are.

rvita commented 9 months ago

This is the mapping I've been working on. AIRR_IEDB_maps_RV.xlsx

bpeters42 commented 9 months ago

What James sent is what I was asking for. I uploaded it here: https://docs.google.com/spreadsheets/d/1kMmANqAhg2ujURdRnSZBV-R0KO5Rxv7h

We are now stitching together work Randi did before - but apparently I never was able to communicate what I wanted to her.

The TCR / BCR things will be important later, as will be the mapping to AIRR, so we will want to use Randi's other file at some point.

schristley commented 9 months ago

This is the mapping I've been working on. AIRR_IEDB_maps_RV.xlsx

I've put a copy of this on the google drive