Sage-Bionetworks / cleanAD

Tools for cleaning and organizing study data for the AD Knowledge Portal.

Banner proteomics metadata discrepancies #14

Open avanlinden opened 2 years ago

avanlinden commented 2 years ago

@jgockley62 identified inconsistencies in CERAD scores for individuals from the Banner cohort between the original Banner LFQ proteomics traits file and the new Consensus project TMT proteomics traits for the same samples.

The original Banner study needs updated metadata files that meet our current metadata standards. The original Banner case IDs (individualIDs) were corrupted and lost from the existing traits file, but they can be taken from the consensus project traits file. The CERAD discrepancies are due to a change in how CERAD was evaluated, and a note explaining this should be added to both study descriptions.

avanlinden commented 2 years ago

Jake's original email to Eric Dammer:

Hey Eric,

I was digging around the Banner LFQ/TMT samples and I ran into a bit of a conundrum. The CERAD scores for individuals are quite different between the former LFQ samples versus the new TMT samples.

The LFQ metadata synID we have is syn9740295, and I'm using the new TMT from the consensus paper, located here: syn25006658

I matched the samples from individuals with TMT and LFQ and noticed some discrepancies:

LFQ - syn9740295
table(comp_trial$CERAD)

     -1   0   1   2   3
     36   4  37   5  78

TMT - syn25006658
table(temp_trial$CERAD)

      0   1   2   3
     23  15  25  97

And the comparison also has some discordance beyond a simple adjustment (rows are LFQ, columns are TMT):
table(comp_trial$CERAD, temp_trial$CERAD)

          0   1   2   3
    -1   22  13   0   1
     0    1   1   2   0
     1    0   1  17  19
     2    0   0   5   0
     3    0   0   1  77

Not too sure where the discordance comes from but I thought I'd try and track it down. I cc'd Mette and Abby on our DCC team as they have more info on the LFQ data side.

Best, Jake
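The three tables in Jake's email are mutually consistent, which can be checked with a short sketch. The snippet is in Python rather than the R used in the email, and the variable names are illustrative; the counts are copied from the email. It assumes, following R's `table()` convention, that rows are the LFQ scores and columns are the TMT scores.

```python
# Cross-tabulation from the email: rows = LFQ CERAD, columns = TMT CERAD.
tmt_levels = [0, 1, 2, 3]
crosstab = {
    -1: [22, 13, 0, 1],
    0: [1, 1, 2, 0],
    1: [0, 1, 17, 19],
    2: [0, 0, 5, 0],
    3: [0, 0, 1, 77],
}

# Row sums should reproduce the LFQ one-way table, column sums the TMT one.
lfq_counts = {level: sum(row) for level, row in crosstab.items()}
tmt_counts = {
    level: sum(row[i] for row in crosstab.values())
    for i, level in enumerate(tmt_levels)
}
total = sum(lfq_counts.values())

# Samples with the same score in both files (only LFQ levels 0-3 can match).
concordant = sum(crosstab[level][i] for i, level in enumerate(tmt_levels))

print(lfq_counts)
print(tmt_counts)
print(total, concordant)
```

Both marginals sum to 160 matched samples, of which 84 carry the same score in both files. The 36 samples coded -1 in the LFQ traits have no counterpart level in the TMT traits.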

avanlinden commented 2 years ago

Eric's reply:

This overlooked CERAD discrepancy is troubling and should be addressed without compromising reproducibility of the published LFQ consensus analyses. See explanation in the email just forwarded to you and Mette (cc: Jim and Erik). I recommend keeping the Mirra 1991-based score and adding the updated plaque density-based CERAD, which is independent of cognition, to the traits for the 201 LFQ Banner cases.

I cannot see any Banner case IDs in the LFQ traits on the SynID you provided, but do see the 201 cases with their original batch_runNumber file ID. The Banner IDs, which were two numbers separated by a dash, were likely corrupted into date-formatted cells by Excel and then discarded. They had to be remapped to the file IDs so that the CERAD differences for the same Banner IDs are clear. Please rely on the censored traits for the same 201 case samples in the Nature Neurosci TMT Banner traits attached here, based on Tom Beach's February 2019 update of CERAD from the prior Mirra 1991 criteria-based scores. The green tab has the map of Banner ID to LFQ fileID to TMT batch.channel, along with both CERAD score versions (Mirra 1991 and Beach 2019).

Sincerely,

Eric

The files Eric attached contain some potential PHI, so I uploaded them to the Staging folder of the original Banner study here: https://www.synapse.org/#!Synapse:syn26403225.
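Eric's description of the lost IDs suggests a simple screen for this kind of damage. The following is a minimal sketch, not the check used for this study: it assumes an intact Banner ID is exactly two runs of digits joined by a dash, and the example cell values are invented for illustration.

```python
import re

# Assumption: an intact Banner case ID is two numbers separated by a dash.
BANNER_ID_PATTERN = re.compile(r"\d+-\d+")


def is_intact_banner_id(value: str) -> bool:
    """Return True if the value still looks like 'number-number'."""
    return BANNER_ID_PATTERN.fullmatch(value.strip()) is not None


# Hypothetical cell values: one intact ID and two date-like strings of the
# kind a spreadsheet might produce when it auto-converts such an ID.
cells = ["03-12", "12-Mar", "3/12/2003"]
suspect = [c for c in cells if not is_intact_banner_id(c)]
print(suspect)
```

Importing ID columns as text, rather than letting the spreadsheet infer a type, avoids the conversion in the first place.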

avanlinden commented 2 years ago

Jake identified three missing sample IDs from Eric's attached files that are not in the original LFQ metadata: b4_134_04, b4_007_23, and b3_041_03.

Eric responded:

It looks like 9 case samples per TMT batch x 22 Banner TMT batches = 198, which is 3 cases short of the 201 originally purchased, received, and run for LFQ proteomics dating back to 2014.

Tom Beach's sheet in response to Erik's questions in the forwarded email should have the 3 corrected/updated CERAD scores, however.
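Eric's arithmetic, and the kind of ID comparison that surfaces the three cases, can be sketched as follows. The three sample IDs are the ones Jake reported; the shared ID `b1_001_01` and both set names are placeholders rather than values from either file, and the direction of the comparison is only illustrative.

```python
# Batch arithmetic from Eric's reply.
cases_per_batch = 9
tmt_batches = 22
lfq_cases = 201
tmt_cases = cases_per_batch * tmt_batches
shortfall = lfq_cases - tmt_cases

# Comparing two ID lists with a set difference. "b1_001_01" is a
# placeholder; the other three IDs are the ones Jake reported.
ids_in_mapping = {"b1_001_01", "b4_134_04", "b4_007_23", "b3_041_03"}
ids_in_metadata = {"b1_001_01"}
unmatched = sorted(ids_in_mapping - ids_in_metadata)

print(tmt_cases, shortfall)
print(unmatched)
```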

avanlinden commented 2 years ago

Further information from Eric on the CERAD score changes:

Jake,

I confirm the discrepancy in CERAD 0-3 (previously 0, A, B, or C and corresponding literal key) for a number of the same 201 case samples from Banner Sun Health between the LFQ and the TMT traits for prefrontal cortex proteomics. I think the explanation you need dates back to the below February 2019 email from Tom Beach at Banner in response to our request to guarantee accuracy of the scale, and adaptive renumbering he performed at that time, and that we later used for the TMT, but did not correct/update in the LFQ traits. See below.

In a direct reply to your RFI, I will attach the full trait comparison with Banner IDs mapped to both LFQ and TMT batch runNumber/channel and corresponding CERAD. The discrepancies should make sense given the below logic.

Sorry we did not go back and amend the traits for the LFQ at the time.

May I suggest Mette, and the clinician scientists (Jim and Erik, cc:) confer on how best to address the LFQ traits? For reproducibility, the 1991 scoring used for correlations with the LFQ data should probably be retained, but displayed alongside the updated CERAD scores consistent with quantitative plaque density.

Sincerely,

Eric

The files he attached (PDF explaining CERAD scores and mapping file) are in the Staging folder: https://www.synapse.org/#!Synapse:syn26403225.
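Eric's suggestion of keeping the 1991 score next to the updated one amounts to joining the mapping file onto the LFQ traits by file ID. A minimal sketch, assuming hypothetical column names (`fileID`, `individualID`, `CERAD_Mirra1991`, `CERAD_Beach2019`) and made-up placeholder rows; none of these values come from the actual files.

```python
# Placeholder LFQ traits keyed by file ID (values are invented).
lfq_traits = [
    {"fileID": "b1_001_01", "CERAD": 1},
    {"fileID": "b1_002_02", "CERAD": 3},
]

# Placeholder mapping: file ID -> Banner ID and the updated score.
mapping = {
    "b1_001_01": {"individualID": "00-01", "CERAD_Beach2019": 2},
    "b1_002_02": {"individualID": "00-02", "CERAD_Beach2019": 3},
}


def add_updated_cerad(traits, id_map):
    """Keep the original score and add the updated one alongside it."""
    merged = []
    for row in traits:
        extra = id_map.get(row["fileID"], {})
        merged.append({
            "fileID": row["fileID"],
            "individualID": extra.get("individualID"),
            "CERAD_Mirra1991": row["CERAD"],
            "CERAD_Beach2019": extra.get("CERAD_Beach2019"),
        })
    return merged


merged_traits = add_updated_cerad(lfq_traits, mapping)
print(merged_traits[0])
```

Rows whose file ID is absent from the mapping keep their original score and get `None` for the new fields, so unmatched samples stay visible instead of being dropped.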

avanlinden commented 2 years ago

I saved the forwarded 2019 email thread from Erik Johnson and Thomas Beach explaining the CERAD changes as a PDF and uploaded it here (it is too long to paste into this issue): https://www.synapse.org/#!Synapse:syn26403241.