Sage-Bionetworks / cleanAD

Tools for cleaning and organizing study data for the AD Knowledge Portal.

MSBB #2

Closed Aryllen closed 3 years ago

Aryllen commented 4 years ago

Study folder: syn3159438

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

Metadata (within file)

Checks for each metadata file:

  • file exists
  • file name follows schema
  • contents follow current template - deprecate old versions, if needed
  • no duplicate individualID/specimenID as appropriate
  • follows data dictionary
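The duplicate-ID check for a metadata file could look roughly like the sketch below. It assumes the file has already been read into a list of dicts (for example with `csv.DictReader`). The function name and the toy rows are illustrative, not part of cleanAD.

```python
from collections import Counter


def duplicate_ids(rows, id_column):
    """Return the values of id_column that occur more than once, sorted."""
    counts = Counter(row[id_column] for row in rows if row.get(id_column))
    return sorted(value for value, n in counts.items() if n > 1)


# Toy rows standing in for a metadata file read with csv.DictReader.
example_rows = [
    {"specimenID": "S1", "individualID": "A"},
    {"specimenID": "S2", "individualID": "A"},
    {"specimenID": "S2", "individualID": "B"},
]

print(duplicate_ids(example_rows, "specimenID"))
```

Which column to check depends on the file ("as appropriate"): individualID should be unique in the individual metadata, but it can legitimately repeat in the biospecimen metadata, where specimenID is the column to check.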

Metadata (across files)

Annotations

Multispecimen Files

Check that specimenIDs in files match IDs in metadata
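A minimal sketch of this comparison, assuming the multispecimen file is delimited text whose header row holds one column per specimen after some leading annotation columns. The delimiter, the number of annotation columns, and both function names are assumptions for illustration.

```python
def header_ids(header_line, sep="\t", n_annotation_columns=1):
    """Take sample column names from a header line, skipping leading annotation columns."""
    return header_line.rstrip("\n").split(sep)[n_annotation_columns:]


def compare_ids(file_ids, metadata_ids):
    """Report IDs that appear on only one side of the comparison."""
    file_ids, metadata_ids = set(file_ids), set(metadata_ids)
    return {
        "in_file_not_metadata": sorted(file_ids - metadata_ids),
        "in_metadata_not_file": sorted(metadata_ids - file_ids),
    }


# Toy header and metadata IDs; real values would come from the Synapse files.
ids_in_file = header_ids("gene\tS1\tS3\n")
report = compare_ids(ids_in_file, ["S1", "S2"])
print(report)
```

An empty `in_file_not_metadata` list is the important result, since it means every specimen in the data can be traced back to the metadata.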

Wikis

Clinical data

Access (Human)

Portal

Metadata specifics

amapeters commented 4 years ago

Let's add the following to the list of things to check

Clinical data

  • [x] Braak and CERAD is available on donors with postmortem tissue
  • [ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor

Access (Human)

  • [ ] Genomic summary results are in 'Analysis' folder which does not have access control

Portal

  • [ ] Review content on the study card for accuracy
  • [ ] Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card
  • [ ] Related studies are linked
  • [ ] Study has an acknowledgement statement (wikis here)
  • [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study

Edited by Nicole: moved the quoted items to the main issue so that it counted it in the progress bar.

amapeters commented 4 years ago

To do specifics

Metadata that needs to be updated to current schema

  • [ ] Biospecimen - this file may need to be updated with specimens from WGS and Proteomics. Review across all assays
  • [x] RNAseq - syn6100548 needs to be converted to a RNAseq metadata file. Note the remap information. This is based on a sample identity QC they did remapping and excluding some samples. Meagan reviewed this, but we should double check that the biospecimen file maps the samples to the correct remapped individual
  • [x] WES - syn6101472. Same issue as for RNAseq
  • [x] WGS - syn11384608. Same issue as for RNAseq
  • [x] label free Proteomics - syn6100412

Edited by Nicole: moved the quoted items to the main issue so that it counted it in the progress bar.

Aryllen commented 4 years ago

New methylationArray metadata file uploaded to staging. Added missing column arrayBatch and rearranged columns to match template order. This also has Sample_Name column, which I believe is the ID used in the methylation data, but is mapped to specimenIDs. Will need to verify and perhaps make a note of this in the methods.

Update: The Sample_Name column is what is used in the methylation data. Users will need to get the specimenID by cross-reference with Sample_Name. If I remember correctly, I think the Sample_Names were projids, which is why we had them create a mapping to new specimenIDs in the first place.

@amapeters, @karawoo, what route will we take on these cases? We could change the multi-specimen file to use the specimenID or we could make a note in the methods section (or some other visible location) that the IDs in the data refer to Sample_Name.
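Either route relies on the same Sample_Name to specimenID lookup. A rough sketch, assuming the methylationArray metadata has been read into a list of dicts with `Sample_Name` and `specimenID` columns (the function names and toy values are illustrative):

```python
def build_crosswalk(metadata_rows):
    """Map Sample_Name to specimenID, failing if a name maps to two specimens."""
    crosswalk = {}
    for row in metadata_rows:
        name, specimen = row["Sample_Name"], row["specimenID"]
        if crosswalk.setdefault(name, specimen) != specimen:
            raise ValueError(f"Sample_Name {name!r} maps to more than one specimenID")
    return crosswalk


def translate(sample_names, crosswalk):
    """Translate data-file identifiers; names without a mapping come back as None."""
    return [crosswalk.get(name) for name in sample_names]


# Toy metadata rows; the real IDs live in the staged metadata file.
toy_metadata = [
    {"Sample_Name": "1001", "specimenID": "SPEC_A"},
    {"Sample_Name": "1002", "specimenID": "SPEC_B"},
]
crosswalk = build_crosswalk(toy_metadata)
print(translate(["1002", "1001", "9999"], crosswalk))
```

Any `None` in the result flags a data column with no specimenID, which is worth resolving before choosing between renaming the columns and documenting the mapping in the methods.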

Aryllen commented 4 years ago

@amapeters, I'm looking at the rnaSeq covariates file. I'm trying to understand the different variables mentioned on the wiki and figure out where this data should "live." A lot of it looks like it should be in the individual metadata file (have not checked, yet) and the biospecimen metadata file. However, I'm confused as to what should be done with the individualIdentifier and individualIdentifier.inferred. One appears to be the individualID that the sample should have been from and the other appears to be the individualID that analysis from the assay says it should have been from.

Which of these IDs is the one that should be associated with the specimenID in the biospecimen metadata?

Aryllen commented 4 years ago

Ignore this. Redid the label free mass spec proteomics and comments can be found in new reply to this.

Label free mass spec proteomics (new metadata file here):

- CaseID was in the biospecimen metadata (with the exception of CaseID 0) so I made CaseID = specimenID

- 18 specimens were in batches 7 and 8. I currently have 'batch' as comma separated for these since there is no other batch information.

- This file has some rows that were in the proteomics covariates file, but have no information other than a RunName (?), batch number, and a CaseID of 0. Not sure what we should do with this data. CaseID 0 is not a specimenID in the biospecimen metadata.

- platform and assay were taken from the assay description.

- Data that should have been in the individual metadata file was matched by individualIdentifier and checked for consistency. There is data in the covariates file that is not in the individual file, however (bbscore, PlaqueMean, NP1).

karawoo commented 4 years ago

Regarding individualID vs individualID.inferred -- I believe the inferred ones should be more reliable. If I recall correctly, these came from resequencing data and discovering that some samples had been mislabeled or otherwise mixed up (@amapeters might remember better).

karawoo commented 4 years ago

Per discussion with Nicole, I will take on the WGS data first

Aryllen commented 4 years ago

Kara and I discussed some of the issues with the proteomics data. I will need to take the RunName, separate off the last section and compare that to the biospecimen specimenId and annotations. The CaseId 0 data does have files, but they are annotated with a portion of RunName as the specimenId.

karawoo commented 4 years ago

Added WGS file here: https://www.synapse.org/#!Synapse:syn22360825

The original WGS covariates file (https://www.synapse.org/#!Synapse:syn11384608) has the sampleIdentifier column which I remapped to specimenID. Other than that, none of the original columns seemed relevant to the assay metadata (they were all individual- or biospecimen-level information). I added platform and assay based on the assay description.

Aryllen commented 4 years ago

Mette and I discussed this today.

She requested that we focus on WES, WGS, and RNAseq. This data was generated by Bin Zhang at Sinai (Ming Wui is data liaison). Make sure that the biospecimen metadata maps to the individualIdentifier.inferred. Double check that the covariates data matches the data in the individual/biospecimen metadata.

Aryllen commented 4 years ago

RNA Seq metadata

New rnaSeq metadata file uploaded here.

Notes:

Extra columns moved to metadata (outside the scope of the template requirements):

The specimenIDs in the biospecimen metadata appear to be matched with the individualIdentifier.inferred value with the exception of the following specimenIDs. The corresponding individualIdentifier.inferred values are "." in the covariates file, but the individualID listed is the individualIdentifier. Generally, this means the Action is "exclude" for this specimen's rnaSeq data.

BM_22_245_H154B394, BM_22_93_S113B355, hB_RNA_10432_K77C014, hB_RNA_11012, hB_RNA_12302, hB_RNA_12392_E007C014, hB_RNA_12744_L43C014, hB_RNA_13039_B82C014, hB_RNA_13320_P60C014, hB_RNA_13373, hB_RNA_13609_P60C014, hB_RNA_4782_L43C014, hB_RNA_4991, hB_RNA_5001, hB_RNA_7995_E007C014, hB_RNA_8255, hB_RNA_8475, hB_RNA_8515_K85C014, hB_RNA_8525_K85C014, hB_RNA_8855, hB_RNA_9140_K75rC014, hB_RNA_9190_E007C014, hB_RNA_9208_resequenced, hB_RNA_9226_K82C014
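The rule being checked here (use individualIdentifier.inferred, and fall back to individualIdentifier when the inferred value is "." or NA) can be sketched as below. The covariates column holding the specimen ID is an assumption (`sampleIdentifier`, borrowed from the WGS covariates file), and the function names are illustrative.

```python
MISSING = {"", ".", "NA"}


def resolve_individual(row):
    """Prefer individualIdentifier.inferred; fall back to individualIdentifier."""
    inferred = (row.get("individualIdentifier.inferred") or "").strip()
    if inferred not in MISSING:
        return inferred
    return (row.get("individualIdentifier") or "").strip()


def mismatched_specimens(covariate_rows, biospecimen_lookup, id_column="sampleIdentifier"):
    """List specimens whose biospecimen individualID differs from the resolved one."""
    mismatches = []
    for row in covariate_rows:
        specimen = row[id_column]
        if biospecimen_lookup.get(specimen) != resolve_individual(row):
            mismatches.append(specimen)
    return mismatches


# Toy covariates rows and a specimenID -> individualID lookup from the biospecimen file.
toy_covariates = [
    {"sampleIdentifier": "S1", "individualIdentifier": "A", "individualIdentifier.inferred": "B"},
    {"sampleIdentifier": "S2", "individualIdentifier": "C", "individualIdentifier.inferred": "."},
    {"sampleIdentifier": "S3", "individualIdentifier": "D", "individualIdentifier.inferred": "E"},
]
toy_biospecimen = {"S1": "B", "S2": "C", "S3": "D"}
print(mismatched_specimens(toy_covariates, toy_biospecimen))
```

In this toy data only S3 is flagged, because its biospecimen individualID still points at the original rather than the inferred individual.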

Gene counts

Raw and normalized count file concerns:

@amapeters, what do you suggest for these count file issues?

We also have both bam and fastq files for the raw data. I renamed the folder for now since it just said BAM.

Aryllen commented 4 years ago

WES metadata

New wes metadata file uploaded here.

Notes:

Extra columns moved to metadata (outside the scope of the template requirements):

The specimenIDs appear to be matched with the individualIdentifier.inferred as the individualID in biospecimen metadata. There are 4 exceptions where the inferred value is NA, in which case the individualID in biospecimen metadata is the individualIdentifier. These are:

BM_22_837, BM_22_912, BM_22_941, BM_22_956

There are two specimenIDs that are missing from the biospecimen metadata: hB_DNA_12775, BM_22_985. These have the Action of Exclude in the WES covariates file. I added these to the biospecimen metadata file. New biospecimen metadata file is here. Please use and create a new version of this new file for all future cleaning updates.

WES multispecimen files use the specimenIDs listed in the assay.

Aryllen commented 4 years ago

WGS multispecimen vcf files (checked chromosome 21) appear to use the specimenIDs in the assay metadata.

The following WGS specimens have "unknown" as the individualID in the biospecimen metadata.

71729, 71823, 71843, 71962, 76354, 76655

All of these except for 76354 have individualIdentifier values in the WGS covariates file. Updated the individualIDs in the biospecimen metadata file and uploaded new version. 76354 will have to stay as "unknown" for now.
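The update described above amounts to filling in "unknown" individualIDs from the covariates file and leaving the rest alone. A minimal sketch, assuming both files are already loaded; the individualID values in the toy lookup are placeholders, not real MSBB IDs.

```python
def fill_unknown_individuals(biospecimen_rows, covariates_lookup):
    """Replace individualID == 'unknown' where the covariates file has a value."""
    updated = []
    for row in biospecimen_rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        if new_row["individualID"] == "unknown":
            replacement = covariates_lookup.get(new_row["specimenID"])
            if replacement:
                new_row["individualID"] = replacement
        updated.append(new_row)
    return updated


# specimenID -> individualIdentifier taken from the covariates file (placeholder values).
toy_lookup = {"71729": "IND_1", "71823": "IND_2"}
toy_biospecimen = [
    {"specimenID": "71729", "individualID": "unknown"},
    {"specimenID": "76354", "individualID": "unknown"},
    {"specimenID": "70000", "individualID": "IND_9"},
]
print(fill_unknown_individuals(toy_biospecimen, toy_lookup))
```

Specimens with no covariates entry, like 76354 here, keep "unknown", and rows that already have an individualID are never overwritten.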

Aryllen commented 4 years ago

Label free proteomics metadata

Metadata file here.

Notes:

Most specimens were not in the biospecimen metadata. The missing samples were added and a new biospecimen metadata file uploaded. For these, the organ and tissue were gathered from the annotations on the proteomics data. The individualIDs were all in the individual metadata.

The following specimens in the biospecimen metadata file have "unknown" as the individualID and there is no individualIdentifier in the covariates file.

b1_bmgis_01, b1_bmgis_22, b1_bmgis_43, b2_bmgis_01, b2_bmgis_22, b2_bmgis_43, b3_bmgis_01, b3_bmgis_22, b3_bmgis_43, b4_bmgis_01, b4_bmgis_22, b4_bmgis_43, b5_bmgis_01, b5_bmgis_22, b5_bmgis_43, b6_bmgis_01, b6_bmgis_22, b6_bmgis_43, b7_bmgis_01, b7_bmgis_22, b7_bmgis_43, r2b7_bmgis_01

The proteomics covariates file has three columns of individual metadata that are not included in the actual individual metadata file: bbscore, NP1, and PlaqueMean. I have asked @amapeters if we should deprecate this information when we deprecate the covariates file or add extra columns to the individual metadata, even though we do not have this data for all patients.

The multispecimen file uses the form "Peptides specimenID". Do we want to remove the excess "Peptides " bit from these, @amapeters?
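If we do remove it, the change is a simple rename of the column headers. A sketch, assuming the header has been split into a list of column names (the non-specimen column name below is made up for the example):

```python
def strip_prefix(column_names, prefix="Peptides "):
    """Drop the prefix from column names that carry it; leave other names unchanged."""
    return [
        name[len(prefix):] if name.startswith(prefix) else name
        for name in column_names
    ]


# "Protein IDs" is a hypothetical annotation column; b1_bmgis_01 is a specimenID from above.
columns = ["Protein IDs", "Peptides b1_bmgis_01", "Peptides b1_bmgis_22"]
print(strip_prefix(columns))
```

The renamed headers can then be compared directly against the specimenIDs in the assay metadata.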

Aryllen commented 4 years ago

@karawoo, do you know how to open the TMT proteomics files (syn21347564) to see what they are using for identifying the specimens? The pepxml files are not opening for me (tried pepXMLtab from bioconductor so far), and I am not sure how to read the raw files at all. I tried googling for a second, but figured I would ask you before I go too deep in the weeds.

karawoo commented 4 years ago

I do not :(

Aryllen commented 4 years ago

individual metadata

Removed row with individualID = unknown. There was no other information in that row. Accidentally pushed new version to the original file. 🤦‍♀️ Also uploaded to the staging cleaning folder in case it needs future updating. Will pull/push to the staged version, instead.

Aryllen commented 4 years ago

biospecimen metadata

Reminder to me: All individuals have ageDeath, but the biospecimen metadata is blank in the isPostMortem column. Should add 'True'.
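A sketch of that fill, assuming both metadata files are loaded as lists of dicts and that a blank isPostMortem should only be set when the individual actually has an ageDeath value. The function name and toy rows are illustrative.

```python
def fill_is_post_mortem(biospecimen_rows, individual_rows):
    """Set blank isPostMortem to 'True' for specimens whose individual has an ageDeath."""
    has_age_death = {
        row["individualID"] for row in individual_rows if row.get("ageDeath")
    }
    updated = []
    for row in biospecimen_rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if not new_row.get("isPostMortem") and new_row.get("individualID") in has_age_death:
            new_row["isPostMortem"] = "True"
        updated.append(new_row)
    return updated


toy_individuals = [
    {"individualID": "A", "ageDeath": "85"},
    {"individualID": "B", "ageDeath": ""},
]
toy_specimens = [
    {"specimenID": "S1", "individualID": "A", "isPostMortem": ""},
    {"specimenID": "S2", "individualID": "B", "isPostMortem": ""},
    {"specimenID": "S3", "individualID": "unknown", "isPostMortem": ""},
]
print(fill_is_post_mortem(toy_specimens, toy_individuals))
```

Guarding on ageDeath keeps specimens with an "unknown" individualID blank rather than asserting something we cannot verify.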

Aryllen commented 4 years ago

TMT Proteomics annotations

Label free mass spec

Aryllen commented 4 years ago

RNA seq annotations

Aryllen commented 4 years ago

WES annotations

raw data

processed data

WGS annotations

Aryllen commented 4 years ago

Sent email to Bin and Minghui, with changes requested from Mette and slight reformatting.

Hi Bin and Minghui,

As you know, we have been working on getting older studies up to date on the new standards that we have for data in the AD Knowledge Portal. In general, this means ensuring the metadata matches our metadata templates, and checking that the specimen identifiers (specimenID) in files link back to the metadata.

Since MSBB is one of our most valuable datasets, we want to make sure it is usable for as many people as possible and can easily be harmonized with the other portal data. To complete this, we have a few questions.

  1. We noticed that the RNA sequencing counts files have a couple of issues. The header appears to have been shifted left, which makes the first specimen name the header for the Ensembl IDs. The specimenIDs also appear to be prepended with batch information, which is already in the metadata files. Would it be acceptable to you if we: shifted the header to the right by 1 (this would align the specimen names with the count columns), added 'ensemble id' as the first header value to signify what is in that column, and removed the prepended batch information from the specimen names (the specimenID would be leftover and could be matched to the batch information in the metadata files)?

  2. We have taken the covariate files and created new metadata files from these. Information related to individuals has been moved to the individual metadata file, while information related to biospecimens and their respective assays has been moved to the biospecimen and assay metadata files. These files will replace the covariate files currently in the metadata folder. The covariate files will still be accessible, but we will not surface them in the AD Knowledge Portal. Can you please check the metadata files for data that your group generated and verify that they appear correct? The metadata files are in a staging folder, but Minghui should have permissions to view them.

  3. Along with checking the metadata files, please note that we have kept the information about the QC remapping in the biospecimen file. Since the remapped specimens are now linked to the correct individual, can we remove the remapping information? We are a bit concerned that it will be confusing. We suggest leaving the 'Exclude' information in, but would like to add text explaining why you recommend they should be excluded. Can we state the following? "A specimen QC identified samples that could not be mapped to the expected individual. These have been indicated as 'Exclude' in the biospecimen metadata file".

Best,

Minghui's response:

Nicole, Great your team can reformat the data files.

  1. Regarding the RNA sequencing counts files, it is totally fine if you shift the header and add in a new column id. Batch information can be discarded from the sample id as long as such information is kept in the associated metadata file.
  2. I will look at the metadata file when I have a chance.
  3. Fine with me.

Minghui
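The counts-file changes agreed to in point 1 could be sketched as follows. This assumes a tab-delimited file whose header has one fewer field than the data rows. Because the exact batch-prefix format is not spelled out here, the sketch matches each header name against the known specimenIDs by suffix instead of parsing the prefix. The function name and the toy prefix "B1." are hypothetical.

```python
def fix_counts_header(lines, specimen_ids, id_label="ensemble id", sep="\t"):
    """Shift a left-shifted header right by one and strip prepended batch information."""
    header = lines[0].rstrip("\n").split(sep)
    first_row = lines[1].rstrip("\n").split(sep)
    if len(header) == len(first_row) - 1:
        # Header is one field short: add a label for the gene ID column.
        header = [id_label] + header
    # Try longer specimenIDs first so a short ID cannot shadow a longer one.
    known = sorted(specimen_ids, key=len, reverse=True)

    def clean(name):
        for specimen_id in known:
            if name.endswith(specimen_id):
                return specimen_id
        return name  # leave unrecognized names alone for manual review

    fixed = [header[0]] + [clean(name) for name in header[1:]]
    return [sep.join(fixed) + "\n"] + list(lines[1:])


# Toy file: two specimens with a made-up "B1."/"B2." batch prefix.
toy_lines = ["B1.S1\tB2.S2\n", "ENSG1\t5\t7\n"]
print(fix_counts_header(toy_lines, {"S1", "S2"}))
```

Names that do not end in a known specimenID are passed through unchanged, so they show up in a later ID comparison instead of being silently altered.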

Biospecimen file has been updated to remove all "Action"/remapping values except for "Exclude" and individualIDs for the excluded samples have been removed. Moved the metadata wiki info to the staging folder since it might be confusing if people stumbled on that before we updated things. Changed note about the "Action" column to be the information about some being excluded as Mette suggested.
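For the record, the biospecimen change amounts to something like the sketch below: keep "Exclude" in the Action column, blank every other Action value, and clear the individualID on excluded rows. The "Remap" value in the toy data is a stand-in for whatever remapping labels the column held.

```python
def clean_action(rows):
    """Keep only 'Exclude' in Action and clear individualID for excluded specimens."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if (new_row.get("Action") or "").strip().lower() == "exclude":
            new_row["Action"] = "Exclude"
            new_row["individualID"] = ""
        else:
            new_row["Action"] = ""
        cleaned.append(new_row)
    return cleaned


toy_rows = [
    {"specimenID": "S1", "individualID": "A", "Action": "Exclude"},
    {"specimenID": "S2", "individualID": "B", "Action": "Remap"},
    {"specimenID": "S3", "individualID": "C", "Action": ""},
]
print(clean_action(toy_rows))
```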

Aryllen commented 4 years ago

RNA seq count files have been updated and uploaded to a staging folder (syn22988362):

Aryllen commented 4 years ago

Proteomics metadata

TMT quantitation metadata

Aryllen commented 4 years ago

Updated study in portal studies table.

Wiki layouts

Analysis:

WGS:

Proteomics:

TMT Quantitation:

Annotations

I have already checked the annotations, but we have since updated the requirements to prepare for creating a schema. We should use the schema (when it's ready) to audit these annotations again.

Aryllen commented 4 years ago

Update on MulticonsensusStudyFileID: This definition is tied to the use of a specific program. They did not use it and do not have this information. This column should be empty in the TMT metadata.

Aryllen commented 4 years ago

Email sent to Duc to check over the proteomics metadata.

Have not heard a 'go-ahead' from Minghui regarding the rest of the metadata, yet.

Aryllen commented 4 years ago

INPP5D (in the staging folder) is a different study that is from a grant we support. Need to move it into its own study folder. @amapeters to follow up with Minghui on when this should be published. Will need to verify that it fits our model (metadata, descriptions, folder structure, etc). Created issue in AD-DCC to track (https://github.com/Sage-Bionetworks/AD-DCC/issues/692). Update: moved out of MSBB staging.

Aryllen commented 4 years ago

Minghui sent corrected biospecimen and RNAseq metadata files. The corrections in the RNAseq metadata are things that I wouldn't know how to verify (sequencingBatch, RIN, etc). It seems like Minghui ended up shuffling the rows. The biospecimen metadata is a problem in that Minghui removed 290 rows, all of which were for proteomics data except for 1. Again, it looks like there was also a shuffling of rows. I am not sure what exactly was changed in each row, yet, but did respond to Minghui asking for a reason why he removed the data.

Update: Minghui claims that these were duplicates, but it's specimens used by the proteomics data. I said I would take a closer look, but will most likely need to tell Minghui, 'thanks for the update, but we are keeping them'.
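Since the rows were shuffled, comparing the two versions by key rather than by row position is the easiest way to see what changed. A sketch, assuming specimenID identifies a row (the `key` parameter and the toy columns are illustrative):

```python
def diff_by_key(old_rows, new_rows, key="specimenID"):
    """Compare two versions of a metadata file independent of row order."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = {}
    for k in sorted(set(old) & set(new)):
        columns = sorted(
            c for c in set(old[k]) | set(new[k]) if old[k].get(c) != new[k].get(c)
        )
        if columns:
            changed[k] = columns  # names of columns whose values differ
    return removed, added, changed


toy_old = [
    {"specimenID": "S1", "tissue": "x"},
    {"specimenID": "S2", "tissue": "y"},
    {"specimenID": "S3", "tissue": "z"},
]
toy_new = [
    {"specimenID": "S3", "tissue": "z"},
    {"specimenID": "S1", "tissue": "w"},
]
print(diff_by_key(toy_old, toy_new))
```

One caveat: rows that share a specimenID collapse to a single entry in this comparison, so a duplicate-ID check on each version should come first, especially given the claim that the removed rows were duplicates.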

amapeters commented 4 years ago

I reviewed the MSBB biospecimen metadata file: https://www.synapse.org/#!Synapse:syn22453847

amapeters commented 4 years ago

Hi Duc,

We are doing some cleanup of our high value studies, including MSBB. We noticed that the sample identifier in the protein output file is flipped compared to the identifier on the .raw files (which we use as the specimen ID). For example, MSBB_Proteomics_PFC_RAW_b1_1497_21.raw ('b1_1497_21') is labeled as 'b1_21_1497' in the protein output file. Do you mind if we update the protein output file with the raw file ids? We will provide a new version (ie, not delete the current file) and provide an explanation in our release notes.

Best, Mette

amapeters commented 4 years ago

Hi Mette,

Thanks for doing these cleanup steps. It really helps as we are sometimes blinded to these discrepancies. Please go ahead with the name change.

Best, Duc
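A rough sketch of the rename, based only on the single example in the email ('b1_21_1497' in the output file versus 'b1_1497_21' from the raw file name). It assumes the flip is always a swap of the last two underscore-separated fields and only accepts a swapped ID if it matches a raw file. The function names and the default prefix/suffix are assumptions drawn from that one example.

```python
def raw_specimen_id(raw_name, prefix="MSBB_Proteomics_PFC_RAW_", suffix=".raw"):
    """Extract the specimen ID from a raw file name with the given prefix and suffix."""
    if not (raw_name.startswith(prefix) and raw_name.endswith(suffix)):
        raise ValueError(f"unexpected raw file name: {raw_name!r}")
    return raw_name[len(prefix):-len(suffix)]


def swap_last_two(identifier, sep="_"):
    """Swap the last two separator-delimited fields of an identifier."""
    parts = identifier.split(sep)
    if len(parts) < 3:
        raise ValueError(f"cannot swap fields in {identifier!r}")
    parts[-1], parts[-2] = parts[-2], parts[-1]
    return sep.join(parts)


def relabel(output_ids, raw_ids):
    """Map output-file IDs to raw-file IDs; None marks IDs needing manual review."""
    raw_ids = set(raw_ids)
    mapping = {}
    for old in output_ids:
        if old in raw_ids:
            mapping[old] = old  # already matches a raw file
            continue
        swapped = swap_last_two(old)
        mapping[old] = swapped if swapped in raw_ids else None
    return mapping


raw_ids = [raw_specimen_id("MSBB_Proteomics_PFC_RAW_b1_1497_21.raw")]
print(relabel(["b1_21_1497"], raw_ids))
```

Checking each swapped ID against the raw file names avoids blindly flipping identifiers that are already in the right order.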

Aryllen commented 4 years ago

Proteomics

Biospecimen

Aryllen commented 4 years ago

Ready to deprecate old covariate files and move metadata to public.

Aryllen commented 4 years ago

@amapeters, I checked and the specimenID in question IS in the assay metadata. However, it exactly matches the information for hB_RNA_10892_K77C014. I will ask Minghui if the data should be annotated with the longer specimenID (the one in the biospecimen metadata).

Aryllen commented 4 years ago

There was a discussion thread where Minghui clarified the issue. I updated the names of the files and the annotations to match the batch information, as specified by Minghui.

Aryllen commented 4 years ago

This is 'technically' finished. Final touches would be to:

Aryllen commented 3 years ago

Made a few updates based on Mette's feedback.

Closing because this is now done.