Sage-Bionetworks / cleanAD

Tools for cleaning and organizing study data for the AD Knowledge Portal.

ROSMAP #4

Open Aryllen opened 3 years ago

Aryllen commented 3 years ago

Study folder: syn3219045

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

Metadata (within file)

Checks for each metadata file:

- file exists
- file name follows schema
- contents follow current template (deprecate old versions, if needed)
- no duplicate individualID/specimenID, as appropriate
- follows data dictionary
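The duplicate-ID check listed above could be sketched roughly as follows. This is a minimal illustration in plain Python, not code from the cleanAD package; the function name `find_duplicate_ids` and the example rows are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(rows, id_column):
    """Return a dict of ID -> count for IDs that occur more than once."""
    # Blank or missing IDs are skipped here; they are a separate problem.
    counts = Counter(row[id_column] for row in rows if row.get(id_column))
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical rows standing in for one assay metadata file.
assay_rows = [
    {"specimenID": "S1", "assay": "rnaSeq"},
    {"specimenID": "S2", "assay": "rnaSeq"},
    {"specimenID": "S1", "assay": "rnaSeq"},
]

duplicates = find_duplicate_ids(assay_rows, "specimenID")
print(duplicates)
```

The same function would work for individualID by passing a different `id_column`.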

Metadata (across files)

Annotations

Multispecimen Files

Check that specimenIDs in files match IDs in metadata.
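One way to express this check is a two-way set comparison. The sketch below assumes the IDs have already been read out of the multispecimen file and the metadata file; `compare_ids` and the sample IDs are hypothetical names used only for illustration.

```python
def compare_ids(file_ids, metadata_ids):
    """Report IDs present on one side but not the other."""
    file_ids, metadata_ids = set(file_ids), set(metadata_ids)
    return {
        "missing_from_metadata": sorted(file_ids - metadata_ids),
        "missing_from_file": sorted(metadata_ids - file_ids),
    }


# Hypothetical IDs: column headers of a multispecimen file vs. metadata rows.
report = compare_ids(["S1", "S2", "S4"], ["S1", "S2", "S3"])
print(report)
```

Both lists being empty would mean the file and the metadata agree.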

Wikis

Clinical data

Access (Human)

Portal

Aryllen commented 3 years ago

Notes:

Aryllen commented 3 years ago

Question: Is this metabolomics folder supposed to be empty? Can it be deleted?

Question: Imaging metadata? Answer: Don't add.

Updates:

rnaSeq

Aryllen commented 3 years ago

Removed tasks related to metabolomics. This assay will be moved out of the main ROSMAP study and will not count toward 'clean' completion.

Aryllen commented 3 years ago

rnaSeq

Problem specimen in RNAseq: 492_120515 from batch 1

Fixed duplicates in rnaSeq metadata. The problem specimen, above, has all values in its column for now and is in the first row of the file. Uploaded to cleaning folder here.

ChIPseq

Problem specimen: 11464261

Updates:

Multispecimen files (will most likely not be using specimenIDs since these would be RIDs when the 'clean' metadata is uploaded):

Aryllen commented 3 years ago

TMT proteomics

scrnaSeq

snpArray

WGS

methylationArray

miRNAcounts

rnaArray

Aryllen commented 3 years ago

label free proteomics

Aryllen commented 3 years ago

General notes:

Aryllen commented 3 years ago

Mapping IDs to projids

Aryllen commented 3 years ago

biospecimen

This is a mess... There are more specimens in the ROSMAP biospecimen file than there are specimens in all of the assay metadata files (even without having all assay specimens), meaning there are either too many duplicates in the biospecimen file OR there are missing specimens in the assay metadata files.

ROSMAP_biospecimen_metadata_combined has all assay specimens. The number of specimens matches the total number of specimens in these assay files, with one exception for the single 'control' in proteomics.

Aryllen commented 3 years ago

Stuff I did yesterday, but didn't click the 'comment' button on:

Some of this needs to be fixed. Namely, I confused the microglia scrnaSeq with the bulk microglia rnaSeq. This data needs to be moved to the rnaSeq metadata and the biospecimen assay column updated with the correct term.

Other update done today:

Aryllen commented 3 years ago

Question: I have 'excludeReason' in biospecimen. Should I also add the boolean 'exclude'?

yes and done

Question: This is related to Jake's concern. I was thinking this would be a big problem, but it is somewhat less so. The idea is that it could be hard to get the exact metadata set desired. For example, we have multiple sets of rnaSeq assays. We can join the metadata files by filtering biospecimen to just rnaSeq assay rows. However, that gives a bigger dataset than what was used in just one of those subsets (microglia, for example). According to Jake, bioinformatics professionals may not be great at joins or cleaning. Being able to filter to the exact subset of data would potentially also help with reproducibility/transparency. My question is: how much work do we want to do for the data users?

Probably best handled with an R package.
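For illustration only, the join described in the question can be sketched as below. It is written in plain Python rather than R, and it is not the proposed package: `join_assay_metadata`, the `platform` column, and the example rows are hypothetical. The point it shows is that an inner join against one subset's own assay metadata file narrows the rnaSeq rows to that subset.

```python
def join_assay_metadata(biospecimen_rows, assay_rows, assay):
    """Inner-join biospecimen rows for one assay to an assay metadata file."""
    assay_by_specimen = {row["specimenID"]: row for row in assay_rows}
    joined = []
    for bio in biospecimen_rows:
        if bio["assay"] != assay:
            continue  # keep only rows for the requested assay
        match = assay_by_specimen.get(bio["specimenID"])
        if match is not None:
            joined.append({**bio, **match})
    return joined


# Hypothetical data: two rnaSeq specimens, only one in the microglia subset.
biospecimen = [
    {"specimenID": "A", "individualID": "I1", "assay": "rnaSeq"},
    {"specimenID": "B", "individualID": "I2", "assay": "rnaSeq"},
    {"specimenID": "C", "individualID": "I3", "assay": "scrnaSeq"},
]
microglia_assay_metadata = [{"specimenID": "B", "platform": "hypothetical"}]

microglia_subset = join_assay_metadata(biospecimen, microglia_assay_metadata, "rnaSeq")
print(microglia_subset)
```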

Question: Mette mentioned that there appears to be a duplication issue with dlpfc scRNAseq. I'm not following. These seem unique to me.

Misunderstanding. This is fine.

Question: For the FACS sorted bulk cell rnaSeq, I added _1 and _2 to the specimenID. The reasoning is in a comment above. This needs to be approved or improved before I change the annotations on these files.

Unnecessary. Just remove _1 and _2 and have the 10 unique specimenIDs. The users can determine lane by _1 and _2 in filenames.

Question: How should I be reading the WGS sample swap file? Path forward? Same question regarding "duplicate" file.

Pull in reasons for excluding, GQN, tissue and organ. Check that there are 17 individuals in our dataset with 2 samples each (there are).
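The count check in the answer above could look something like this. It is a sketch with toy rows, not the real WGS metadata; `individuals_with_n_samples` is a hypothetical helper name.

```python
from collections import Counter


def individuals_with_n_samples(rows, n):
    """Return the sorted individualIDs that have exactly n rows."""
    counts = Counter(row["individualID"] for row in rows)
    return sorted(ind for ind, c in counts.items() if c == n)


# Toy rows; the real check would run over the WGS specimens.
wgs_rows = [
    {"individualID": "I1", "specimenID": "S1"},
    {"individualID": "I1", "specimenID": "S2"},
    {"individualID": "I2", "specimenID": "S3"},
    {"individualID": "I3", "specimenID": "S4"},
    {"individualID": "I3", "specimenID": "S5"},
]

paired = individuals_with_n_samples(wgs_rows, 2)
print(paired)
```

On the real data, the expectation stated above would correspond to `len(paired) == 17`.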

Question: There was a question asked about the one specimen with multiple values in the rnaSeq metadata. This was mentioned in a previous comment above, but this will need to be cleared up with whoever is responsible for that data. It is unknown if that sample was run 3 times or if it was accidentally entered 3 times with different values.

Need to ask Yan Li about this.

Question: Do we even want to mess with multispecimen files at all? Many that I have seen use the projid, which can be found via metadata. The 'annoyance' with these is leading 0s, which is something Jake also mentioned. But overall, there seems to be hesitation about changing ROSMAP data at all, so should we consider multispecimen files 'clean'?

Nope. Leave them be.

Question: Can I delete this empty folder? Was there supposed to have been data here?

Deleted.

Question: What's with this Staging folder in Proteomics (SRM)?

Move to deprecated.

Question: May I update wiki links for the portal as I finish updating wikis (same question for Mayo), or does there need to be an approval process? The updates are only formatting and merging wikis that should be together.

Yes. Follow guide in our 1:1 notes for merging wikis and order in Portal.

Aryllen commented 3 years ago

Annotations

NOTE: I realized that this would be simpler to do once we had all the metadata approved rather than before. Stopped after checking a couple of folders in WGS. Will need to come back to this.

WGS

Aryllen commented 3 years ago

While the checkboxes in the main issue are items that should be completed once we get metadata confirmation, I am adding general reminders here.

Notes:

Aryllen commented 3 years ago

Had a meeting with Mette, Abby, and Yan. Yan said there was probably no one there that could check out the metadata and verify that it was good. He mentioned that our metadata was probably better than what they could provide anyway. With this information, we are going to release the new metadata files.

There is one outstanding issue in the rnaSeq metadata where one specimen has 3 batches. Still need to determine which batch it should be in.

Released:

Updated naming on TMT quantitation.

Covariates to deprecate (?):

avanlinden commented 3 years ago

@Aryllen when you get a minute can you give me edit privileges on this repo? That way I can edit your original comment to check off boxes and such. Thanks!

avanlinden commented 3 years ago

Annotations

avanlinden commented 3 years ago

I've been checking through the updated biospecimen and assay metadata to make sure I have all the info I will need to update annotations. There are a few studies that look good and are ready to go, and a few that I have questions on. Questions and issues for each set of data are outlined here in this doc.

In the meantime I'll start annotating the RNAseq and scRNAseq files.

avanlinden commented 3 years ago

@Aryllen There are ~190 specimens in the updated biospecimen metadata file that are missing individual IDs but do NOT have an exclude = true tag and are NOT pooled samples... I read through all your previous notes but couldn't find anything about this many missing individualIDs. Here's the breakdown by assay: (screenshot, 2021-03-04)

Are these just missing? Do I need to try to find these individualIDs somewhere?
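A breakdown like the one described above could be produced along these lines. This is a sketch, not the code used for the screenshot: the `pooled` flag is an assumed column (the thread does not say how pooled samples are marked), and `missing_individual_ids_by_assay` and the rows are hypothetical.

```python
from collections import Counter


def missing_individual_ids_by_assay(rows):
    """Count rows lacking individualID that are neither excluded nor pooled."""
    counts = Counter()
    for row in rows:
        if row.get("individualID"):
            continue  # has an ID, nothing to report
        if row.get("exclude") or row.get("pooled"):
            continue  # a missing ID is expected for these rows
        counts[row["assay"]] += 1
    return dict(counts)


# Hypothetical biospecimen rows; "pooled" is an assumed boolean column.
biospecimen_rows = [
    {"specimenID": "S1", "individualID": "", "assay": "rnaSeq", "exclude": False, "pooled": False},
    {"specimenID": "S2", "individualID": None, "assay": "rnaSeq", "exclude": True, "pooled": False},
    {"specimenID": "S3", "individualID": "", "assay": "TMT proteomics", "exclude": False, "pooled": True},
    {"specimenID": "S4", "individualID": "I4", "assay": "WGS", "exclude": False, "pooled": False},
    {"specimenID": "S5", "individualID": "", "assay": "WGS", "exclude": False, "pooled": False},
]

breakdown = missing_individual_ids_by_assay(biospecimen_rows)
print(breakdown)
```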

avanlinden commented 3 years ago

Notes on bulk RNAseq annotations:

To do:

Aryllen commented 3 years ago

@avanlinden, I think I mistyped a comment in my notes way up above there, which probably contributed to missing this. Sorry! I believe the solution is to check out this deprecated file. This is most likely a leading 0 problem on projid. The other deprecated covariate files are in that same area ([deprecated ROSMAP](https://www.synapse.org/#!Synapse:syn20682034)).
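If the mismatch really is a leading 0 problem, one possible fix is to normalize projid to a zero-padded string on both sides before joining. The sketch below assumes an 8-character width, which is an assumption and should be checked against the actual files; `normalize_projid` is a hypothetical helper.

```python
def normalize_projid(projid, width=8):
    """Return projid as a zero-padded string (width is an assumed value)."""
    return str(projid).strip().zfill(width)


# A projid read as a number loses its leading zero; padding restores it.
from_numeric_column = normalize_projid(1234567)
from_text_column = normalize_projid("01234567")
print(from_numeric_column == from_text_column)
```

Applying the same normalization to both files keeps the join key consistent regardless of how each file was read.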

avanlinden commented 3 years ago

@Aryllen Oh yep, those are them. Thank you! I will get them joined up just for completeness' sake and upload a new version.

avanlinden commented 3 years ago

Bulk RNAseq annotations are as complete as I can get them:

Remaining issues:

Moving on to another assay.

avanlinden commented 3 years ago

rnaArray annotations are as complete as currently possible:

avanlinden commented 3 years ago

confocal imaging annotation updates are done:

avanlinden commented 3 years ago

scrnaSeq annotations are done, with one remaining question about diagnosis:

remaining issues: