Sage-Bionetworks / dccmonitor

Allows for monitoring of data uploaded via the dccvalidator application. Functions for getting information on the uploaded files, metadata validation status, and more.
https://sage-bionetworks.github.io/dccmonitor
Other
2 stars 4 forks source link

bug - AD dccmonitor checks out of sync with dccvalidator checks #123

Closed avanlinden closed 2 years ago

avanlinden commented 2 years ago

In reviewing metadata validation efforts for the MC-CAA study in AD, I noticed that dccmonitor is not flagging errors that are being caught in dccvalidator.

We have a contributor who tried validating individual and biospecimen files for an existing study that were missing columns and rows included in the current versions of those files. dccvalidator correctly flagged those errors, but dccmonitor did not for the exact same files(screenshots attached). Need to investigate where dccmonitor is out of sync and fix it.

dccmonitor-checks-MCCAA dccvalidator-checks-MCCAA

@kelshmo @danlu1 Have you noticed any issues with the PEC dccmonitor app?

danlu1 commented 2 years ago

I don't personally run into this issue. But I remember Juliane commented that she didn't see the failures mentioned by the data contributor on her end. I guess this is an issue we need to fix. [image: image.png] Dan Lu(she/her) Bioinformatics Engineer/Analyst, Sage Bionetworks Cell: 206-601-6116 http://sagebionetworks.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsagebionetworks.org%2F&data=04%7C01%7C%7C14a199bea8bc427c118a08d89b278d60%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C1%7C637429941506870273%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=6EoKb58z5l8vGNR2%2B%2BOQFYDUjjOMdGP6r%2BBhvNcJyx8%3D&reserved=0

On Tue, Oct 26, 2021 at 10:46 AM Abby Vander Linden < @.***> wrote:

In reviewing metadata validation efforts for the MC-CAA study in AD, I noticed that dccmonitor is not flagging errors that are being caught in dccvalidator.

We have a contributor who tried validating individual and biospecimen files for an existing study that were missing columns and rows included in the current versions of those files. dccvalidator correctly flagged those errors, but dccmonitor did not for the exact same files(screenshots attached). Need to investigate where dccmonitor is out of sync and fix it.

[image: dccmonitor-checks-MCCAA] https://user-images.githubusercontent.com/11965371/138932664-db811d4a-4a73-4d3e-9179-8e62889ff512.png

[image: dccvalidator-checks-MCCAA] https://user-images.githubusercontent.com/11965371/138932667-847650a8-24f1-4332-9fd2-65bf2bea01e3.png

@kelshmo https://github.com/kelshmo @danlu1 https://github.com/danlu1 Have you noticed any issues with the PEC dccmonitor app?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Sage-Bionetworks/dccmonitor/issues/123, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVUKVVNJAC7OSAEULM5YS43UI3SQXANCNFSM5GYMKH4Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

danlu1 commented 2 years ago

@avanlinden I will meet with Anthony to discuss this. Just to get more info on this. Have you tried to run dccmonitor in incognito window? And how it goes?

avanlinden commented 2 years ago

@danlu1 I did try running dccmonitor in an incognito and it's still missing the errors flagged by dccvalidator.

In the process of trying to reproduce this error, I did some tests with files from several studies and encountered a rat's nest of other bugs in dccvalidator (e.g. metadata files for existing studies being incorrect, errors for missing columns that do not exist in the template, etc). I can't even appropriately isolate this error because there are too many others at the same time.

I think we need to do some more comprehensive testing with demo files to try and document exactly which checks aren't working and why. :(

avanlinden commented 2 years ago

Screenshot of the dccvalidator errors for the APOEPSC study, which I was using to test:

APOEPSC-dccvalidator-errors

The individual metadata file currently in the study only has two rows; I created a test version where I deleted one of the rows to see if dccvalidator flagged it. It should have told me that one individualID that was previously part of the study was missing; instead it told me that 9 other individualIDs that were previously part of the study were missing. But it should be checking against the current individual file, which honest to goodness only has two individualIDs total.

The files I used to test are in my validation folder here, they're the ones labelled "APOEPSC": https://www.synapse.org/#!Synapse:syn25993502. It's a stem cell study so there is no PHI or sensitive data.

I can do a more detailed debug on this next week.

afwillia commented 2 years ago

@avanlinden It seems that dccvalidator is using the AMP-AD Specimen table as the "truth" for all individualIDs, not the previously uploaded individual metadata file. Does that sound right to you?

afwillia commented 2 years ago

@danlu1 @avanlinden also, I'm getting denied trying to push a branch with some updates. Can you give me access?

danlu1 commented 2 years ago

@danlu1 @avanlinden also, I'm getting denied trying to push a branch with some updates. Can you give me access?

on it.

afwillia commented 2 years ago

Thanks!

On Thu, Jan 20, 2022 at 2:25 PM Dan Lu @.***> wrote:

@danlu1 https://github.com/danlu1 @avanlinden https://github.com/avanlinden also, I'm getting denied trying to push a branch with some updates. Can you give me access?

on it.

— Reply to this email directly, view it on GitHub https://github.com/Sage-Bionetworks/dccmonitor/issues/123#issuecomment-1017980562, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6Q2D3QGRDGR3LFVHVXXS3UXCDULANCNFSM5GYMKH4Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

afwillia commented 2 years ago

As for the original issue @avanlinden raised with the MC-CAA immunoassay data, I think I figured out why dccmonitor is not flagging the issues dccvalidator did.

  1. "Missing columns in the biospecimen metadata file" dccmonitor differs slightly from dccvalidator in how it finds the template dataset for each biospecimen type in the manifest. In this case, columns with "NaN" were not being turned into true NA's and omitted, causing dccmonitor to skip that particular check. I pushed a preliminary fix for this, but should probably add more testing.

  2. "Some individualIDs/specimenIDs missing from the individual/biospecimen metadata" dccvalidator::check_all() has an argument for the specimen table AMP-AD specimen table which is used to identify all of the individualIDs and specimenIDs for a study. dccmonitor does not pass a sample_table to check_all(), so that always gets skipped. I will need to figure out how to add the sample table to dccmonitor to make it consistent with dccvalidator

avanlinden commented 2 years ago

@avanlinden It seems that dccvalidator is using the AMP-AD Specimen table as the "truth" for all individualIDs, not the previously uploaded individual metadata file. Does that sound right to you?

Oh you're totally right, I forgot about that. There have been issues updating the specimen table (which I thought I had solved, but maybe not). I've been trying to hold everything together with scotch tape and chewing gum since Nicole left so clearly some things are falling through the cracks...

avanlinden commented 2 years ago

@afwillia the code that generates/updates the Specimen Table is here: https://github.com/Sage-Bionetworks/cleanAD/blob/master/inst/scripts/generate-table-ad.R. There's a docker image for that repository, and there's a cron job in a service catalog EC2 that's supposed to pull the docker image weekly and run the script to update the table, but the cron job hasn't worked since August for no good reason that I can find so I just run it manually when we get new specimens.

The latest thing to go wrong with the specimen table is that somehow we got rows with duplicate individualID, specimenID, and assay, but one of the assay values is JSON string formatted for a multi-value annotation (e.g. "CITEseq" and "["CITEseq"]", example here. I haven't even had time to dig into that one.

afwillia commented 2 years ago

@afwillia the code that generates/updates the Specimen Table is here: https://github.com/Sage-Bionetworks/cleanAD/blob/master/inst/scripts/generate-table-ad.R. There's a docker image for that repository, and there's a cron job in a service catalog EC2 that's supposed to pull the docker image weekly and run the script to update the table, but the cron job hasn't worked since August for no good reason that I can find so I just run it manually when we get new specimens.

Thanks @avanlinden you read my mind on how to update the specimen table. It seems to me that there are two different issues that might be worth splitting up.

The first is the dccvalidator/dccmonitor sync that our collaborator discovered, which is this current one. I think I've finally nailed the scope and issues for that mismatch and can possibly push a fix next week.

The second issue is keeping the specimen collection table up to date. I have no idea what this entails yet, but will read up on it. I'm also not sure if it's the source of those other dccvalidator errors in the original issue.

Possible third issue are the bugs and discrepancies you encountered trying to reproduce the APOEPSC study last week. If so, maybe fixing 1 and 2 will resolve this. Maybe not.

So moving forward I propose closing this issue once I push a fix for the original sync issue. Then opening another issue for this specimen table sync problem, which will inform a possible third issue.

How does all that sound to you? A lot of this is just to make sure I've understood all that's happening. I think this all ties into #https://github.com/Sage-Bionetworks/dccvalidator/issues/524 @avanlinden @danlu1 @ychae @milen-sage

danlu1 commented 2 years ago

@afwillia AD doesn't have sample_table in dccvaliator config file. It should be samples_table: "syn21578908" as dccmonitor.

afwillia commented 2 years ago

@afwillia AD doesn't have sample_table in dccvaliator config file. It should be samples_table: "syn21578908" as dccmonitor.

Interesting, it looks like the dccvalidator was using this same sample_table from the default config, even though amp-ad was specified in app.R.

avanlinden commented 2 years ago

@afwillia Your approach outlined above sounds good! Thank you for your help here.